AES-256 (Rijndael) Block cipher


Rijndael is a symmetric block cipher designed by Joan Daemen and Vincent Rijmen.

It was published in 1999 and selected by NIST to replace DES in 2001.

It processes 128-bit blocks of data, uses key sizes of 128/192/256-bits with 10 to 14 rounds of encryption depending on key size.

This is neither a tutorial on AES nor implementation written for performance. It is intentionally optimized for size so if you require a version optimized for speed, please consider using OpenSSL or PolarSSL.

Byte oriented implementations have been written by Karl Malbrain, Brian Gladman, Llya Levin, the latter of which had contributions from Hal Finney

Apart from NIST specifications, there’s a good tutorial on AES by Sam Trenholme

The x86/x64 implementations were a joint effort between Peter Ferrie and myself. They only perform AES-256 although wouldn’t be difficult to modify for AES-128 or AES-192 and certainly could become smaller in the future.

Structures used

sbox/sbox_inv arrays are part of the AES context shown below, but they’re only used by another version that generates the sboxes. You can view aes_tbls.c for more details as they are not used in assembly version. The table free version ended up being smaller.

#define Nk 8      // number of words for key
#define Nr 14     // number of rounds for 256-bit
#define Nb 4      // number of words in each block

#define AES_ENCRYPT 1
#define AES_DECRYPT 0

typedef union _aes_blk_t {
  uint8_t   v8[Nb*4];
  uint16_t v16[Nb*2];
  uint32_t v32[Nb];
  uint64_t v64[Nb/2];
} aes_blk;

typedef struct _aes_ctx_t {
  uint32_t w[Nb*(Nr+1)];
  uint8_t sbox[256];
  uint8_t sbox_inv[256];
} aes_ctx;

Galois Multiplication

Sam provides a good explanation for this function. It is used mostly for MixColumns and SubBytes functions.

Essentially we’re working with values in a finite field. Elliptic Curve Multiplication involves addition.
This function has a finite field of 8-bits. The 0x80 test checks if an overflow will occur after and if it does, then we xor by 0x1b.

uint8_t gf_mul2(uint8_t x)
  return (x & 0x80) ? ((x << 1) ^ 0x1b) : x << 1;

Can also be written as

uint8_t gf_mul2(uint8_t x)
  return (x << 1) ^ ((x >> 7) * 0x1b);

The assembly code simply adds x to x and if carry occurs xors by 0x1b

    ; return (x & 0x80) ? ((x << 1) ^ 0x1b) : x << 1;
    add   al, al
    jnc   no_cy
    xor   al, 01bh

A branchless version taking 2-3 extra bytes

    add   al, al
    sbb   ah, ah
    and   ah, 01bh
    xor   al, ah


Each byte of the state is combined with a block of the round key using bitwise xor. The rnd value is used to obtain the correct subkey.

void AddRoundKey (aes_blk *state, uint32_t w[], int rnd)
  uint32_t i;
  uint8_t *key=(uint8_t*)&w[rnd*4];

  for (i=0; i<16; i++) {
    state->v8[i] ^= key[i];
    mov    cl, 4          ; do 4 words
    shl    ebx, cl        ; multiply rnd by 16
    mov    eax, [edi+ebx] ; w[i+rnd*16]
    xor    [esi], eax     ; state[i] ^= w[i+rnd*16]
    loop   ar_l1


A non-linear substitution step where each byte in state is replaced with another according to a lookup table.

We’re not using lookup tables here because the tables themselves would require 512 bytes and the code to generate the tables was slightly bigger than calculating when required.

Having said that, it may be possible to shrink the code further by optimizing generation of sbox tables and using lookups instead.

The initial version used a separate SubBytes function but since the order of SubBytes/ShiftRows doesn’t matter, it’s performed inside ShiftRows instead.

void SubBytes (aes_blk *state, int enc)
  int8_t i;

  for (i=0; i<16; i++) {
    state->v8[i] = SubByte(state->v8[i], enc);


A transposition step where the last three rows of the state are shifted cyclically a certain number of steps.

Imagine you have the state of 0x00-0x0F in 4×4 array.

uint8_t state[4][4]={
  { 0, 1, 2, 3},
  { 4, 5, 6, 7},
  { 8, 9,10,11},

Printed as columns/rows

 0  4  8  C
 1  5  9  D
 2  6  A  E
 3  7  B  F

After ShiftRows applied, this array will become

 0  4  8  C
 5  9  D  1
 A  E  2  6
 F  3  7  B

For encryption, The 1st row is unchanged, the 2nd is shifted left 1, the 3rd left 2 and 4th left 3.
The following function can be used for encryption/decryption based on enc value.

    xor    ebx, ebx          ; i = 0
    mov    ebp, ecx          ; save enc flag in ebp
    mov    cl, 4             ; do 4 rows
    mov    edi, esi
    mov    cl, 4             
sr_l2:                       ; get row
    call   SubByte
    shrd   edx, eax, 8
    loop   sr_l2
    ; ---------------------
    lea    ecx, [ecx+ebx*8]
    test   ebp, ebp
    jz     sr_d
    ror    edx, cl           ; rotate right for encryption
    db     83h ;mask rol
    rol    edx, cl           ; rotate left for decryption
    mov    cl, 4
    mov    [edi], dl
    ror    edx, 8
    loop   sr_l4
    inc    ebx
    inc    esi
    loop   sr_l1

As said, because it doesn’t matter the order of executing ShiftRows/SubBytes, both are combined to reduce space.

You’re probably wondering what 0x83 byte is doing there before sr_d label.
Disassembled, it looks like the following.

/* 003C */ "x85xed"          /* test ebp, ebp             */
/* 003E */ "x74x03"          /* jz 0x43                   */
/* 0040 */ "xd3xca"          /* ror edx, cl               */
/* 0042 */ "x83xd3xc2"       /* adc ebx, 0xffffffc2       */
/* 0045 */ "x92"             /* xchg edx, eax             */
/* 0046 */ "xb1x04"          /* mov cl, 0x4               */

The technique is called “instruction masking”, and dates back to at least the late 70s (and was popular on the 6502 CPU). The idea is that instead of using a branch instruction (that is, a multi-byte instruction) to skip over another instruction, you use a single-byte instruction instead, whenever the effects can be ignored.

In our case, the 0x83 becomes an ADC instruction, which affects a register that we don’t use, and the 0x3D (which you will see used in aes_encryptx) is a CMP instruction, which affects the flags we don’t need at that moment. Although ebx is used as a counter, pushad/popad saves and restores the value before being used again.


A mixing operation which operates on the columns of the state, combining the four bytes in each column.

uint32_t MixColumn (uint32_t w) {
  return ROTR32(w, 8) ^ 
         ROTR32(w, 16) ^ 
         ROTR32(w, 24) ^ 
         gf_mul2(ROTR32(w, 8) ^ w);

void MixColumns (uint32_t *state, int enc)
  uint32_t i, t, w;

  for (i=0; i<4; i++)
    w = state[i];
    if (enc==AES_DECRYPT) {
      t = ROTR32(w, 16) ^ w;
      t = gf_mul2(gf_mul2(t));
      w ^= t;
    state[i] = MixColumn(w);

The assembly version uses 8 xors to perform this operation.

%define w0 eax
%define w1 ebx
%define w2 edx
%define w3 ebp

%define t eax
%define w ebp

    push   t
    mov    t, w
    and    t, 080808080h
    xor    w, t
    add    w, w
    shr    t, 7
    imul   t, t, 01bh
    xor    w, t
    pop    t
; ***********************************************
; void MixColumns (void *state, int enc)
; ***********************************************
    lea    edi, [ebp+(gf_mul2word-ShiftRows)]
    test   ecx, ecx
    mov    cl, 4
    lodsd                    ; w0 = state[i];
    jne    mc_l2
    mov    w3, w0
    ror    w3, 16
    xor    w3, w0
    call   edi               ; gf_mul2word
    call   edi               ; gf_mul2word
    xor    w0, w3
    mov    w3, w0
    ror    w3, 8
    mov    w1, w3
    xor    w3, w0
    call   edi
    xor    w1, w3
    ror    w0, 16
    xor    w1, w0
    ror    w0, 8
    xor    w1, w0
    mov    [esi-4], w1
    loop   mc_l1

Key Expansion

Expands a short key into a number of separate 128-bit round keys. This can be modified to work with 128 or 192-bit by redefining Nk and Nr.

void aes_setkey (aes_ctx *ctx, void *key)
  int i;
  uint32_t x;
  uint32_t *w=(uint32_t*)ctx->w;
  uint8_t rcon=1;

  for (i=0; i<Nk; i++) {

  for (i=Nk; i<Nb*(Nr+1); i++)
    if ((i % Nk)==0) {
      x = RotWord(x);
      x = SubWord(x) ^ rcon;
    } else if (Nk > 6 && i % Nk == 4) {
    w[i] = w[i-Nk] ^ x;

The assembly version is the same except we only expect to setup 256-bit keys so the else if (Nk > 6) condition isn’t evaluated.

    mov    edi, [esp+32+4]   ; ctx
    mov    esi, [esp+32+8]   ; key

    push   Nk
    pop    ecx
    rep    movsd

    push   1
    pop    edx                ; rcon = 1
    mov    eax, [edi-4]       ; x=w[i-1];
    test   cl, Nk-1           ; (i % Nk)==0
    jnz    sk_ei
    ror    eax, 8             ; x = RotWord(x);
    call   SubWord            ; x = SubWord(x)
    xor    eax, edx           ; x ^= rcon;
    add    dl, dl
%ifdef SCR
    sbb    bl, bl
    and    bl, 1bh
    xor    dl, bl
    jnc    sk_sw
    xor    dl, 1bh
    jmp    sk_sw
    test   cl, 3              ; (i % 4)==0
    jne    sk_sw
    call   SubWord
    xor    eax, [edi-4*Nk]
    inc    ecx
    cmp    ecx, Nb*Nr
    jne    sk_l1


void aes_encrypt (aes_ctx *ctx, void *state, int enc)
  uint8_t round;
  uint32_t *w=(uint32_t*)ctx->w;

  if (enc==AES_ENCRYPT)
    AddRoundKey (state, w, 0);

    for (round=1; round<Nr; round++)
      SubBytes (state, enc);
      ShiftRows (state, enc);
      MixColumns (state, enc);
      AddRoundKey (state, w, round);
    AddRoundKey (state, w, Nr);

    for (round=Nr-1; round>0; round--)
      ShiftRows (state, enc);
      SubBytes (state, enc);
      AddRoundKey (state, w, round);
      MixColumns (state, enc);

  ShiftRows (state, enc);
  SubBytes (state, enc);
  AddRoundKey (state, w, round);

The assembly version is compacted by loading the address of 3 functions into registers and swapping 2 depending on encryption or decryption. AddRoundKey goes in eax initially.

ShiftRows/SubBytes goes into ebp and MixColumns goes into edx. If ecx==AES_ENCRYPT, then MixColumns and AddRoundKey are swapped because they’re executed in different order.

    lea    esi, [esp+32+ 4]
    xchg   edi, eax         ; ctx
    xchg   ecx, eax
    xchg   ecx, eax         ; enc
    xchg   esi, eax         ; state
    call   ld_fn

; 3 functions here

    pop    eax
    lea    ebp, [eax+(ShiftRows -AddRoundKey)]
    lea    edx, [ebp+(MixColumns-ShiftRows)]
    ; ************************************
    push   Nr
    pop    ebx
    jecxz  do_ark
    xor    ebx, ebx
    call   eax ; AddRoundKey
    push   eax
    jecxz  aes_edm
    xchg   edx, eax
    ;;jmp    aes_edm
    db     03dh ; mask call eax/call edx

    call   eax ; MixColumns / AddRoundKey
    call   edx ; AddRoundKey / MixColumns

    call   ebp ; ShiftRows + SubBytes
    dec    ebx
    jecxz  aes_edl
    inc    ebx
    inc    ebx
    cmp    ebx, Nr
    jne    aes_ed
    pop    eax
    call   eax ; AddRoundKey

The instruction masking byte used here is 0x3D which performs a compare, skipping over both call eax and call edx instructions for the first round of encryption or decryption.

 /* 00D5 */ "xe3x02"           /* jecxz 0xd9                */
 /* 00D7 */ "x31xdb"           /* xor ebx, ebx              */
 /* 00D9 */ "xffxd0"           /* call eax                  */
 /* 00DB */ "x50"              /* push eax                  */
 /* 00DC */ "xe3x06"           /* jecxz 0xe4                */
 /* 00DE */ "x92"              /* xchg edx, eax             */
 /* 00DF */ "x3dxffxd0xffxd2"  /* cmp eax, 0xd2ffd0ff       */
 /* 00E4 */ "xffxd5"           /* call ebp                  */
 /* 00E6 */ "x4b"              /* dec ebx                   */

Side-Channel Attack Resistant

Since someone pointed out the initial code was not resistant to side-channel attack, Peter added additional code which should prevent this. It can be enabled by defining SCR before assembly.


architecture size in bytes
x86 377
x86 398 (side-channel resistant)
x64 441
x64 462 (side-channel resistant)


Future updates to assembly code may not be shown here. Refer to AES projects in future here and here.

This entry was posted in assembly, cryptography, encryption, programming, security and tagged , , , , , , . Bookmark the permalink.

6 Responses to AES-256 (Rijndael) Block cipher

  1. Pingback: AES w 420 bajtach - Security News

  2. Pingback: Asmcodes: DES | Odzhan

  3. Pingback: Asmcodes: Twofish-256 | Odzhan

  4. Pingback: Asmcodes: Twofish-256 | x86 crypto

  5. Pingback: Shellcode: A Windows PIC using RSA-2048 key exchange, AES-256, SHA-3 | modexp

  6. Pingback: DES Block cipher | x86 crypto

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s