AES-128 Block Cipher

Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. A draft criteria to evaluate potential algorithms was published, and members of the public were invited to provide feedback. The finalized criteria was published in September 1997 which outlined a minimum acceptable requirement for each submission.

4 years later in November 2001, Rijndael by Belgian Cryptographers Vincent Rijmen and Joan Daemen which we now refer to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code which executes the quickest has traditionally taken priority over how much ROM it uses. Developers will use lookup tables to accelerate each step of the encryption process, thus compact implementations are rarely if ever sought after.

Our challenge here is to implement AES in the least amount of C and more specifically x86 assembly code. It will obviously result in a slow implementation, and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary.

AES Parameters

There are three different set of parameters available, with the main difference related to key length. Our implementation will be AES-128 which fits perfectly onto a 32-bit architecture.

Key Length
(Nk words)
Block Size
(Nb words)
Number of Rounds
(Nr)
AES-128 4 4 10
AES-192 6 4 12
AES-256 8 4 14

Structure of AES

Two IF statements are introduced in order to perform the encryption in one loop. What isn’t included in the illustration below is ExpandRoundKey and AddRoundConstant which generate round keys.

The first layout here is what we normally see used when describing AES. The second introduces 2 conditional statements which makes the code more compact.

Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. The following performs encryption, and results in approx. 400 bytes of x86 assembly.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;
// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
    B i,y,c;
    if(x){
      for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
      x=y;F(4)x^=y=(y<<1)|(y>>7);
    }
    return x^99;
}
void E(B *s){
    W i,w,x[8],c=1,*k=(W*)&x[4];
    
    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];

    for(;;){
      // AddRoundKey, 1st part of ExpandRoundKey
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

      // 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;

      // if round 11, stop;
      if(c==108)break;
      // update round constant
      c=M(c);
      
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

      // if not round 11, MixColumns
      if(c!=108)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register Description Used by
eax Accumulator lods, stos, scas, xlat, mul, div
ebx Base xlat
ecx Count loop, rep (conditional suffixes E/Z and NE/NZ)
edx Data cdq, mul, div
esi Source Index lods, movs, cmps
edi Destination Index stos, movs, scas, cmps
ebp Base Pointer enter, leave
esp Stack Pointer pushad, popad, push, pop, call, enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB will load a byte from memory pointer in SI to AL before incrementing SI by 1. STOSB will store a byte in AL to memory pointer in DI before incrementing DI by 1. MOVSB will move a byte from memory pointer in SI to memory pointer in DI, before adding 1 to both SI and DI. If the same instruction is preceded by REP (for repeat) then this also affects the CX register, decreasing by 1.

Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and 16-byte master key which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to s while EDI will point to x

EAX will hold Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting position of zero. There’s a reason to prefer EAX than other registers. Byte comparisons are only 2 bytes for AL, while 3 for others.

// 2 vs 3 bytes
  /* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
  /* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
  /* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
  /* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes

// 1 vs 2 bytes
  /* 0001 */ "\x92"                 /* xchg edx, eax        */
  /* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be : XOR EAX, EAX; INC EAX

// 7 bytes
  /* 0001 */ "\x6a\x01"             /* push 0x1             */
  /* 0003 */ "\x58"                 /* pop eax              */
  /* 0004 */ "\x6a\x04"             /* push 0x4             */
  /* 0006 */ "\x59"                 /* pop ecx              */
  /* 0007 */ "\x99"                 /* cdq                  */

Another way …

// 7 bytes
  /* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
  /* 0003 */ "\xf7\xe1"             /* mul ecx              */
  /* 0005 */ "\x40"                 /* inc eax              */
  /* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */

And another..

// 7 bytes
  /* 0000 */ "\x6a\x01"             /* push 0x1             */
  /* 0002 */ "\x58"                 /* pop eax              */
  /* 0003 */ "\x99"                 /* cdq                  */
  /* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */

ESI will point to s which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS.

Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32-bytes of stack space, and PUSHAD which saves 8 general purpose registers on the stack is exactly 32-bytes of memory, executed in 1 byte opcode.

To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are x86 opcodes generated by assembler.

// 5 bytes
  /* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
  /* 0004 */ "\xc9"             /* leave           */
  
// 6 bytes
  /* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
  /* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */
  
// 2 bytes
  /* 0000 */ "\x60"             /* pushad          */
  /* 0001 */ "\x61"             /* popad           */

Obviously the 2-byte example is better here, but once you require more than 96-bytes, usually ADD/SUB in combination with a register is the better option.

; *****************************
; void E(void *s);
; *****************************
_E:
    pushad
    xor    ecx, ecx           ; ecx = 0
    mul    ecx                ; eax = 0, edx = 0
    inc    eax                ; c = 1
    mov    cl, 4
    pushad                    ; alloca(32)
; F(8)x[i]=((W*)s)[i];
    mov    esi, [esp+64+4]    ; esi = s
    mov    edi, esp
    pushad
    add    ecx, ecx           ; copy state + master key to stack
    rep    movsd
    popad

Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

  1. EBP has no 8-bit registers, so we can’t use it for any 8-bit operations.
  2. Indirect memory access requires 1 byte more for index zero.
  3. The only instructions that use EBP are ENTER and LEAVE.
// 2 vs 3 bytes for indirect access  
  /* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
  /* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

; *****************************
; Multiplication over GF(2**8)
; *****************************
    call   $+21               ; save address      
    push   ecx                ; save ecx
    mov    cl, 4              ; 4 bytes
    add    al, al             ; al <<= 1
    jnc    $+4                ;
    xor    al, 27             ;
    ror    eax, 8             ; rotate for next byte
    loop   $-9                ; 
    pop    ecx                ; restore ecx
    ret
    pop    ebp

SubByte

In the SubBytes step, each byte a_{i,j} in the state matrix is replaced with S(a_{i,j}) using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over GF(2^8), and we can implement SubByte purely using code.

; *****************************
; B SubByte(B x)
; *****************************
sub_byte:  
    pushad 
    test   al, al            ; if(x){
    jz     sb_l6
    xchg   eax, edx
    mov    cl, -1            ; i=255 
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
    mov    al, 1             ; y=1
sb_l1:
    test   ah, ah            ; !c
    jnz    sb_l2    
    cmp    al, dl            ; y!=x
    setz   ah
    jz     sb_l0
sb_l2:
    mov    dh, al            ; y^=M(y)
    call   ebp               ;
    xor    al, dh
    loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
    mov    dl, al            ; dl=y
    mov    cl, 4             ; i=4  
sb_l5:
    rol    dl, 1             ; y=R(y,1)
    xor    al, dl            ; x^=y
    loop   sb_l5             ; i--
sb_l6:
    xor    al, 99            ; return x^99
    mov    [esp+28], al
    popad
    ret

AddRoundKey

The state matrix is combined with a subkey using the bitwise XOR operation. This step known as Key Whitening was inspired by the mathematician Ron Rivest, who in 1984 applied a similar technique to the Data Encryption Standard (DES) and called it DESX.

; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
    pushad
    xchg   esi, edi           ; swap x and s
xor_key:
    lodsd                     ; eax = x[i]
    xor    eax, [edi+16]      ; eax ^= k[i]
    stosd                     ; s[i] = eax
    loop   xor_key
    popad

AddRoundConstant

There are various cryptographic attacks possible against AES without this small, but important step. It protects against the Slide Attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants to generate round keys, all the round keys will be the same.

; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
    xor    [esi+16], al
    call   ebp

ExpandRoundKey

The operation to expand the master key into subkeys for each round of encryption isn’t normally in-lined. To boost performance, these round keys are precomputed before the encryption process since you would only waste CPU cycles repeating the same computation which is unnecessary.

Compacting the AES code into a single call requires in-lining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
    pushad
    add    esi,16
    mov    eax, [esi+3*4]    ; w=k[3]
    ror    eax, 8            ; w=R(w,8)
exp_l1:
    call   S                 ; w=S(w)
    ror    eax, 8            ; w=R(w,8);
    loop   exp_l1
    mov    cl, 4
exp_l2:
    xor    [esi], eax        ; k[i]^=w
    lodsd                    ; w=k[i]
    loop   exp_l2
    popad

Combining the steps

An earlier version of the code used separate AddRoundKey, AddRoundConstant, and ExpandRoundKey, but since these steps all relate to using and updating the round key, the 3 steps are combined in order to reduce the number of loops, thus shaving off a few bytes.

; *****************************
; AddRoundKey, AddRoundConstant, ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
    pushad
    xchg   eax, edx
    xchg   esi, edi
    mov    eax, [esi+16+12]  ; w=R(k[3],8);
    ror    eax, 8
xor_key:
    mov    ebx, [esi+16]     ; t=k[i];
    xor    [esi], ebx        ; x[i]^=t;
    movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
    call   sub_byte          ; al=S(al);
    ror    eax, 8            ; w=R(w,8);
    loop   xor_key
; w=R(w,8)^c;
    xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
    mov    cl, 4
exp_key:
    xor    [esi], eax        ; k[i]^=w;
    lodsd                    ; w=k[i];
    loop   exp_key
    popad

Shifting Rows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively.

Because it doesn’t matter about the order of SubBytes and ShiftRows, they’re combined in one loop.

; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
    pushad
    mov    cl, 16
shift_rows:
    lodsb                    ; al = S(s[i])
    call   sub_byte
    push   edx
    mov    ebx, edx          ; ebx = i%4
    and    ebx, 3            ;
    shr    edx, 2            ; (i/4 - ebx) % 4
    sub    edx, ebx          ; 
    and    edx, 3            ; 
    lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
    mov    [edi+ebx], al     ; x[ebx] = al
    pop    edx
    inc    edx
    loop   shift_rows
    popad

Mixing Columns

The MixColumns transformation along with ShiftRows are the main source of diffusion. Each column is treated as a four-term polynomial b(x)=b_{3}x^{3}+b_{2}x^{2}+b_{1}x+b_{0}, where the coefficients are elements over {GF} (2^{8}), and is then multiplied modulo x^{4}+1 with a fixed polynomial a(x)=3x^{3}+x^{2}+x+2

; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    pushad
mix_cols:
    mov    eax, [edi]        ; w0 = x[i];
    mov    ebx, eax          ; w1 = w0;
    ror    eax, 8            ; w0 = R(w0,8);
    mov    edx, eax          ; w2 = w0;
    xor    eax, ebx          ; w0^= w1;
    call   ebp               ; w0 = M(w0);
    xor    eax, edx          ; w0^= w2;
    ror    ebx, 16           ; w1 = R(w1,16);
    xor    eax, ebx          ; w0^= w1;
    ror    ebx, 8            ; w1 = R(w1,8);
    xor    eax, ebx          ; w0^= w1;
    stosd                    ; x[i] = w0;
    loop   mix_cols
    popad
    jmp    enc_main

Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

As you can see, blocks of the same data using the same key result in the exact same ciphertexts, which is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption which uses Counter (CTR) mode to provide confidentiality.

The concept of CTR mode which turns a block cipher into a stream cipher was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography.

CTR mode works by encrypting a nonce and counter, then using the ciphertext to encrypt our plain text using a simple XOR operation. Since AES encrypts 16-byte blocks, a counter can be 8-bytes, and a nonce 8-bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

// encrypt using Counter (CTR) mode
void encrypt(W l, B*c, B*p, B*k){
    W i,r;
    B t[32];

    // copy master key to local buffer
    F(16)t[i+16]=k[i];

    while(l){
      // copy counter+nonce to local buffer
      F(16)t[i]=c[i];
      
      // encrypt t
      E(t);
      
      // XOR plaintext with ciphertext
      r=l>16?16:l;
      F(r)p[i]^=t[i];
      
      // update length + position
      l-=r;p+=r;
      
      // update counter
      for(i=16;i>0;i--)
        if(++c[i-1])break;
    }
}

In assembly

; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
    pushad
    lea    esi,[esp+32+4]
    lodsd
    xchg   eax, ecx          ; ecx = len
    lodsd
    xchg   eax, ebp          ; ebp = ctr
    lodsd
    xchg   eax, edx          ; edx = in
    lodsd
    xchg   esi, eax          ; esi = key
    pushad                   ; alloca(32)
; copy master key to local buffer
; F(16)t[i+16]=key[i];
    lea    edi, [esp+16]     ; edi = &t[16]
    movsd
    movsd
    movsd
    movsd
aes_l0:
    xor    eax, eax
    jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
    mov    edi, esp          ; edi = t
    mov    esi, ebp          ; esi = ctr
    push   edi
    movsd
    movsd
    movsd
    movsd
; encrypt t    
    call   _E                ; E(t)
    pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
    mov    bl, [edi+eax]     ; 
    xor    [edx], bl         ; *in++^=t[i];
    inc    edx               ; 
    inc    eax               ; i++
    cmp    al, 16            ;
    loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
    xchg   eax, ecx          ; 
    mov    cl, 16
aes_l2:
    inc    byte[ebp+ecx-1]   ;
    loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
    xchg   eax, ecx
    jmp    aes_l0
aes_l3:
    popad
    popad
    ret

AES-256

There is currently no assembly written for this, but keep an eye on the repo for updates.

void E(B *s) {
    W i,r=0,w,x[12],c=1,*k=(W*)&x[4];

    // copy 128-bit plain text + 256-bit master key to x
    F(12)x[i]=((W*)s)[i];

    for(;;) {
      // 1st part of ExpandKey
      w=k[r?3:7];
      F(4)w=(w&-256)|S(w),w=R(w,8); 
      // AddConstant, update constant
      if(!r)w=R(w,8)^c,c=M(c);
      // AddRoundKey, 2nd part of ExpandKey
      F(4)((W*)s)[i]=x[i]^k[r*4+i], w=k[r*4+i]^=w;
      // if round 15, stop
      if(c==27) break;
      r=(r+1)&1;
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
      // if not round 15, MixColumns    
      if((c!=128) | r)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

8-bit architectures

The following is an implementation specifically for 8-bit architectures.

void E(B *s) {
    B a,b,c,d,i,j,t,x[32],rc=1,*k=&x[16];
    
    // copy 128-bit plain text + 128-bit master key to x
    F(32)x[i]=s[i];

    for(;;) {
      // AddRoundKey
      F(16)s[i]=x[i]^k[i];
      // if round 11, stop
      if(rc==108)break;
      // ExpandKey
      F(4)k[i]^=S(k[12+((i-3)&3)]);k[0]^=rc;
      F(16)k[i+4]^=k[i];
      // update round constant
      rc=M(rc);
      // SubBytes and ShiftRows
      F(16)x[(i&3)+(((W)(i/4)-(i&3))&3)*4]=S(s[i]);
      // if not round 11
      if(rc!=108) {
        // MixColumns
        for(i=0;i<16;i+=4) {
          a=x[i],b=x[i+1],c=x[i+2],d=x[i+3];
          for(j=0;j<4;j++) {
            x[i+j]^=a^b^c^d^M(a^b);
            t=a,a=b,b=c,c=d,d=t;
          }
        }
      }
    }
}

Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode.

Check sources here.

Advertisements
Posted in arm, assembly, cryptography, programming, security, x86 | Tagged , , , | 4 Comments

Xoodoo Permutation Function

Introduction

Xoodoo is a new permutation function designed by the Keccak team along with Seth Hoffert and Johan De Meulder.

It was described by Joan Daemen at the 2017 Workshop on Elliptic Curve Cryptography in a talk: Innovations in permutation-based crypto

The design is very similar to the Keccak permutation function, but obviously draws inspiration from the design of Gimli which also uses a 384-bit state, and works efficiently on many platforms.

The C code shown here is derived from a python implementation here

September 2018 UPDATE: The Keccak team have posted their own implementations here with further documentation in the Xoodoo cookbook.

C code

When calling this function, the p parameter should point to a 48-byte buffer.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}

x86 Assembly

%define x0 eax
%define x1 ebx
%define x2 edx

xoodoo:
_xoodoo:
    pushad
    mov    esi, [esp+32+4]   ; esi = state
    xor    ecx, ecx          ; ecx = 0
    mul    ecx               ; eax = 0, edx = 0
    pushad                   ; allocate 32 bytes of memory
    mov    edi, esp          ; edi = e
    call   ld_rc
    dw     0x58,  0x38, 0x3c0, 0xd0
    dw     0x120, 0x14,  0x60, 0x2c
    dw     0x380, 0xf0, 0x1a0, 0x12
ld_rc:
    pop    ebx                  ; ebx = rc
xoo_main:
    pushad                      ; save all
    movzx  ebp, word[ebx+eax*2] ; ebp = rc[r]
    mov    cl, 4                ; i = 4
    pushad                      ; save all
    ;
    ; Theta
    ;
    ; e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    ; e[i]^= ROTR32(e[i], 9);
xd_l1:
    lodsd                    ; eax  = x[i]
    xor    eax, [esi+16-4]   ; eax ^= x[i+4]
    xor    eax, [esi+32-4]   ; eax ^= x[i+8]
    ror    eax, 18
    mov    ebx, eax
    ror    ebx, 9
    xor    eax, ebx
    stosd                    ; e[i] = eax
    loop   xd_l1
    popad                    ; restore all
    ; x[i]^= e[(i - 1) & 3];
    dec    edx               ; edx = -1
    mov    cl, 12
xd_l2:
    mov    eax, edx          ; eax = edx & 3
    and    eax, 3
    mov    eax, [edi+eax*4]  ; eax = e[(i - 1) & 3]
    inc    edx               ; i++
    xor    [esi+edx*4], eax  ; x[i] ^= eax
    loop   xd_l2

    ; XCHG(x[7], x[4]);
    mov    eax, [esi+7*4]
    xchg   eax, [esi+4*4]

    ; XCHG(x[7], x[5]);
    xchg   eax, [esi+5*4]

    ; XCHG(x[7], x[6]);
    xchg   eax, [esi+6*4]
    mov    [esi+7*4], eax

    ; x[0] ^= rc[r];
    xor    [esi], ebp
    mov    cl, 4
    pushad
xd_l6:
    ; x0 = x[i+0];
    lodsd

    ; x1 = x[i+4];
    mov    x1, [esi+16-4]

    ; x2 = ROTR32(x[i+8], 21);
    mov    x2, [esi+32-4]
    ror    x2, 21

    ; x[i+8] = ROTR32((~x0 & x1) ^ x2, 24);
    not    x0
    and    x0, x1
    xor    x0, x2
    ror    x0, 24
    mov    [esi+32-4], x0

    ; x[i+4] = ROTR32((~x2 & x0) ^ x1, 31);
    push   x2
    not    x2
    and    x2, [esi-4]
    xor    x2, x1
    ror    x2, 31
    mov    [esi+16-4], x2
    pop    x2

    ; x[i+0] ^= ~x1 & x2;
    not    x1
    and    x1, x2
    xor    [esi-4], x1
    loop   xd_l6
    popad

    ; XCHG(x[8], x[10]);
    ; XCHG(x[9], x[11]);
    mov    eax, [esi+8*4]
    mov    ebp, [esi+9*4]

    xchg   eax, [esi+10*4]
    xchg   ebp, [esi+11*4]

    mov    [esi+8*4], eax
    mov    [esi+9*4], ebp

    popad

    ; --r
    inc    eax
    cmp    al, 12
    jnz    xoo_main

    ; release memory
    popad
    ; restore registers
    popad
    ret

ARM assembly

ARM offers a nice instruction BIC which is the equivilant of (x & ~y) in C.

This instruction is used in the Rho and Chi modules combined.

.arm
  .arch armv6
  .text
  .align  2

  .global xoodoo

state .req r0
x     .req r1

r     .req r2
i     .req r3

x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7

rc    .req r8
xt    .req r9

e     .req sp

xoodoo:
    // save registers
    push   {r0-r12, lr}

    mov    r, #12              // 12 rounds
    sub    sp, #16             // allocate 16 bytes
    adr    rc, rc_tab
xoodoo_main:
    mov    i, #0               // i = 0
    mov    x, state
theta_l0:
    ldr    x2, [x, #32]        // x2 = x[i+8]
    ldr    x1, [x, #16]        // x1 = x[i+4]
    ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

    // e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    eor    x0, x1
    eor    x0, x2
    ror    x0, #18

    // e[i]^= ROTR32(e[i], 9);
    eor    x0, x0, ror #9
    str    x0, [sp, i, lsl #2]  // store in e

    add    i, #1               // i++
    cmp    i, #4               // i<4
    bne    theta_l0            //

    // x[i]^= e[(i - 1) & 3];
    mov    i, #0               // i = 0
    mov    x, state            // x = state
theta_l1:
    sub    xt, i, #1
    and    xt, #3               // xt = i & 3
    ldr    xt, [sp, xt, lsl #2] // xt = e[(i - 1) & 3]
    ldr    x0, [x, i, lsl #2]   // x0 = x[i]
    eor    x0, xt               // x0 ^= xt
    str    x0, [x, i, lsl #2]   // x[i] = x0
    add    i, #1                // i++
    cmp    i, #12               // i<12
    bne    theta_l1

    // Rho west
    // XCHG(x[7], x[4]);
    // XCHG(x[7], x[5]);
    // XCHG(x[7], x[6]);
    add    x, state, #16       // x = &state[4]
    ldm    x, {x0, x1, x2, x3}
    mov    xt, x0              // xt = x[4]
    mov    x0, x3              // x[4] = x[7]
    mov    x3, x2              // x[7] = x[6]
    mov    x2, x1              // x[6] = x[5]
    mov    x1, xt              // x[5] = xt
    stm    x, {x0, x1, x2, x3}

    mov    x, state

    // Iota
    // x[0] ^= rc[r];
    ldrh   xt, [rc], #2        // load half-word, advance by 2
    ldr    x0, [x]             // load word
    eor    x0, xt              // xor
    str    x0, [x]             // store word

    mov    i, #4
chi:
    // Chi and Rho east
    // x0 = x[i+0];
    ldr    x0, [x]

    // x1 = x[i+4];
    ldr    x1, [x, #16]

    // x2 = ROTR32(x[i+8], 21);
    ldr    x2, [x, #32]
    ror    x2, #21

    // x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
    bic    xt, x1, x0
    eor    xt, x2
    ror    xt, #24
    str    xt, [x, #32]

    // x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
    bic    xt, x0, x2
    eor    xt, x1
    ror    xt, #31
    str    xt, [x, #16]

    // x[i+0]^= x2 & ~x1;
    bic    xt, x2, x1
    eor    xt, x0
    str    xt, [x], #4

    // i--
    subs   i, #1
    bne    chi

    add    x, state, #32       // x = &state[8]

    // XCHG(x[8], x[10]);
    ldm    x, {x0, x1, x2, x3}
    push   {x0}
    mov    x0, x2
    pop    {x2}

    // XCHG(x[9], x[11]);
    push   {x1}
    mov    x1, x3
    pop    {x3}
    stm    x, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    xoodoo_main         // r>0

    // release stack
    add    sp, #16

    // restore registers, and return
    pop    {r0-r12, pc}

    // round constants
rc_tab:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

ARM64 assembly

// Xoodoo in ARM64 assembly
// 268 bytes

    .arch armv8-a
    .text

    .global xoodoo

xoodoo:
    sub    sp, sp, 16          // allocate 16 bytes
    adr    x8, rc
    mov    w9, 12               // 12 rounds
L0:
    mov    w7, 0                // i = 0
    mov    x1, x0
L1:
    ldr    w4, [x1, 32]         // w4 = s[i+8]
    ldr    w3, [x1, 16]         // w3 = s[i+4]
    ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

    // e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
    eor    w2, w2, w3 
    eor    w2, w2, w4 
    ror    w2, w2, 18

    // e[i] ^= R(e[i], 9);
    eor    w2, w2, w2, ror 9
    str    w2, [sp, x7, lsl 2]  // store in e

    add    w7, w7, 1            // i++
    cmp    w7, 4                // i < 4
    bne    L1                   //

    // s[i]^= e[(i - 1) & 3];
    mov    w7, 0                // i = 0
L2:
    sub    w2, w7, 1
    and    w2, w2, 3            // w2 = i & 3
    ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
    ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
    eor    w3, w3, w2           // w3 ^= w2 
    str    w3, [x0, x7, lsl 2]  // s[i] = w3 
    add    w7, w7, 1            // i++
    cmp    w7, 12               // i < 12
    bne    L2 

    // Rho west
    // X(s[7], s[4]);
    // X(s[7], s[5]);
    // X(s[7], s[6]);
    ldp    w2, w3, [x0, 16]
    ldp    w4, w5, [x0, 24]
    stp    w5, w2, [x0, 16]
    stp    w3, w4, [x0, 24]

    // Iota
    // s[0] ^= *rc++;
    ldrh   w2, [x8], 2         // load half-word, advance by 2
    ldr    w3, [x0]            // load word
    eor    w3, w3, w2          // xor
    str    w3, [x0]            // store word

    mov    w7, 4
    mov    x1, x0
L3:
    // Chi and Rho east
    // a = s[i+0];
    ldr    w2, [x1]

    // b = s[i+4];
    ldr    w3, [x1, 16]

    // c = R(s[i+8], 21);
    ldr    w4, [x1, 32]
    ror    w4, w4, 21

    // s[i+8] = R((b & ~a) ^ c, 24);
    bic    w5, w3, w2 
    eor    w5, w5, w4 
    ror    w5, w5, 24
    str    w5, [x1, 32]

    // s[i+4] = R((a & ~c) ^ b, 31);
    bic    w5, w2, w4 
    eor    w5, w5, w3 
    ror    w5, w5, 31
    str    w5, [x1, 16]

    // s[i+0]^= c & ~b;
    bic    w5, w4, w3 
    eor    w5, w5, w2 
    str    w5, [x1], 4

    // i--
    subs   w7, w7, 1
    bne    L3 

    // X(s[8], s[10]);
    // X(s[9], s[11]);
    ldp    w2, w3, [x0, 32] // 8, 9
    ldp    w4, w5, [x0, 40] // 10, 11
    stp    w2, w3, [x0, 40]
    stp    w4, w5, [x0, 32]

    subs   w9, w9, 1           // r--
    bne    L0                  // r != 0

    // release stack
    add    sp, sp, 16
    ret
    // round constants
rc:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

Summary

Xoodoo with 12 rounds and a 384-bit state is much more compact than Keccak with 22 rounds and an 800-bit state. Presumably Keccak is more secure, but cannot be vectorized and perform as well as Xoodoo or Gimli.

The x86 assembly here is 186 bytes while the C generated assembly is 300.

Sources here.

Posted in arm, assembly, cryptography, programming, x86 | Tagged , , , , , , | Leave a comment