AES-128 / AES-256 Block Cipher

Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES) published in 1977. Draft criteria to evaluate potential algorithms were published, and members of the public were invited to provide feedback. The finalized criteria, published in September 1997, outlined the minimum acceptable requirements for each submission. Four years later, in November 2001, Rijndael, designed by the Belgian cryptographers Vincent Rijmen and Joan Daemen and now referred to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code that executes quickest has traditionally taken priority over how much ROM it uses. Developers use lookup tables to accelerate each step of the encryption process, so compact implementations are rarely, if ever, sought after. Our challenge here is to implement AES in the least amount of C and, more specifically, x86 assembly code. The result will obviously be slow and will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary. There are also implementations in 32-bit and 64-bit ARM assembly, and the 8-bit version of the C code has been successfully tested on an Arduino Uno.

Parameters

There are three different sets of parameters available, with the main difference being key length. Our implementation will be AES-128, which fits perfectly onto a 32-bit architecture.

          Key Length   Block Size   Number of Rounds
          (Nk words)   (Nb words)   (Nr)
AES-128   4            4            10
AES-192   6            4            12
AES-256   8            4            14

Structure

Two IF statements are introduced in order to perform the encryption in one loop. What isn't included in the illustration below are ExpandRoundKey and AddRoundConstant, which generate the round keys.

The first layout here is what we normally see used when describing AES. The second introduces two conditional statements that make the code more compact.

Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. The following performs encryption and compiles to approximately 400 bytes of x86 assembly.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;
// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
    B i,y,c;
    if(x){
      for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
      x=y;F(4)x^=y=(y<<1)|(y>>7);
    }
    return x^99;
}
void E(B *s){
    W i,w,x[8],c=1,*k=(W*)&x[4];
    
    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];

    for(;;){
      // AddRoundKey, 1st part of ExpandRoundKey
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

      // 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;

      // if round 11, stop;
      if(c==108)break;
      // update round constant
      c=M(c);
      
      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

      // if not round 11, MixColumns
      if(c!=108)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

Register Description Used by
eax Accumulator lods, stos, scas, xlat, mul, div
ebx Base xlat
ecx Count loop, rep (conditional suffixes E/Z and NE/NZ)
edx Data cdq, mul, div
esi Source Index lods, movs, cmps
edi Destination Index stos, movs, scas, cmps
ebp Base Pointer enter, leave
esp Stack Pointer pushad, popad, push, pop, call, enter, leave

Those of you familiar with the x86 architecture will know certain instructions have dependencies or affect the state of other registers after execution. For example, LODSB loads a byte from the memory pointed to by SI into AL before incrementing SI by 1. STOSB stores the byte in AL to the memory pointed to by DI before incrementing DI by 1. MOVSB moves a byte from the memory pointed to by SI to the memory pointed to by DI, before adding 1 to both SI and DI. If the same instruction is preceded by REP (for repeat), the CX register is also affected, decreasing by 1 on each repetition until it reaches zero.

Initialization

The s parameter points to a 32-byte buffer containing a 16-byte plain text and 16-byte master key which is copied to the local buffer x.

A copy of the data is required, because both will be modified during the encryption process. ESI will point to s while EDI will point to x. EAX will hold the Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops which require an index starting at zero. There's a reason to prefer EAX over other registers: byte comparisons with AL are only 2 bytes, while 3 are needed for the other registers.

// 2 vs 3 bytes
  /* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
  /* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
  /* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
  /* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes.

// 1 vs 2 bytes
  /* 0001 */ "\x92"                 /* xchg edx, eax        */
  /* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways requiring only 7 bytes. The alternative for setting EAX here would be : XOR EAX, EAX; INC EAX

// 7 bytes
  /* 0001 */ "\x6a\x01"             /* push 0x1             */
  /* 0003 */ "\x58"                 /* pop eax              */
  /* 0004 */ "\x6a\x04"             /* push 0x4             */
  /* 0006 */ "\x59"                 /* pop ecx              */
  /* 0007 */ "\x99"                 /* cdq                  */

Another way …

// 7 bytes
  /* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
  /* 0003 */ "\xf7\xe1"             /* mul ecx              */
  /* 0005 */ "\x40"                 /* inc eax              */
  /* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */

And another..

// 7 bytes
  /* 0000 */ "\x6a\x01"             /* push 0x1             */
  /* 0002 */ "\x58"                 /* pop eax              */
  /* 0003 */ "\x99"                 /* cdq                  */
  /* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */

ESI will point to s, which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS. Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32 bytes of stack space, and PUSHAD, which saves 8 general-purpose registers on the stack, allocates exactly 32 bytes of memory with a 1-byte opcode. To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are the x86 opcodes generated by an assembler.

// 5 bytes
  /* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
  /* 0004 */ "\xc9"             /* leave           */
  
// 6 bytes
  /* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
  /* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */
  
// 2 bytes
  /* 0000 */ "\x60"             /* pushad          */
  /* 0001 */ "\x61"             /* popad           */

Obviously the 2-byte example is better here, but once you require more than 96 bytes, ADD/SUB in combination with a register is usually the better option.

; *****************************
; void E(void *s);
; *****************************
_E:
    pushad
    xor    ecx, ecx           ; ecx = 0
    mul    ecx                ; eax = 0, edx = 0
    inc    eax                ; c = 1
    mov    cl, 4
    pushad                    ; alloca(32)
; F(8)x[i]=((W*)s)[i];
    mov    esi, [esp+64+4]    ; esi = s
    mov    edi, esp
    pushad
    add    ecx, ecx           ; copy state + master key to stack
    rep    movsd
    popad

Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

  1. EBP has no addressable 8-bit subregister, so we can't use it for any 8-bit operations.
  2. Indirect memory access via EBP requires 1 extra byte, even for a zero displacement.
  3. The only instructions that implicitly use EBP are ENTER and LEAVE.

// 2 vs 3 bytes for indirect access
  /* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
  /* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

; *****************************
; Multiplication over GF(2**8)
; *****************************
    call   $+21               ; save address      
    push   ecx                ; save ecx
    mov    cl, 4              ; 4 bytes
    add    al, al             ; al <<= 1
    jnc    $+4                ;
    xor    al, 27             ;
    ror    eax, 8             ; rotate for next byte
    loop   $-9                ; 
    pop    ecx                ; restore ecx
    ret
    pop    ebp

SubByte

In the SubBytes step, each byte a_{i,j} in the state matrix is replaced with S(a_{i,j}) using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over GF(2^8) followed by an affine transformation, so we can implement SubByte purely in code.

; *****************************
; B SubByte(B x)
; *****************************
sub_byte:  
    pushad 
    test   al, al            ; if(x){
    jz     sb_l6
    xchg   eax, edx
    mov    cl, -1            ; i=255 
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
    mov    al, 1             ; y=1
sb_l1:
    test   ah, ah            ; !c
    jnz    sb_l2    
    cmp    al, dl            ; y!=x
    setz   ah
    jz     sb_l0
sb_l2:
    mov    dh, al            ; y^=M(y)
    call   ebp               ;
    xor    al, dh
    loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
    mov    dl, al            ; dl=y
    mov    cl, 4             ; i=4  
sb_l5:
    rol    dl, 1             ; y=R(y,1)
    xor    al, dl            ; x^=y
    loop   sb_l5             ; i--
sb_l6:
    xor    al, 99            ; return x^99
    mov    [esp+28], al
    popad
    ret

AddRoundKey

The state matrix is combined with a subkey using the bitwise XOR operation.

; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
    pushad
    xchg   esi, edi           ; swap x and s
xor_key:
    lodsd                     ; eax = x[i]
    xor    eax, [edi+16]      ; eax ^= k[i]
    stosd                     ; s[i] = eax
    loop   xor_key
    popad

AddRoundConstant

There are various cryptographic attacks possible against AES without this small but important step. It protects against the slide attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants used to generate the round keys, all the round keys would be the same.

; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
    xor    [esi+16], al
    call   ebp

ExpandRoundKey

The operation that expands the master key into subkeys for each round of encryption isn't normally inlined. To boost performance, the round keys are precomputed before encryption, since repeating the same computation for every block would only waste CPU cycles. Compacting the AES code into a single call requires inlining the key expansion. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
    pushad
    add    esi,16
    mov    eax, [esi+3*4]    ; w=k[3]
    ror    eax, 8            ; w=R(w,8)
exp_l1:
    call   S                 ; w=S(w)
    ror    eax, 8            ; w=R(w,8);
    loop   exp_l1
    mov    cl, 4
exp_l2:
    xor    [esi], eax        ; k[i]^=w
    lodsd                    ; w=k[i]
    loop   exp_l2
    popad

Combining the steps

An earlier version of the code used separate AddRoundKey, AddRoundConstant and ExpandRoundKey, but since these steps all relate to using and updating the round key, the three steps are combined in order to reduce the number of loops, thus shaving off a few bytes.

; *****************************
; AddRoundKey, AddRoundConstant, ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
    pushad
    xchg   eax, edx
    xchg   esi, edi
    mov    eax, [esi+16+12]  ; w=R(k[3],8);
    ror    eax, 8
xor_key:
    mov    ebx, [esi+16]     ; t=k[i];
    xor    [esi], ebx        ; x[i]^=t;
    movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
    call   S                 ; al=S(al);
    ror    eax, 8            ; w=R(w,8);
    loop   xor_key
; w=R(w,8)^c;
    xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
    mov    cl, 4
exp_key:
    xor    [esi], eax        ; k[i]^=w;
    lodsd                    ; w=k[i];
    loop   exp_key
    popad

ShiftRows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively. Because the order of SubBytes and ShiftRows doesn't matter, they're combined in one loop.

; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
    pushad
    mov    cl, 16
shift_rows:
    lodsb                    ; al = S(s[i])
    call   sub_byte
    push   edx
    mov    ebx, edx          ; ebx = i%4
    and    ebx, 3            ;
    shr    edx, 2            ; (i/4 - ebx) % 4
    sub    edx, ebx          ; 
    and    edx, 3            ; 
    lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
    mov    [edi+ebx], al     ; x[ebx] = al
    pop    edx
    inc    edx
    loop   shift_rows
    popad

MixColumns

The MixColumns transformation, along with ShiftRows, is the main source of diffusion. Each column is treated as a four-term polynomial b(x) = b3*x^3 + b2*x^2 + b1*x + b0, where the coefficients are elements over GF(2^8), and is multiplied modulo x^4 + 1 by the fixed polynomial a(x) = 3x^3 + x^2 + x + 2.

; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    pushad
mix_cols:
    mov    eax, [edi]        ; w0 = x[i];
    mov    ebx, eax          ; w1 = w0;
    ror    eax, 8            ; w0 = R(w0,8);
    mov    edx, eax          ; w2 = w0;
    xor    eax, ebx          ; w0^= w1;
    call   ebp               ; w0 = M(w0);
    xor    eax, edx          ; w0^= w2;
    ror    ebx, 16           ; w1 = R(w1,16);
    xor    eax, ebx          ; w0^= w1;
    ror    ebx, 8            ; w1 = R(w1,8);
    xor    eax, ebx          ; w0^= w1;
    stosd                    ; x[i] = w0;
    loop   mix_cols
    popad
    jmp    enc_main

Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

As you can see, blocks of identical plaintext encrypted under the same key result in identical ciphertext; this is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption that uses Counter (CTR) mode to provide confidentiality. The concept of CTR mode, which turns a block cipher into a stream cipher, was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography. CTR mode works by encrypting a nonce and counter, then XORing the resulting keystream with the plaintext. Since AES encrypts 16-byte blocks, the counter can be 8 bytes and the nonce 8 bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

// encrypt using Counter (CTR) mode
void encrypt(W l, B*c, B*p, B*k){
    W i,r;
    B t[32];

    // copy master key to local buffer
    F(16)t[i+16]=k[i];

    while(l){
      // copy counter+nonce to local buffer
      F(16)t[i]=c[i];
      
      // encrypt t
      E(t);
      
      // XOR plaintext with ciphertext
      r=l>16?16:l;
      F(r)p[i]^=t[i];
      
      // update length + position
      l-=r;p+=r;
      
      // update counter
      for(i=16;i>0;i--)
        if(++c[i-1])break;
    }
}

In assembly.

; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
    pushad
    lea    esi,[esp+32+4]
    lodsd
    xchg   eax, ecx          ; ecx = len
    lodsd
    xchg   eax, ebp          ; ebp = ctr
    lodsd
    xchg   eax, edx          ; edx = in
    lodsd
    xchg   esi, eax          ; esi = key
    pushad                   ; alloca(32)
; copy master key to local buffer
; F(16)t[i+16]=key[i];
    lea    edi, [esp+16]     ; edi = &t[16]
    movsd
    movsd
    movsd
    movsd
aes_l0:
    xor    eax, eax
    jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
    mov    edi, esp          ; edi = t
    mov    esi, ebp          ; esi = ctr
    push   edi
    movsd
    movsd
    movsd
    movsd
; encrypt t    
    call   _E                ; E(t)
    pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
    mov    bl, [edi+eax]     ; 
    xor    [edx], bl         ; *in++^=t[i];
    inc    edx               ; 
    inc    eax               ; i++
    cmp    al, 16            ;
    loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
    xchg   eax, ecx          ; 
    mov    cl, 16
aes_l2:
    inc    byte[ebp+ecx-1]   ;
    loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
    xchg   eax, ecx
    jmp    aes_l0
aes_l3:
    popad
    popad
    ret

Full source code

The following covers AES-128 and AES-256 on 8-bit, 32-bit, and 64-bit architectures. A table lookup for the S-box is enabled by default, because the DYNAMIC method is incredibly slow.

// Multiplication over GF(2**8)
#if AES_INT_LEN == 1
  #define M(x)(((x)<<1)^((-((x)>>7))&0x1b))
#else
  u32 M(u32 x) {
      u32 t=x&0x80808080;
      return((x^t)<<1)^((t>>7)*0x1b);
  }
#endif
// the sbox array is used by default for optimal speed
#ifndef DYNAMIC
  u8 sbox[256]=
  {0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 
   0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
   0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 
   0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
   0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 
   0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
   0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 
   0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
   0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 
   0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
   0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 
   0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
   0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 
   0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
   0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 
   0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
   0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 
   0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
   0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 
   0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
   0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 
   0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
   0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 
   0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
   0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 
   0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
   0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 
   0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
   0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 
   0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
   0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 
   0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };
  #define S(x) sbox[x]
#else
  // SubByte
  u8 S(u8 x) {
      u8 i,y,c;
      if(x) {
        for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
        x=y;
        for(i=0;i<4;i++) {
          x^=y=(y<<1)|(y>>7);
        }
      }
      return x^99;
  }
#endif

#if AES_INT_LEN == 1
  // 128-bit version for 8-bit architectures

  void aes_ecb(void *mk, void *data) {
      u8 a,b,c,d,i,j,t,x[AES_BLK_LEN],
        k[AES_KEY_LEN],rc=1,*s=(u8*)data;
      
      // copy 128-bit plain text + 128-bit master key to x
      for(i=0;i<AES_BLK_LEN;i++) {
        x[i]=s[i], k[i]=((u8*)mk)[i];
      }

      for(;;) {
        // AddRoundKey
        for(i=0;i<AES_BLK_LEN;i++) {
          s[i]=x[i]^k[i];
        }
        // if round 11, stop
        if(rc==108)break;
        // AddConstant
        k[0]^=rc; rc=M(rc);
        // ExpandKey
        for(i=0;i<4;i++) {
          k[i]^=S(k[12+((i-3)&3)]);
        }
        for(i=0;i<12;i++) {
          k[i+4]^=k[i];
        }
        // SubBytes and ShiftRows
        for(i=0;i<AES_BLK_LEN;i++) {
          x[(i&3)+((((u32)(i>>2)-(i&3))&3)<<2)]=S(s[i]);
        }
        // if not round 11
        if(rc!=108) {
          // MixColumns
          for(i=0;i<AES_BLK_LEN;i+=4) {
            a=x[i],b=x[i+1],c=x[i+2],d=x[i+3];
            for(j=0;j<4;j++) {
              x[i+j]^=a^b^c^d^M(a^b);
              t=a,a=b,b=c,c=d,d=t;
            }
          }
        }
      }
  }
#else
  // 32-bit or 64-bit versions

  #if AES_KEY_LEN == 32

    void aes_ecb(void *mk, void *data) {
        u32 c=1,i,r=0,w,x[4],k[8], *s=(u32*)data;

        // copy 128-bit plain text
        for(i=0;i<4;i++) {
          x[i] = s[i];
        }
        // copy 256-bit master key
        for(i=0;i<8;i++) {
          k[i] = ((u32*)mk)[i];
        }

        for(;;) {
          // 1st part of ExpandKey
          w=k[r?3:7];
          for(i=0;i<4;i++) {
            w=(w&-256) | S(w&255),w=R(w,8);
          } 
          // AddConstant, update constant
          if(!r)w=R(w,8)^c,c=M(c);
          // AddRoundKey, 2nd part of ExpandKey
          for(i=0;i<4;i++) {
            ((u32*)s)[i]=x[i]^k[r*4+i], w=k[r*4+i]^=w;
          }
          // if round 15, stop
          if(c==27) break;
          r=(r+1)&1;
          // SubBytes and ShiftRows
          for(i=0;i<AES_BLK_LEN;i++) {
            ((u8*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((u8*)s)[i]);
          }
          // if not round 15, MixColumns    
          if((c!=128) | r) {
            for(i=0;i<4;i++) {
              w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
            }
          }
        }
    }
  #else
    void aes_ecb(void *mk, void *data) {
        u32 c=1,i,w,x[4],k[4],*s=(u32*)data;

        // copy 128-bit plain text + 128-bit master key to x
        for(i=0;i<4;i++) {
          x[i]=s[i], k[i]=((u32*)mk)[i];
        }
        for(;;) {
          // 1st part of ExpandKey
          w=k[3];
          for(i=0;i<4;i++) {
            w=(w&-256)|S(w&255), w=R(w,8);
          }
          // AddConstant, AddRoundKey, 2nd part of ExpandKey
          w=R(w, 8)^c;
          for(i=0;i<4;i++) {
            ((u32*)s)[i]=x[i]^k[i], w=k[i]^=w;
          }
          // if round 11, stop
          if(c==108)break; 
          // update constant
          c=M(c);
          // SubBytes and ShiftRows
          for(i=0;i<AES_BLK_LEN;i++) {
            ((u8*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((u8*)s)[i]);
          }
          // if not round 11, MixColumns
          if(c!=108) {
            for(i=0;i<4;i++) {
              w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
            }
          }
        }
    }
  #endif
#endif

#ifdef CTR
  // encrypt using Counter (CTR) mode
  void aes_ctr(u32 len, void *ctr, void *data, void *mk) {
      u8 i, r, t[AES_BLK_LEN], *p=data, *c=ctr;
      
      while(len) {
        // copy counter+nonce to local buffer
        for(i=0;i<AES_BLK_LEN;i++)t[i] = c[i];
        // encrypt t
        aes_ecb(mk, t);
        // XOR plaintext with ciphertext
        r = len > AES_BLK_LEN ? AES_BLK_LEN : len;
        for(i=0;i<r;i++) p[i] ^= t[i];
        // update length + position
        len -= r; p += r;
        // update counter.
        for(i=AES_BLK_LEN;i!=0;i--)
          if(++c[i-1]) break;
      }
  }
#endif

Summary

The final assembly code for ECB mode is 205 bytes, and 272 for CTR mode. You can find sources here.


SHAKE-128 Stream Cipher

Introduction

An Extendable-Output Function (XOF) absorbs input of variable length and generates output of variable length. SHAKE128 and SHAKE256 are two such XOFs designed by the Keccak team and included in the SHA-3 standard. They both use a sponge construction and the Keccak (pronounced “ketchak”) permutation function to generate ciphertext streams. The suffixes “128” and “256” indicate the security strength of these functions when the output is of sufficient length. XOFs can be used as a message digest algorithm, a non-reseedable random number generator, for key derivation, or as a stream cipher. Here, you'll see SHAKE128 used as a simple stream cipher. It doesn't support an Initialization Vector (IV) or nonce, nor does it include authentication. It simply absorbs a 128-bit key and switches to XOF mode before using the output to encrypt data with an exclusive OR operation.

Compact code

This function can be used to initialize a context and encrypt or decrypt a stream of bytes. P is the KECCAK-f[1600] permutation function that is also used by SHA-3. This code is only provided as an example of how one might use SHAKE128 as a stream cipher, and shouldn't be used for any mission-critical application.

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

typedef struct _shake_ctx {
    int i;
    union {
      B b[200];
      W q[25];
    } s;
}shake_ctx;

void P(shake_ctx*ctx) {
    W n,i,j,r,x,y,t,Y,b[5],*s=ctx->s.q;
    B c=1;

    F(n,24){
      F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
      F(i,5){
        t=b[(i+4)%5]^R(b[(i+1)%5],63);
        F(j,5)s[i+j*5]^=t;}
      t=s[1],y=r=0,x=1;
      F(j,24)
        r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
        Y=s[x+y*5],s[x+y*5]=R(t,64-r%64),t=Y;
      F(j,5){
        F(i,5)b[i]=s[i+j*5];
        F(i,5)
          s[i+j*5]=b[i]^(b[(i+2)%5]&~b[(i+1)%5]);}
      F(j,7)
        if((c=(c<<1)^((c>>7)*113))&2)
          *s^=1ULL<<((1<<j)-1);
    }
}

#define RT ((1600 - 256) / 8)

void shake128(W len,void*in,shake_ctx*c) {
    W i;

    if(len) {
      F(i,len) {
        if(c->i == RT) {
          P(c); c->i = 0;
        }
        ((B*)in)[i] ^= c->s.b[c->i++];
      }
    } else {
      F(i,25)c->s.q[i]=0; c->i=0;
      F(i,16)c->s.b[i]^=((B*)in)[i];
      c->s.b[16]  ^=0x1F;
      c->s.b[RT-1]^=0x80;
      P(c);
    }
}

Sources here.


SIMECK Block Cipher

Introduction

The Simeck family of lightweight block ciphers, designed by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong, was published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK in order to devise more compact and efficient block ciphers. There are three variants:

  • Simeck32/64
  • Simeck48/96
  • Simeck64/128

Only the 64/128 variant of encryption is implemented here, because it has the most potential to be used in applications. Reference implementations of all variants by Bo Zhu can be found here.

Compact code

The following combines key scheduling and encryption in one function. mk should point to a 128-bit master key, while p should point to a 64-bit block of plaintext to encrypt. The value 0x938BCA3083F used for the key schedule is exactly 44 bits, and because there are 44 rounds of encryption, it also serves as the loop counter.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
  unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
  unsigned long long s=0x938BCA3083F;

  k0=k[0];k1=k[1];k2=k[2];k3=k[3]; 
  r=x[0];l=x[1];

  do{
    r^=R(l,1)^(R(l,5)&l)^k0;
    X(l,r);
    t=(s&1)-4;
    k0^=R(k1,1)^(R(k1,5)&k1)^t;    
    X(k0,k1);X(k1,k2);X(k2,k3);
  } while(s>>=1);
  x[0]=r; x[1]=l;
}

x86 assembly

; -----------------------------------------------
; SIMECK64/128 Block Cipher in x86 assembly
;
; size: 97 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------
    bits 32

    %ifndef BIN
      global _simeck
    %endif

struc pushad_t
  _edi resd 1
  _esi resd 1
  _ebp resd 1
  _esp resd 1
  _ebx resd 1
  _edx resd 1
  _ecx resd 1
  _eax resd 1
  .size:
endstruc

%define x0 ebx
%define x1 eax

%define k0 ecx
%define k1 edx
%define k2 edi
%define k3 ebp

%define t0 esi
%define s0 dword[esp+_edi+4]
%define s1 dword[esp+_esi+4]

_simeck:
    pushad
    mov    edi, 0xBCA3083F
    mov    esi, 0x938
    pushad
    mov    esi, [esp+64+4] ; esi=mk
    lodsd
    xchg   eax, k0
    lodsd
    xchg   eax, k1
    lodsd
    xchg   eax, k2
    lodsd
    xchg   eax, k3
    mov    esi, [esp+64+8] ; esi=x
    push   esi
    lodsd
    xchg   eax, x0
    lodsd
sm_l0:
    xor    x0, k0  ; x[0]^=k[0];
    mov    t0, x1  ; x[0]^=R(x[1],1);
    rol    t0, 1   ;
    xor    x0, t0  ;
    rol    t0, 4   ; x[0]^=(R(x[1],5)&x[1]);
    and    t0, x1  ;
    xor    x0, t0  ;

    xchg   x0, x1  ; X(x[0],x[1]);

    ; t0 = (s & 1) - 4;
    xor    t0, t0
    shr    s1, 1
    rcr    s0, 1
    adc    t0, -4

    xor    k0, t0  ; k[0]^=t0;
    mov    t0, k1  ; k[0]^=R(k[1],1);
    rol    t0, 1   ;
    xor    k0, t0  ;
    rol    t0, 4   ; k[0]^=(R(k[1],5)&k[1]);
    and    t0, k1  ;
    xor    k0, t0  ;

    xchg   k0, k1  ; X(k[0],k[1]);
    xchg   k1, k2  ; X(k[1],k[2]);
    xchg   k2, k3  ; X(k[2],k[3]);

    cmp    s0, 0
    jnz    sm_l0

    pop    edi
    xchg   eax, x0
    stosd          ; x[0]=x0;
    xchg   eax, x0
    stosd          ; x[1]=x1;

    popad
    popad
    ret

ARM assembly

//
  .arm
  .arch armv6
  .text
  .align  2

  .global simeck
  
k  .req r0
p  .req r1

r  .req r2
l  .req r3

k0 .req r4
k1 .req r5
k2 .req r6
k3 .req r7

t0 .req r8
t1 .req r9

sx .req r10
sy .req r11

simeck:
    // save registers
    push   {r0-r12, lr}
    
    // unsigned long long s=0x938BCA3083F;
    ldr    sx, =#0xBCA3083F
    ldr    sy, =#0x938
    
    // k0=k[0]; k1=k[1]; k2=k[2]; k3=k[3];
    ldm    k, {k0,k1,k2,k3}
    
    // r=x[0]; l=x[1];
    ldm    p, {r,l}
sm_l0:
    // r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
    eor    t0, k0, l,ror #31
    and    t1, l, l, ror #27
    eor    t0, t1        
    mov    t1, l         
    eor    l, r, t0       
    mov    r, t1     

    // t1 = (s & 1) - 4;
    and    t1, sx, #1
    sub    t1, #4
  
    // k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
    // X(k0,k1); X(k1,k2); X(k2,k3);
    eor    k0, k0, k1, ror #31
    and    t0, k1, k1, ror #27
    eor    t0, k0
    mov    k0, k1
    mov    k1, k2
    mov    k2, k3
    eor    k3, t0, t1
    
    // s >>= 1
    movs   sy, sy, lsr #1
    movs   sx, sx, rrx
    bne    sm_l0
    
    // x[0]=r; x[1]=l;
    stm    p, {r,l}    
    // restore registers, and return
    pop    {r0-r12, pc}

ARM64 / AArch64 assembly

// SIMECK in ARM64 assembly
// 100 bytes

    .arch armv8-a
    .text
    .global simeck
  
simeck:
     // unsigned long long s = 0x938BCA3083F;
     movz    x2, 0x083F
     movk    x2, 0xBCA3, lsl 16
     movk    x2, 0x0938, lsl 32 
 
     // load 128-bit key 
     ldp     w3, w4, [x0]
     ldp     w5, w6, [x0, 8]
 
     // load 64-bit plaintext 
     ldp     w8, w7, [x1]
L0:
     // r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
     eor     w9, w3, w7, ror 31
     and     w10, w7, w7, ror 27
     eor     w9, w9, w10        
     mov     w10, w7         
     eor     w7, w8, w9      
     mov     w8, w10     

     // t1 = (s & 1) - 4;
     // k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
     // X(k0,k1); X(k1,k2); X(k2,k3);
     eor     w3, w3, w4, ror 31
     and     w9, w4, w4, ror 27
     eor     w9, w9, w3 
     mov     w3, w4 
     mov     w4, w5 
     mov     w5, w6 
     and     x10, x2, 1
     sub     x10, x10, 4
     eor     w6, w9, w10 
    
     // s >>= 1
     lsr     x2, x2, 1 
     cbnz    x2, L0
 
     // save 64-bit ciphertext 
     stp     w8, w7, [x1]
     ret 

Sources here.


Xoodoo Permutation Function

Introduction

Xoodoo is a cryptographic permutation function designed by the Keccak team along with Seth Hoffert and Johan De Meulder. It was described by Joan Daemen at the 2017 Workshop on Elliptic Curve Cryptography in the talk Innovations in permutation-based crypto. The design is very similar to the Keccak permutation function, but also draws inspiration from Gimli, which likewise uses a 384-bit state and works efficiently on many platforms. The C code shown here is derived from a Python implementation. The Keccak team has posted its own implementations here, with further documentation in the Xoodoo cookbook.

C code

When calling this function, the p parameter should point to a 48-byte buffer.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}
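Since the permutation takes no key or tweak, the simplest smoke test is to run it twice on identical buffers and confirm the outputs agree and differ from the input. A minimal sketch reusing the C code above (the check function name is mine):

```c
#include <string.h>

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

// same permutation as above
void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}

// permute two identical zeroed 48-byte states and compare:
// the results must match each other and differ from the input
int xoodoo_check(void) {
  W a[12] = {0}, b[12] = {0}, zero[12] = {0};
  xoodoo(a); xoodoo(b);
  return memcmp(a, b, 48) == 0 && memcmp(a, zero, 48) != 0;
}
```

The iota step XORs a round constant into s[0] each round, so a zeroed state cannot stay zero.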

x86 Assembly

%define x0 eax
%define x1 ebx
%define x2 edx

xoodoo:
_xoodoo:
    pushad
    mov    esi, [esp+32+4]   ; esi = state
    xor    ecx, ecx          ; ecx = 0
    mul    ecx               ; eax = 0, edx = 0
    pushad                   ; allocate 32 bytes of memory
    mov    edi, esp          ; edi = e
    call   ld_rc
    dw     0x58,  0x38, 0x3c0, 0xd0
    dw     0x120, 0x14,  0x60, 0x2c
    dw     0x380, 0xf0, 0x1a0, 0x12
ld_rc:
    pop    ebx                  ; ebx = rc
xoo_main:
    pushad                      ; save all
    movzx  ebp, word[ebx+eax*2] ; ebp = rc[r]
    mov    cl, 4                ; i = 4
    pushad                      ; save all
    ;
    ; Theta
    ;
    ; e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    ; e[i]^= ROTR32(e[i], 9);
xd_l1:
    lodsd                    ; eax  = x[i]
    xor    eax, [esi+16-4]   ; eax ^= x[i+4]
    xor    eax, [esi+32-4]   ; eax ^= x[i+8]
    ror    eax, 18
    mov    ebx, eax
    ror    ebx, 9
    xor    eax, ebx
    stosd                    ; e[i] = eax
    loop   xd_l1
    popad                    ; restore all
    ; x[i]^= e[(i - 1) & 3];
    dec    edx               ; edx = -1
    mov    cl, 12
xd_l2:
    mov    eax, edx          ; eax = edx & 3
    and    eax, 3
    mov    eax, [edi+eax*4]  ; eax = e[(i - 1) & 3]
    inc    edx               ; i++
    xor    [esi+edx*4], eax  ; x[i] ^= eax
    loop   xd_l2

    ; XCHG(x[7], x[4]);
    mov    eax, [esi+7*4]
    xchg   eax, [esi+4*4]

    ; XCHG(x[7], x[5]);
    xchg   eax, [esi+5*4]

    ; XCHG(x[7], x[6]);
    xchg   eax, [esi+6*4]
    mov    [esi+7*4], eax

    ; x[0] ^= rc[r];
    xor    [esi], ebp
    mov    cl, 4
    pushad
xd_l6:
    ; x0 = x[i+0];
    lodsd

    ; x1 = x[i+4];
    mov    x1, [esi+16-4]

    ; x2 = ROTR32(x[i+8], 21);
    mov    x2, [esi+32-4]
    ror    x2, 21

    ; x[i+8] = ROTR32((~x0 & x1) ^ x2, 24);
    not    x0
    and    x0, x1
    xor    x0, x2
    ror    x0, 24
    mov    [esi+32-4], x0

    ; x[i+4] = ROTR32((~x2 & x0) ^ x1, 31);
    push   x2
    not    x2
    and    x2, [esi-4]
    xor    x2, x1
    ror    x2, 31
    mov    [esi+16-4], x2
    pop    x2

    ; x[i+0] ^= ~x1 & x2;
    not    x1
    and    x1, x2
    xor    [esi-4], x1
    loop   xd_l6
    popad

    ; XCHG(x[8], x[10]);
    ; XCHG(x[9], x[11]);
    mov    eax, [esi+8*4]
    mov    ebp, [esi+9*4]

    xchg   eax, [esi+10*4]
    xchg   ebp, [esi+11*4]

    mov    [esi+8*4], eax
    mov    [esi+9*4], ebp

    popad

    ; r++
    inc    eax
    cmp    al, 12
    jnz    xoo_main

    ; release memory
    popad
    ; restore registers
    popad
    ret

ARM assembly

ARM provides the BIC instruction, which is the equivalent of (x & ~y) in C. It's used here in the combined Chi and Rho east loop.

.arm
  .arch armv6
  .text
  .align  2

  .global xoodoo

state .req r0
x     .req r1

r     .req r2
i     .req r3

x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7

rc    .req r8
xt    .req r9

e     .req sp

xoodoo:
    // save registers
    push   {r0-r12, lr}

    mov    r, #12              // 12 rounds
    sub    sp, #16             // allocate 16 bytes
    adr    rc, rc_tab
xoodoo_main:
    mov    i, #0               // i = 0
    mov    x, state
theta_l0:
    ldr    x2, [x, #32]        // x2 = x[i+8]
    ldr    x1, [x, #16]        // x1 = x[i+4]
    ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

    // e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    eor    x0, x1
    eor    x0, x2
    ror    x0, #18

    // e[i]^= ROTR32(e[i], 9);
    eor    x0, x0, ror #9
    str    x0, [sp, i, lsl #2]  // store in e

    add    i, #1               // i++
    cmp    i, #4               // i<4
    bne    theta_l0            //

    // x[i]^= e[(i - 1) & 3];
    mov    i, #0               // i = 0
    mov    x, state            // x = state
theta_l1:
    sub    xt, i, #1
    and    xt, #3               // xt = i & 3
    ldr    xt, [sp, xt, lsl #2] // xt = e[(i - 1) & 3]
    ldr    x0, [x, i, lsl #2]   // x0 = x[i]
    eor    x0, xt               // x0 ^= xt
    str    x0, [x, i, lsl #2]   // x[i] = x0
    add    i, #1                // i++
    cmp    i, #12               // i<12
    bne    theta_l1

    // Rho west
    // XCHG(x[7], x[4]);
    // XCHG(x[7], x[5]);
    // XCHG(x[7], x[6]);
    add    x, state, #16       // x = &state[4]
    ldm    x, {x0, x1, x2, x3}
    mov    xt, x0              // xt = x[4]
    mov    x0, x3              // x[4] = x[7]
    mov    x3, x2              // x[7] = x[6]
    mov    x2, x1              // x[6] = x[5]
    mov    x1, xt              // x[5] = xt
    stm    x, {x0, x1, x2, x3}

    mov    x, state

    // Iota
    // x[0] ^= rc[r];
    ldrh   xt, [rc], #2        // load half-word, advance by 2
    ldr    x0, [x]             // load word
    eor    x0, xt              // xor
    str    x0, [x]             // store word

    mov    i, #4
chi:
    // Chi and Rho east
    // x0 = x[i+0];
    ldr    x0, [x]

    // x1 = x[i+4];
    ldr    x1, [x, #16]

    // x2 = ROTR32(x[i+8], 21);
    ldr    x2, [x, #32]
    ror    x2, #21

    // x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
    bic    xt, x1, x0
    eor    xt, x2
    ror    xt, #24
    str    xt, [x, #32]

    // x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
    bic    xt, x0, x2
    eor    xt, x1
    ror    xt, #31
    str    xt, [x, #16]

    // x[i+0]^= x2 & ~x1;
    bic    xt, x2, x1
    eor    xt, x0
    str    xt, [x], #4

    // i--
    subs   i, #1
    bne    chi

    add    x, state, #32       // x = &state[8]

    // XCHG(x[8], x[10]);
    ldm    x, {x0, x1, x2, x3}
    push   {x0}
    mov    x0, x2
    pop    {x2}

    // XCHG(x[9], x[11]);
    push   {x1}
    mov    x1, x3
    pop    {x3}
    stm    x, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    xoodoo_main         // r>0

    // release stack
    add    sp, #16

    // restore registers, and return
    pop    {r0-r12, pc}

    // round constants
rc_tab:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

ARM64 assembly

// Xoodoo in ARM64 assembly
// 268 bytes

    .arch armv8-a
    .text

    .global xoodoo

xoodoo:
    sub    sp, sp, 16          // allocate 16 bytes
    adr    x8, rc
    mov    w9, 12               // 12 rounds
L0:
    mov    w7, 0                // i = 0
    mov    x1, x0
L1:
    ldr    w4, [x1, 32]         // w4 = s[i+8]
    ldr    w3, [x1, 16]         // w3 = s[i+4]
    ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

    // e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
    eor    w2, w2, w3 
    eor    w2, w2, w4 
    ror    w2, w2, 18

    // e[i] ^= R(e[i], 9);
    eor    w2, w2, w2, ror 9
    str    w2, [sp, x7, lsl 2]  // store in e

    add    w7, w7, 1            // i++
    cmp    w7, 4                // i < 4
    bne    L1                   //

    // s[i]^= e[(i - 1) & 3];
    mov    w7, 0                // i = 0
L2:
    sub    w2, w7, 1
    and    w2, w2, 3            // w2 = i & 3
    ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
    ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
    eor    w3, w3, w2           // w3 ^= w2 
    str    w3, [x0, x7, lsl 2]  // s[i] = w3 
    add    w7, w7, 1            // i++
    cmp    w7, 12               // i < 12
    bne    L2 

    // Rho west
    // X(s[7], s[4]);
    // X(s[7], s[5]);
    // X(s[7], s[6]);
    ldp    w2, w3, [x0, 16]
    ldp    w4, w5, [x0, 24]
    stp    w5, w2, [x0, 16]
    stp    w3, w4, [x0, 24]

    // Iota
    // s[0] ^= *rc++;
    ldrh   w2, [x8], 2         // load half-word, advance by 2
    ldr    w3, [x0]            // load word
    eor    w3, w3, w2          // xor
    str    w3, [x0]            // store word

    mov    w7, 4
    mov    x1, x0
L3:
    // Chi and Rho east
    // a = s[i+0];
    ldr    w2, [x1]

    // b = s[i+4];
    ldr    w3, [x1, 16]

    // c = R(s[i+8], 21);
    ldr    w4, [x1, 32]
    ror    w4, w4, 21

    // s[i+8] = R((b & ~a) ^ c, 24);
    bic    w5, w3, w2 
    eor    w5, w5, w4 
    ror    w5, w5, 24
    str    w5, [x1, 32]

    // s[i+4] = R((a & ~c) ^ b, 31);
    bic    w5, w2, w4 
    eor    w5, w5, w3 
    ror    w5, w5, 31
    str    w5, [x1, 16]

    // s[i+0]^= c & ~b;
    bic    w5, w4, w3 
    eor    w5, w5, w2 
    str    w5, [x1], 4

    // i--
    subs   w7, w7, 1
    bne    L3 

    // X(s[8], s[10]);
    // X(s[9], s[11]);
    ldp    w2, w3, [x0, 32] // 8, 9
    ldp    w4, w5, [x0, 40] // 10, 11
    stp    w2, w3, [x0, 40]
    stp    w4, w5, [x0, 32]

    subs   w9, w9, 1           // r--
    bne    L0                  // r != 0

    // release stack
    add    sp, sp, 16
    ret
    // round constants
rc:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

Summary

Xoodoo with 12 rounds and a 384-bit state is much more compact than Keccak with 22 rounds and an 800-bit state. Presumably Keccak is more secure, but it cannot be vectorized to perform as well as Xoodoo or Gimli. Sources here.


SHACAL-2 Block Cipher

Introduction

SHACAL-1 and SHACAL-2 are block ciphers based on the Secure Hash Standard. They were proposed by Helena Handschuh and David Naccache while both were working for the now defunct smartcard company Gemplus International. SHACAL-1 is an 80-round block cipher built atop the SHA-1 compression function. It supports a 160-bit block with key sizes between 128 and 512 bits. SHACAL-2 is a 64-round block cipher built atop the SHA-256 compression function and supports a larger 256-bit block. Both were submitted to the New European Schemes for Signatures, Integrity and Encryption (NESSIE) project. SHACAL-1 was dropped from the portfolio in 2003 due to concerns about its key schedule, but SHACAL-2 was finally selected. The best known attack against SHACAL-2 is a related-key rectangle attack on 44 of the 64 rounds.

Macros and data types

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)

#define rev32(x) __builtin_bswap32(x)

typedef unsigned long long Q;
typedef unsigned int W;
typedef unsigned char B;

typedef struct _sha256_ctx {
    W s[8];
    union {
      B b[64];
      W w[16];
      Q q[8];
    }x;
    Q len;
}sha256_ctx;

Compact code

The following is not optimized for performance. It's a simplified version intended to be used as a reference. The mk parameter should point to a 512-bit buffer holding the key (the code always reads 16 words, so shorter keys should be zero-padded), while data should point to a 256-bit block of plaintext to encrypt.

#define CH(x,y,z)(((x)&(y))^(~(x)&(z)))
#define MAJ(x,y,z)(((x)&(y))^((x)&(z))^((y)&(z)))
#define EP0(x)(R(x,2)^R(x,13)^R(x,22))
#define EP1(x)(R(x,6)^R(x,11)^R(x,25))
#define SIG0(x)(R(x,7)^R(x,18)^((x)>>3))
#define SIG1(x)(R(x,17)^R(x,19)^((x)>>10))

void shacal2(void*mk,void*data) {
    W t1,t2,i,w[64],s[8],*k=mk,*x=data;
    
    W K[64]=
    { 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 
      0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
      0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 
      0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
      0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 
      0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
      0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 
      0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
      0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 
      0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
      0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 
      0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
      0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 
      0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
      0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 
      0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 }; 
    
    // load master key
    F(16)w[i]=rev32(k[i]);
    // expand key
    for(i=16;i<64;i++)
      w[i]=SIG1(w[i-2])+w[i-7]+SIG0(w[i-15])+w[i-16];
    // load 256-bit plaintext
    F(8)s[i]=rev32(x[i]);
    // encrypt
    F(64) {
      t1=s[7]+EP1(s[4])+CH(s[4],s[5],s[6])+w[i]+K[i];
      t2=EP0(s[0])+MAJ(s[0],s[1],s[2]);
      s[7]=s[6],s[6]=s[5],s[5]=s[4],s[4]=s[3]+t1;
      s[3]=s[2],s[2]=s[1],s[1]=s[0],s[0]=t1+t2;
    }
    // store 256-bit ciphertext
    F(8)x[i]=rev32(s[i]);
}
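Only encryption is shown above, but each round is easy to invert: six of the eight state words are just rotated back, t2 can be recomputed from them, t1 follows by subtraction from s[0], and the remaining two words fall out. A decryption sketch (byte-swapping omitted for brevity, so it pairs with a matching no-swap encryption; the function names expand, shacal2_enc, and shacal2_dec are mine):

```c
#include <string.h>

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define CH(x,y,z)(((x)&(y))^(~(x)&(z)))
#define MAJ(x,y,z)(((x)&(y))^((x)&(z))^((y)&(z)))
#define EP0(x)(R(x,2)^R(x,13)^R(x,22))
#define EP1(x)(R(x,6)^R(x,11)^R(x,25))
#define SIG0(x)(R(x,7)^R(x,18)^((x)>>3))
#define SIG1(x)(R(x,17)^R(x,19)^((x)>>10))
typedef unsigned int W;

static const W K[64] =
{ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
  0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
  0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
  0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
  0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
  0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
  0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
  0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
  0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
  0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
  0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
  0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
  0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
  0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
  0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
  0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 };

// expand the 512-bit master key into 64 round keys (no byte swap)
static void expand(void *mk, W *w) {
    W i, *k=mk;
    F(16)w[i]=k[i];
    for(i=16;i<64;i++)
      w[i]=SIG1(w[i-2])+w[i-7]+SIG0(w[i-15])+w[i-16];
}

// encryption without the byte swaps, for the round-trip check
void shacal2_enc(void *mk, void *data) {
    W t1,t2,i,w[64],s[8],*x=data;
    expand(mk, w);
    F(8)s[i]=x[i];
    F(64) {
      t1=s[7]+EP1(s[4])+CH(s[4],s[5],s[6])+w[i]+K[i];
      t2=EP0(s[0])+MAJ(s[0],s[1],s[2]);
      s[7]=s[6],s[6]=s[5],s[5]=s[4],s[4]=s[3]+t1;
      s[3]=s[2],s[2]=s[1],s[1]=s[0],s[0]=t1+t2;
    }
    F(8)x[i]=s[i];
}

// invert the 64 rounds using the same expanded key
void shacal2_dec(void *mk, void *data) {
    W t1,t2,a,b,c,d,e,f,g,h,i,w[64],s[8],*x=data;
    int r;
    expand(mk, w);
    F(8)s[i]=x[i];
    for (r=63; r>=0; r--) {
      a=s[1]; b=s[2]; c=s[3];           // previous s[0..2]
      e=s[5]; f=s[6]; g=s[7];           // previous s[4..6]
      t2=EP0(a)+MAJ(a,b,c);             // t2 depends only on these
      t1=s[0]-t2;                       // s[0] was t1+t2
      d=s[4]-t1;                        // s[4] was s[3]+t1
      h=t1-EP1(e)-CH(e,f,g)-w[r]-K[r];  // solve t1's sum for s[7]
      s[0]=a;s[1]=b;s[2]=c;s[3]=d;
      s[4]=e;s[5]=f;s[6]=g;s[7]=h;
    }
    F(8)x[i]=s[i];
}
```

Encrypting a block and then decrypting it with the same key should return the original plaintext.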

Sources here.


XTEA Block Cipher

Introduction

TEA Extensions (XTEA) is a 64-bit block cipher with support for 128-bit keys. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA), which was discussed previously in this post. Compared to its predecessor, XTEA has a more complex key schedule and a rearrangement of shifts, XORs, and additions. Only encryption is implemented here, which can then be used in counter (CTR) mode to create a stream cipher. XTEA is not considered a secure block cipher and shouldn't be considered for any project that requires a high level of security. Moreover, the following code is not optimized for performance; it's a simplified version intended to be used only as a reference.

Encryption

This version uses 32 rounds by default. The loop counter i is initialized to 64 because each iteration performs half a round: the two 32-bit halves are swapped every time, and an if statement applies the key-schedule update on every second iteration. mk should point to a 128-bit key while data should point to a 64-bit block of plaintext to encrypt.

// uses 32 rounds by default
void xtea(void *key, void *data) {
    int      i;
    uint32_t x0, x1, t, sum=0;
    
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;
    
    // load 64-bit plain text
    x0 = x[0]; x1 = x[1];
    
    for (i=64; i>0; i--) {
      t = sum;
      // add constant every 2nd round
      if (i & 1) {
        sum += 0x9E3779B9;
        t = sum >> 11;          
      }
      x0 += ((((x1 << 4) ^ 
        (x1 >> 5)) + x1) ^ 
        (sum + k[t & 3]));
         
      XCHG(x0, x1);
    }
    // save 64-bit cipher text
    x[0] = x0; x[1] = x1;
}
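As mentioned above, the encryption routine alone is enough to build a CTR-mode stream cipher: encrypt an incrementing counter block and XOR the keystream with the data, with decryption being the exact same call. A sketch (the XCHG macro definition and the xtea_ctr wrapper are my assumptions, not part of the original source):

```c
#include <stdint.h>
#include <string.h>

// assumed definition of the XCHG macro used by the reference code
#define XCHG(a,b) (t=(a),(a)=(b),(b)=(t))

// same 32-round encryption routine as above
void xtea(void *key, void *data) {
    int      i;
    uint32_t x0, x1, t, sum=0;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    x0 = x[0]; x1 = x[1];
    for (i=64; i>0; i--) {
      t = sum;
      if (i & 1) {
        sum += 0x9E3779B9;
        t = sum >> 11;
      }
      x0 += ((((x1 << 4) ^ (x1 >> 5)) + x1) ^ (sum + k[t & 3]));
      XCHG(x0, x1);
    }
    x[0] = x0; x[1] = x1;
}

// CTR mode: encrypt an incrementing counter (host byte order, fine
// for a sketch) and XOR the result with the data; encryption and
// decryption are the same operation
void xtea_ctr(void *key, uint64_t ctr, uint8_t *buf, size_t len) {
    uint8_t s[8];
    size_t  i, r;

    while (len) {
      memcpy(s, &ctr, 8);       // counter is the "plaintext" block
      xtea(key, s);             // generate 8 bytes of keystream
      r = len > 8 ? 8 : len;
      for (i = 0; i < r; i++) buf[i] ^= s[i];
      buf += r; len -= r; ctr++;
    }
}
```

Calling xtea_ctr twice with the same key and starting counter restores the original data.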

x86 assembly

bits 32
        
    %define x0  eax
    %define x1  ebx    

    %define t0  ebp
    %define t1  esi
    
    %define k   edi
    
    %define sum edx
    
xtea:    
_xtea:
    pushad
    mov    edi, [esp+32+4]   ; edi = key
    mov    esi, [esp+32+8]   ; esi = data
    push   64
    pop    ecx
    xor    edx, edx          ; sum = 0
    push   esi
    lodsd
    xchg   eax, x1
    lodsd
    xchg   eax, x1
xtea_enc:
    mov    t0, x1            ; t0   = x1 << 4
    shl    t0, 4
    
    mov    t1, x1            ; t1   = x1 >> 5
    shr    t1, 5    
    
    xor    t0, t1            ; t0  ^= t1
    add    t0, x1            ; t0  += x1;
    
    mov    t1, sum           ; t1   = sum
    test   cl, 1
    jz     xtea_l1
    
    add    sum, 0x9E3779B9   ; sum += 0x9E3779B9   
    mov    t1, sum     
    shr    t1, 11            ; t1 = sum >> 11  
xtea_l1:    
    and    t1, 3             ; t1  &= 3
    mov    t1, [k+4*t1]      ; t1 = sum + k[t1]
    add    t1, sum
    
    xor    t0, t1            ; t0 ^= t1
    
    add    x0, t0            ; x0 += t0
    xchg   x0, x1            ; XCHG(x0, x1); 
    loop   xtea_enc    
    
    pop    edi               ; edi = x
    stosd                    ; x[0] = x0
    xchg   eax, x1
    stosd                    ; x[1] = x1
    popad
    ret

ARM / AArch32

ARM allows conditional execution of most instructions, which lets us eliminate the branch used in the x86 version.

    .arch armv7-a
    .text
    
    .equ ROUNDS, 32
    
    .global xtea

    // xtea(void*mk, void*data);
xtea:
    // save registers
    push   {r2-r8, lr}
    
    mov    r7, #(ROUNDS * 2)

    // load 64-bit plaintext
    ldm    r1, {r2, r4}         // x0  = x[0], x1 = x[1];
    mov    r3, #0               // sum = 0;
    ldr    r5, =#0x9E3779B9     // c   = 0x9E3779B9;
L0:
    mov    r6, r3               // t0 = sum;
    tst    r7, #1               // if (i & 1)
  
    addne  r3, r5               // sum += 0x9E3779B9;
    lsrne  r6, r3, #11          // t0 = sum >> 11

    and    r6, #3               // t0 %= 4
    ldr    r6, [r0, r6, lsl #2] // t0 = k[t0];
    add    r8, r3, r6           // t1 = sum + t0
    mov    r6, r4, lsl #4       // t0 = (x1 << 4)
    eor    r6, r4, lsr #5       // t0^= (x1 >> 5)
    add    r6, r4               // t0+= x1
    eor    r6, r8               // t0^= t1
    mov    r8, r4               // backup x1
    add    r4, r6, r2           // x1 = t0 + x0

    // XCHG(x0, x1)
    mov    r2, r8               // x0 = x1
    subs   r7, r7, #1           // i--
    bne    L0                   // i>0
    
    // store 64-bit ciphertext
    stm    r1, {r2, r4}
    pop    {r2-r8, pc}

Because the link register (LR) is saved on entry and restored into the program counter (PC) at the end, execution returns directly to the caller.

ARM64 / AArch64

// XTEA in ARM64 assembly
// 92 bytes

    .arch armv8-a
    .text
 
    .equ ROUNDS, 32
 
    .global xtea

    // xtea(void*mk, void*data);
xtea:
    mov    w7, ROUNDS * 2 

    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
    mov    w3, wzr              // sum = 0;
    ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
    mov    w6, w3               // t0 = sum;
    tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

    // the next 2 only execute if (i % 2) is not zero
    add    w3, w3, w5           // sum += 0x9E3779B9;
    lsr    w6, w3, 11           // t0 = sum >> 11
L1:
    and    w6, w6, 3            // t0 %= 4
    ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
    add    w8, w3, w6           // t1 = sum + t0
     lsl    w6, w4, 4            // t0 = (x1 << 4)
    eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
    add    w6, w6, w4           // t0+= x1
    eor    w6, w6, w8           // t0^= t1
    mov    w8, w4               // backup x1
    add    w4, w6, w2           // x1 = t0 + x0

    // XCHG(x0, x1)
    mov    w2, w8               // x0 = x1
    subs   w7, w7, 1
    bne    L0                   // i > 0
    stp    w2, w4, [x1]
    ret

Summary

It’s safe to assume people searching the internet for a “Tiny Encryption Algorithm” will come across TEA/XTEA, but for a variety of reasons it is no longer suitable for use. The Chaskey block cipher is a good alternative with support for 128-bit blocks.

Sources here.


BlaBla Stream Cipher

Introduction

BlaBla is a 256-bit stream cipher designed by Jean-Philippe Aumasson. It uses the same permutation function as the cryptographic hash algorithm BLAKE2b, which is similar to the permutation function used in the ChaCha stream cipher, hence the name. Frank Denis described it as “…yet another evil experiment from Jean-Philippe Aumasson”. Frank informed me the permutation function used in BlaBla is also used in this project here, which is an Authenticated Encryption with Associated Data (AEAD) algorithm using a Masked Even-Mansour (MEM) construction. More information can be found in Improved Masking for Tweakable Blockciphers with Applications to Authenticated Encryption.

Initialization

ChaCha uses an internal state of 512 bits (64 bytes) and works efficiently on 32-bit CPUs. BlaBla uses an internal state of 1024 bits (128 bytes) and works efficiently on 64-bit CPUs. Nine 64-bit constants are used to initialize the internal state. The first four contain the same 32-bit integers used to initialize the 256-bit key variant of ChaCha, i.e. the string “expand 32-byte k”. The other values were possibly generated randomly; there’s currently no explanation for their origin.

// setup the key/internal state
void blabla_setkey(blabla_ctx *c, const void *key, const void *nonce) {
    int i;
    
    c->q[ 0] = 0x6170786593810fab;
    c->q[ 1] = 0x3320646ec7398aee;
    c->q[ 2] = 0x79622d3217318274;
    c->q[ 3] = 0x6b206574babadada;

    // copy 256-bit key
    F(4) c->q[i+4] = ((W*)key)[i];
    
    c->q[ 8] = 0x2ae36e593e46ad5f;
    c->q[ 9] = 0xb68f143029225fc9;
    c->q[10] = 0x8da1e08468303aa6;
    c->q[11] = 0xa48a209acd50a4a7;
    c->q[12] = 0x7fdc12f23f90778c;
    
    // set 64-bit counter
    c->q[13] = 1; 
    
    // copy 128-bit nonce
    F(2)c->q[i+14]=((W*)nonce)[i];
}

AMD64 assembly

blabla_iv1:
    dq     0x6170786593810fab
    dq     0x3320646ec7398aee
    dq     0x79622d3217318274
    dq     0x6b206574babadada
blabla_iv2:
    dq     0x2ae36e593e46ad5f
    dq     0xb68f143029225fc9
    dq     0x8da1e08468303aa6
    dq     0xa48a209acd50a4a7
    dq     0x7fdc12f23f90778c
    
; void blabla_setkey(blabla_ctx *c, const void *key, const void *nonce) 
blabla_setkey:
    push   rsi                  ; save key
    ; copy 4 IV
    lea    rsi, [rel blabla_iv1]
    push   (4*8)/4
    pop    rcx
    rep    movsd
    pop    rsi                  ; restore key
    ; copy 256-bit key to internal state
    ; F(4) c->q[i+4] = k[i];
    mov    cl, 32/4
    rep    movsd
    ; copy 5 more IV
    lea    rsi, [rel blabla_iv2]
    mov    cl, (5*8)/4
    rep    movsd
    ; set 64-bit counter
    ; c->q[13] = 1; 
    push   1
    pop    rax
    stosq
    ; copy 128-bit nonce to internal state
    ; F(2)c->q[i+14]=n[i];
    push   rdx   
    pop    rsi
    movsq
    movsq
    ret    

Stream Generator

void blabla_stream(blabla_ctx *s, void *out) {
    W a,b,c,d,i,t,r,*x=(W*)out;
    uint16_t v[8]={0xC840,0xD951,0xEA62,0xFB73,
                   0xFA50,0xCB61,0xD872,0xE943};
            
    // store internal state in buffer
    F(16)x[i] = s->q[i];
    
    // permute buffer
    F(80) {
      d=v[i%8];
      a=(d&15);b=(d>>4&15);
      c=(d>>8&15);d>>=12;
      
      for(r=0x3F101820;r;r>>=8)
        x[a]+=x[b],
        x[d]=R(x[d]^x[a],(r&255)),
        X(a,c),X(b,d);
    }
    // add internal state to buffer
    F(16)x[i] += s->q[i];
    // increase counter of internal state
    s->q[13]++;
}
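Each entry of v[] packs the four state indices of one G-function column or diagonal into nibbles (lowest nibble first), and 0x3F101820 packs the BLAKE2b rotation distances 32, 24, 16, and 63 into successive bytes, consumed low byte first. A small sketch decoding both (the helper name is mine):

```c
#include <stdint.h>

// decode one packed v[] entry into its four state indices, and the
// packed rotation constant into its four rotation distances
void blabla_unpack(uint16_t d, uint32_t r, int idx[4], int rot[4]) {
    int i;
    for (i = 0; i < 4; i++) { idx[i] = d & 15;  d >>= 4; }  // nibbles
    for (i = 0; i < 4; i++) { rot[i] = r & 255; r >>= 8; }  // bytes
}
```

Unpacking the first entry, 0xC840, yields the first column (indices 0, 4, 8, 12) and the BLAKE2b rotations 32, 24, 16, 63.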

AMD64 assembly

blabla_v:
    dw     0c840H, 0d951H
    dw     0ea62H, 0fb73H
    dw     0fa50H, 0cb61H
    dw     0d872H, 0e943H
    
; void blabla_stream(blabla_ctx *s, void *out) 
blabla_stream:
    push    rbx
    push    rcx
    push    rsi
    push    rdi
    push    rbp
    
    ; store internal state in buffer
    ; F(16)x[i] = s->q[i];
    push    128/4
    pop     rcx
    xchg    rsi, rdi
    push    rsi               ; save s
    push    rdi               ; save out
    rep     movsd
    pop     rdi               ; restore out
    
    ; permute buffer
    ; F(80) {
    xor     eax, eax
bb_sx0:
    push    rax               ; save i
    ; d=v[i%8];
    and     al, 7 
    lea     rsi, [rel blabla_v]
    movzx   edx, word[rsi+rax*2]
    ; a=(d&15);b=(d>>4&15);
    mov     eax, edx          ; a = d & 15
    and     eax, 15           
    mov     ebx, edx          ; b = d >> 4 & 15
    shr     ebx, 4
    and     ebx, 15
    ; c=(d>>8&15);d>>=12;
    mov     ebp, edx          ; c = d >> 8 & 15
    shr     ebp, 8
    and     ebp, 15
    shr     edx, 12           ; d >>= 12
    
      ; for (r=0x7080C10;r;r>>=8)
    mov    ecx, 0x3F101820 ; load rotation values
bb_sx1:
    ; x[a] += x[b]
    mov    rsi, [rdi+rbx*8]
    add    [rdi+rax*8], rsi
    
    ; x[d] = R(x[d] ^ x[a], (r & 255))
    mov    rsi, [rdi+rdx*8]    
    xor    rsi, [rdi+rax*8]
    ror    rsi, cl
    mov    [rdi+rdx*8], rsi
    
    ; X(a, c); X(b, d);
    xchg   rax, rbp
    xchg   rbx, rdx 
    
    ; r >>= 8
    shr    ecx, 8       ; shift until done 
    jnz    bb_sx1
    
    pop    rax
    inc    al
    cmp    al, 80
    jnz    bb_sx0
    
    ; add internal state to buffer
    ; F(16)x[i] += s->q[i];
    pop     rsi         ; restore state
    push    16
    pop     rcx
bb_sx5:
    lodsq
    add     rax, [rdi]
    stosq
    loop    bb_sx5
    
    ; update 64-bit counter
    ; c->q[13]++;   
    inc     qword[rsi+13*8-128]
    
    pop     rbp
    pop     rdi
    pop     rsi
    pop     rcx
    pop     rbx
    ret

Encryption/Decryption

// encrypt or decrypt stream of len-bytes
void blabla_encrypt(blabla_ctx *ctx, void *buf, size_t len) {
    W i, r;
    B c[128], *p=(B*)buf;
    
    while(len) {
      // generate 128 bytes of keystream
      blabla_stream(ctx, c);
      r = len>128?128:len;
      // xor data with keystream
      F(r) p[i] ^= c[i];
      // decrease total length, advance buffer
      len -= r; p+= r;
    }
}

AMD64 assembly

; void blabla_encrypt(blabla_ctx *ctx, void *buf, size_t len) 
blabla_encrypt:
    push    rbx               ; save rbx
    
    push    rsi               ; rbx = buf
    pop     rbx 
    
    push    rdx               ; rcx = len
    pop     rcx
    
    sub     rsp, 124
    push    rax
    push    rsp               ; rsi = c[128] 
    pop     rsi
bb_e0:
    jrcxz   bb_e3             ; exit if len==0
    ; blabla_stream(ctx, c);
    call    blabla_stream
    xor     eax, eax          ; i = 0
bb_e1:
    mov     dl, byte[rsi+rax] ;
    xor     byte[rbx], dl     ; *p ^= c[i]
    inc     rbx               ; p++
    inc     al                ; i++
    cmp     al, 128           ; i<128
    loopne  bb_e1             ; --len
    jmp     bb_e0
bb_e3:
    pop     rax
    add     rsp, 124
    pop     rbx               ; restore rbx
    ret

Sources here.


HIGHT Block Cipher

Introduction

HIGHT, which stands for HIGh security and light weigHT, is a 64-bit block cipher with support for 128-bit keys. It was first proposed at the 2006 Cryptographic Hardware and Embedded Systems (CHES) conference held in Japan. HIGHT attracted a lot of attention upon its release because it only uses add-rotate-xor (ARX) operations on 8-bit words. It was selected in 2010 by the Telecommunications Technology Association (TTA) of Korea to become part of the ISO/IEC 18033-3 portfolio, which also consists of the following 64-bit block ciphers: TDEA, MISTY1, CAST-128. Some points to note:

  • 64-bit input/output data block size
  • 128-bit key length
  • 32 rounds using addition modulo 2**8, rotation, and XOR (ARX)
  • No S-Box (non-linear operations using modular addition)
  • Designed for low-resource devices (data storage, power, etc.)

Constant Generation

“Random” constants are used in the generation of subkeys. These protect the cipher against simple slide attacks. I’ve included the option of using a lookup table or a function that calculates the constants at runtime.

// use lookup table for constants
#ifdef LUT
uint8_t rc[128]=
{ 0x5a, 0x6d, 0x36, 0x1b, 0x0d, 0x06, 0x03, 0x41,
  0x60, 0x30, 0x18, 0x4c, 0x66, 0x33, 0x59, 0x2c,
  0x56, 0x2b, 0x15, 0x4a, 0x65, 0x72, 0x39, 0x1c,
  0x4e, 0x67, 0x73, 0x79, 0x3c, 0x5e, 0x6f, 0x37,
  0x5b, 0x2d, 0x16, 0x0b, 0x05, 0x42, 0x21, 0x50,
  0x28, 0x54, 0x2a, 0x55, 0x6a, 0x75, 0x7a, 0x7d,
  0x3e, 0x5f, 0x2f, 0x17, 0x4b, 0x25, 0x52, 0x29,
  0x14, 0x0a, 0x45, 0x62, 0x31, 0x58, 0x6c, 0x76,
  0x3b, 0x1d, 0x0e, 0x47, 0x63, 0x71, 0x78, 0x7c,
  0x7e, 0x7f, 0x3f, 0x1f, 0x0f, 0x07, 0x43, 0x61,
  0x70, 0x38, 0x5c, 0x6e, 0x77, 0x7b, 0x3d, 0x1e,
  0x4f, 0x27, 0x53, 0x69, 0x34, 0x1a, 0x4d, 0x26,
  0x13, 0x49, 0x24, 0x12, 0x09, 0x04, 0x02, 0x01,
  0x40, 0x20, 0x10, 0x08, 0x44, 0x22, 0x11, 0x48,
  0x64, 0x32, 0x19, 0x0c, 0x46, 0x23, 0x51, 0x68,
  0x74, 0x3a, 0x5d, 0x2e, 0x57, 0x6b, 0x35, 0x5a };
#else

void gen_const(uint8_t *ci)
{
    int     i, j;
    uint8_t c;
    
    union {
      uint8_t  b[128+8];
      uint32_t w[(128+8)/4];
    } s;

    // zero initialize s
    memset(&s, 0, sizeof(s));

    // set initial bits
    s.w[1] = 65537;
    s.w[0] = s.w[1] << 8;

    // set first constant
    // precalculated from bits of s array
    ci[0] = 0x5A;

    for(i=1; i<128; i++) {
      c = s.b[i + 2] ^
          s.b[i - 1];

      s.b[i + 6] = c;

      for(j=1; j<7; j++) {
        c += c;
        c ^= s.b[i + 6 - j];
      }
      ci[i] = c;
    }
}
#endif
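As a sanity check, the first few constants produced at runtime should match the start of the rc[] table above (0x5A, 0x6D, 0x36, 0x1B, …). A sketch reusing the gen_const function (note it assumes a little-endian layout, since the 32-bit seed words are read back as bytes):

```c
#include <stdint.h>
#include <string.h>

// same constant generator as above
void gen_const(uint8_t *ci)
{
    int     i, j;
    uint8_t c;

    union {
      uint8_t  b[128+8];
      uint32_t w[(128+8)/4];
    } s;

    // zero initialize s
    memset(&s, 0, sizeof(s));

    // set initial bits
    s.w[1] = 65537;
    s.w[0] = s.w[1] << 8;

    // first constant, precalculated from bits of s array
    ci[0] = 0x5A;

    for(i=1; i<128; i++) {
      c = s.b[i + 2] ^ s.b[i - 1];
      s.b[i + 6] = c;

      for(j=1; j<7; j++) {
        c += c;
        c ^= s.b[i + 6 - j];
      }
      ci[i] = c;
    }
}
```

The generated buffer should begin with the same bytes as the lookup table.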

The assembly code is less than 128 bytes, so it's used instead of the 128-byte precomputed array.

gen_constx:
    pushad
    mov    esi, edi ; esi = ci
    xor    eax, eax
    mov    cl, 80h + 8 ; allocate 136 bytes
    sub    esp, ecx
    mov    edi, esp
    ; memset(&s, 0, sizeof(s));
    rep    stosb
    ; s.w[1] = 65537;
    mov    dword[esp+4], 65537
    ; s.w[0] = s.w[1] << 8;
    mov    dword[esp], 65537 << 8
    ; ci[0] = 0x5A;
    mov    byte[esi], 0x5A
    ; i = 1;
    inc    eax
gen_l1:
    ; c = s.b[i+2] ^ s.b[i-1];
    mov    cl, [esp+eax+2]
    xor    cl, [esp+eax-1]
    ; s.b[i+6] = c
    mov    [esp+eax+6], cl
    ; j = 1
    cdq
    inc    edx
gen_l2:
    ; c += c
    add    cl, cl
    ; c ^= s.b[(i + 6) - j];
    lea    ebx, [eax+6]
    sub    ebx, edx  
    xor    cl, [esp+ebx]
    ; j++
    inc    edx
    cmp    dl, 7
    jnz    gen_l2
    ; ci[i] = c;
    mov    [esi+eax], cl
    ; i++
    inc    al
    ; i<128
    jns    gen_l1
    lea    esp, [esp+eax+8]   
    popad
    ret

Key Schedule

The master key is expanded from 128 bits to 1024 bits. Each byte of the master key, indexed using an unexplained calculation, is added to one of the generated constants.

void hight128_setkey(void *in, void *out)
{
    uint8_t i, j, idx;
    w128_t  *wk, *mk;
    uint8_t *sk;
    
    mk=(w128_t*)in;
    wk=(w128_t*)out;
    sk=(uint8_t*)out;
    
    // apply key whitening
    wk->w[0] = mk->w[3];
    wk->w[1] = mk->w[0];

    #ifdef LUT
      memcpy(&sk[8], rc, sizeof(rc));
    #else  
      // generate constants
      gen_const(&sk[8]);
    #endif
 
    // generate subkeys
    for(i=0; i<8; i++) {
      sk += 8;
      for(j=0; j<8; j++) {
        idx = (j - i + 8) & 7;

        sk[0] += mk->b[idx  ];
        sk[8] += mk->b[idx+8];
        sk++;        
      }
    }
}

There may be a way to generate the subkeys in one loop; the calculation of idx currently requires two loops.

hight128_setkeyx:    
_hight128_setkeyx:
    pushad
    mov    esi, [esp+32+4] ; esi = in
    mov    edi, [esp+32+8] ; edi = out    
    mov    eax, [esi+3*4]
    stosd
    mov    eax, [esi]
    stosd
    ; i=0
    xor    ecx, ecx    
    call   gen_constx
sk_l1:
    ; j=0
    xor    edx, edx
sk_l2:
    ; idx = ((j + 8) - i) & 7;
    lea    ebx, [edx+8]
    sub    ebx, ecx
    and    ebx, 7
    ; sk[0] += mk->b[idx  ];
    mov    al, [esi+ebx]
    add    [edi+0], al
    ; sk[8] += mk->b[idx+8];
    mov    al, [esi+ebx+8]
    add    [edi+8], al
    ; sk++;
    inc    edi
    ; j++
    inc    edx
    ; j<8
    cmp    dl, 8
    jnz    sk_l2
    ; sk += 8
    add    edi, edx
    ; i++
    inc    ecx
    ; i<8
    cmp    cl, 8
    jnz    sk_l1    
    popad
    ret

Round functions

Two linear functions provide diffusion. Although they’re shown separately here, another version inlines them in the main encryption loop.

uint8_t F0(uint8_t x) {
    return ROTL8(x, 1) ^ ROTL8(x, 2) ^ ROTL8(x, 7);
}

uint8_t F1(uint8_t x) {
    return ROTL8(x, 3) ^ ROTL8(x, 4) ^ ROTL8(x, 6);
}

Encryption

The only nonlinear operation here is modular addition. Perhaps it would have made more sense to use an s-box layer; a 4 x 4 bit-sliced s-box would work well here.

void hight128_encrypt(void *data, void *keys)
{
    int      i;
    w64_t    *x;
    uint8_t  *sk, *wk=(uint8_t*)keys;

    x  = (w64_t*)data;
    sk = &wk[8];

    // mix key with 1st 4 bytes
    x->b[0] += wk[0]; x->b[2] ^= wk[1];
    x->b[4] += wk[2]; x->b[6] ^= wk[3];

    for(i=0; i<32; i++) {
      // circular shift left by 8 bits
      x->q = ROTL64(x->q, 8);
      // apply linear/non-linear operations
      x->b[2] += (F1(x->b[1]) ^ *sk++);
      x->b[4] ^= (F0(x->b[3]) + *sk++);
      x->b[6] += (F1(x->b[5]) ^ *sk++);
      x->b[0] ^= (F0(x->b[7]) + *sk++);
    }
    // circular shift right by 8 bits
    x->q = ROTL64(x->q, 56);

    // mix key with 2nd 4 bytes
    x->b[0] += wk[4]; x->b[2] ^= wk[5];
    x->b[4] += wk[6]; x->b[6] ^= wk[7];
}

The assembly code for this is slightly different from the C code, and wasn’t enjoyable to implement 😉 Each round executes 2 iterations of an inner loop, each consisting of 1 addition and 1 XOR with the linear functions inlined. The 64-bit rotation uses ADD/ADC.

; rotate cipher text 64-bits    
rotl64: 
    pushad  
    mov    eax, [edi]
    mov    ebx, [edi+4]   
rt_l0:     
    add    eax, eax
    adc    ebx, ebx
    adc    al, 0
    loop   rt_l0    
    stosd
    xchg   eax, ebx
    stosd
    popad
    ret   
   
hight128_encryptx:
_hight128_encryptx:
    pushad
    mov    edi, [esp+32+4] ; data
    mov    esi, [esp+32+8] ; wk and sk
    push   2
    pop    ecx
    push   edi
hi_l0:    
    lodsw
    ; x->b[0] += wk[0];     
    add    [edi+0], al
    ; x->b[2] ^= wk[1];
    xor    [edi+2], ah
    scasd
    loop   hi_l0    
    pop    edi    
    lodsd    
    ; save wk[4]
    push   eax
    ; 32 rounds    
    mov    cl, 32
hi_enc:
    push   ecx
    ; x->q = ROTL64(x->q, 8);
    mov    cl, 8
    call   rotl64        
    mov    cl, 2
    movzx  edx, cl
hi_l1:
    ; c = x->b[j-1];
    mov    al, [edi+edx-1]
    ; c = ROTL8(c, 3) ^ ROTL8(c, 4) ^ ROTL8(c, 6);
    mov    ah, al
    rol    ah, 3
    rol    al, 4
    xor    ah, al
    rol    al, 6-4
    xor    ah, al
    ; x->b[j] += (c ^ *sk++);
    lodsb
    xor    al, ah
    add    [edi+edx], al    
    ; c = x->b[j+1];
    mov    al, [edi+edx+1]
    ; c = ROTL8(c, 1) ^ ROTL8(c, 2) ^ ROTL8(c, 7);
    mov    ah, al
    rol    ah, 1
    rol    al, 2
    xor    ah, al
    rol    al, 7-2
    xor    ah, al
    ; x->b[(j+2) & 7] ^= (c + *sk++);
    lodsb
    add    al, ah
    lea    ebx, [edx+2]
    and    bl, 7
    xor    [edi+ebx], al
    add    dl, 4    
    loop   hi_l1    
    pop    ecx
    loop   hi_enc    
    ; x->q = ROTL64(x->q, 56);   
    mov    cl, 56
    call   rotl64    
    ; restore wk[4]
    pop    eax
    ; x->b[0] += wk[4];     
    add    [edi+0], al
    ; x->b[2] ^= wk[5];
    xor    [edi+2], ah
    bswap  eax                ; instead of shr eax, 16
    ; x->b[4] += wk[6]; 
    add    [edi+4], ah    
    ; x->b[6] ^= wk[7];
    xor    [edi+6], al    
    popad
    ret

I’ve no doubt the above code can be optimized further, but it would require removing the dependency on EDI in the main loop. The load/store operations using EDI are only necessary at the beginning and end of the code. If the plaintext were loaded into 2 32-bit registers, the 64-bit rotation in the main loop could likely be removed by rotating registers and swapping bytes.

Summary

Because of the linear functions, it’s difficult to see how one could optimize this for a 32-bit or 64-bit CPU. I’ve speculated that if the bytes of plaintext were rearranged, one could optimize it further for a 16-bit CPU, but it would still need 8-bit operations for the linear functions. The assembly code generated by MSVC is approx. 440 bytes, while the handwritten assembly is currently 283 bytes.

Sources here.

Posted in assembly, cryptography, encryption, programming, security

RoadRunneR Block Cipher

Introduction

RoadRunneR is a 64-bit block cipher supporting 80- and 128-bit keys. It was designed by Adnan Baysal and Suhap Sahin and published in 2015. The 80-bit variant uses 10 rounds, while the 128-bit variant uses 12. In this post, I’ll show x86 assembly and C code for the 128-bit version. The authors of the cipher state the main objectives of RoadRunneR in their proposal:

  • Implementation efficiency in 8-bit CPUs
  • No table and SRAM usage
  • Low decryption overhead
  • Provable security like in wide trail design strategy

RoadRunneR is a perfect choice for devices with very restricted memory resources and for applications requiring reasonable throughput expectations.

Although the cipher is specifically designed for 8-bit CPUs, the code shown here has a mixture of 8-bit and 32-bit code.

Compact code

The following implementation combines both key scheduling and encryption in one function. The mk parameter should point to a 128-bit master key and the data parameter should point to a 64-bit block of plaintext to be encrypted.

#define R(v,n)(((v)<<(n))|((v)>>(8-(n))))
#define X(x,y)t=x,x=y,y=t;

typedef unsigned char B;
typedef unsigned int W;

// S-Layer
void S(void *p) {
    B t, *x=(B*)p;
    
    t = x[3]; 
    x[3] &= x[2]; 
    x[3] ^= x[1];
    x[1] |= x[2];
     
    x[1] ^= x[0];
    x[0] &= x[3]; 
    x[0] ^=  t; 
    t    &= x[1];    
    x[2] ^=  t;
}

// encrypt 64-bits of data using 128-bit key  
void roadrunner(void *mk, void *data) {
    int i, j, r;
    W   t, *x=(W*)data, *k=(W*)mk;
    B   s, *p, k_idx=4;
    
    // apply K-Layer
    x[0] ^= k[0];
    
    // apply 12 rounds of encryption
    for(r=12; r>0; r--) {
      // F round
      t=x[0]; // save original 32-bits
      p=(B*)x;
      for(i=0; i<3; i++) {
        // add constant
        if(i==2) p[3] ^= r;
        // apply S-Layer
        S(p);
        for (j=0; j<4; j++) {      
          // apply L-Layer
          s = R(p[j], 1) ^ p[j];       
          p[j] ^= R(s,  1); 
          // apply K-Layer
          p[j] ^= ((B*)k)[(k_idx%16)+j];
        }
        k_idx+=4;
      }
      // apply S-Layer
      S(p);
      // add upper 32-bits
      x[0]^= x[1]; x[1] = t;
    }
    // permute
    X(x[0], x[1]);
    // apply K-Layer
    x[0] ^= k[1];
}

x86 assembly

; -----------------------------------------------
; RoadRunneR-64/128 block cipher in x86 assembly
;
; size: 135 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------


    bits 32
    
    %ifndef BIN
      global roadrunner
      global _roadrunner
    %endif

struc pushad_t
  _edi resd 1
  _esi resd 1
  _ebp resd 1
  _esp resd 1
  _ebx resd 1
  _edx resd 1
  _ecx resd 1
  _eax resd 1
  .size:
endstruc

roadrunner:
_roadrunner:
    pushad
    mov    edi, [esp+32+4]    ; k = mk
    mov    esi, [esp+32+8]    ; x = data
    xor    ebx, ebx           ; k_idx = 0
    ; apply K-Layer
    ; x[0] ^= ((W*)mk)[0];
    mov    ebp, [edi]         ; t = ((W*)mk)[0]
    xor    [esi], ebp         ; x[0] ^= t 
    ; apply 12 rounds of encryption
    ; for(r=12; r>0; r--) {
    push   12             
    pop    ecx
rr_L0:    
    ; F round
    pushad                    ; save all registers: for(r=12
    mov    ebp, [esi]         ; t = x[0]
    mov    edx, ecx           ; edx = r
    mov    cl, 3              ; i = 3
rr_L1:
    inc    ebx                ; k_idx = (k_idx + 4) % 16
    and    ebx, 3
    pushad                    ; save all registers: for(i=3
    ; add constant
    cmp    cl, 1              ; if (i == 1)
    jne    rr_L2
    xor    [esi+3], dl        ; p[3] ^= r
rr_L2:
    lea    edi, [edi+ebx*4]   ; edi = &k[k_idx]
    ; apply S-Layer
    ; S(p);
    call   S
    ; for (j=3; j>=0; j--) { 
    mov    cl, 4              ; j = 4 
    ; apply L-Layer
rr_L3:
    ; s = R(p[j], 1) ^ p[j]; 
    mov    al, [esi]          ; s = p[j]
    rol    al, 1              ; s = R(s, 1)
    xor    al, [esi]          ; s^= p[j]
    ; s = R(s, 1) ^ p[j];
    rol    al, 1              ; s = R(s, 1) 
    xor    al, [esi]          ; s^= p[j]
    
    ; apply K-Layer
    ; p[j] = s ^ ((B*)k)[(k_idx%16)+j];
    xor    al, byte[edi]
    mov    [esi], al
    cmpsb
    loop   rr_L3              ; j>0; j--
    
    popad                     ; restore registers: for(i=3
    loop   rr_L1              ; i>0; i--
    
    ; apply S-Layer
    ; S(p);    
    call   S
    ; add upper 32-bits
    ; x[0]^= x[1];
    lodsd
    xor    eax, [esi]
    mov    [esi-4], eax
    
    ; x[1] = t;
    mov    [esi], ebp
    mov    [esp+_ebx], ebx    ; save k_idx
    popad                     ; restore registers: for(r=12
    loop   rr_L0              ; r>0; r--
    
    ; permute
    ; X(x[0], x[1]);
    lodsd
    xchg   eax, [esi]
    
    ; apply K-layer
    ; x[0] ^= ((W*)mk)[1];    
    xor    eax, [edi+4]
    mov    [esi-4], eax  
    
    popad
    ret    

; S-Layer    
S:
    pushad
    mov    ebx, esi      ; ebx = esi
    lodsd
    mov    edx, eax      ; t = ROTR32(x[0], 16); 
    ror    edx, 16
    and    [ebx+3], dl   ; x[3] &= t[0];
    xor    [ebx+3], ah   ; x[3] ^= x[1];
    or     [ebx+1], dl   ; x[1] |= t[0];    
    xor    [ebx+1], al   ; x[1] ^= x[0];
    mov    dl, [ebx+3]
    and    [ebx+0], dl   ; x[0] &= x[3];
    xor    [ebx+0], dh   ; x[0] ^= t[1];
    and    dh, [ebx+1]   ; t[1] &= x[1];
    xor    [ebx+2], dh   ; x[2] ^= t[1]; 
    popad
    ret

S Layer

Using bit-slicing techniques for an S-box layer has become popular as research into lightweight cryptography for resource-constrained environments progresses. Block ciphers such as NOEKEON, PRIDE, RECTANGLE and Fantomas/Robin all use a bit-sliced S-box layer. Advantages of a bit-sliced S-box layer in both hardware and software implementations include protection against side-channel analysis and elimination of the S-box array that would otherwise require ROM/RAM. The authors say the S-box for RoadRunneR was found through a brute-force search of assembly code combinations. The search space was restricted to 4 input words, a single temporary word, and the following instructions: AND, OR, XOR, NOT, and MOV.

Finding Optimal Bitsliced Implementations of 4×4-bit S-boxes describes this process in more detail. The following operations implement the 4 x 4 S-Layer.

// S-Layer
void sbox(uint8_t *x)
{
    uint8_t t;
    
    t = x[3];

    x[3] &= x[2];
    x[3] ^= x[1];
    x[1] |= x[2];
    x[1] ^= x[0];
    x[0] &= x[3];
    x[0] ^=  t; 
       t &= x[1];
    x[2] ^=  t;
}

Cryptographic Analysis of All 4 x 4 – Bit S-Boxes

L Layer

The L-Layer applies a linear function to each byte of the state to provide diffusion. Such a function is normally implemented as an XOR of rotated or shifted values, and is commonly seen in block ciphers and hash algorithms such as Serpent, SM3 and SM4.

K Layer

This is not referenced in the original paper, but is essentially key whitening, also known as key addition, which was first used by Ron Rivest in DES-X. The theory is that XOR’ing key bits into the state during encryption strengthens the final ciphertext against exhaustive brute-force attacks. It’s a technique used in many SPN block ciphers; for example, it’s known as AddRoundKey in the Advanced Encryption Standard (AES).

SLK Function

Combining Layers S, L and K is the following function.

// SLK function 
void SLK(w32_t *x, uint8_t *sk)
{
    int     i;
    uint8_t t;
    uint8_t *p=x->b;
    
    // apply S-Layer
    sbox(p);
    
    for (i=3; i>=0; i--) {      
      // apply L-Layer
      t   = ROTL8(*p, 1) ^ *p;       
      *p ^= ROTL8(t,  1); 
      
      // apply K-Layer
      *p++ ^= *sk++;
    }
}

The F-Function

The main function consists of a 4-round substitution–permutation network (SPN) structure. The first three rounds use the same function, SLK, which applies the S, L and K layers already described. The final round applies only the S layer. In addition to the aforementioned layers, a round constant is added after the second call to SLK: the round number, which for the 128-bit key variant runs from 12 down to 1, is simply XOR’ed against byte 3 of the lower 32 bits of the state.

It is, as the authors describe, intended to protect against simple Slide Attacks. Biryukov and Wagner state the following in Advanced Slide Attacks.

The most obvious type of slide attack is usually easy to prevent by destroying self-similarity in iterative ciphers, for example by adding iteration counters or random constants.

// F round
void F(w64_t *blk, void *key, 
    uint8_t *key_idx, uint8_t ci)
{
    int      i;
    uint32_t t;
    uint8_t  *rk=(uint8_t*)key;
    w32_t    *x=(w32_t*)blk;
    
    // save 32-bits
    t = x->w;
    
    for (i=3; i>0; i--) {
      // add round constant
      if (i==1) x->b[3] ^= ci;
      // apply S,L,K layers
      SLK (x, rk + *key_idx);      
      // advance master key index
      *key_idx = (*key_idx + 4) & 15;
    }
    
    // apply S-Layer
    sbox(x->b);
    
    // add upper 32-bits
    blk->w[0]^= blk->w[1];
    blk->w[1] = t;
}

Encryption

The variable key_idx, which acts as a master key index, is initialized to 4, skipping the first 4 bytes of the key, which are XOR’d against the plaintext as part of the K-Layer before the 12 rounds of F. Note how rnd is initialized to 12 and passed to F; this is the round constant used to protect against simple slide attacks.

// encrypt 64-bits of data using 128-bit key  
void road64_encrypt(void *data, void *key)
{
    int      rnd;
    uint8_t  key_idx;
    uint32_t t;
    w64_t    *x=(w64_t*)data;
    uint32_t *rk=(uint32_t*)key;

    // initialize master key index
    key_idx = 4;
    
    // apply K-Layer
    x->w[0] ^= rk[0];
    
    // apply rounds
    for (rnd=RR_ROUNDS; rnd>0; rnd--) {
      F(x, rk, &key_idx, rnd);
    }
    // P-Layer?
    XCHG(x->w[0], x->w[1]);
    // apply K-Layer
    x->w[0] ^= rk[1];
}

Sources here.

Posted in assembly, cryptography, encryption, programming, security

CHAM Block Cipher

Introduction

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices was published in December 2017 at the 20th Annual International Conference on Information Security and Cryptology held in South Korea. CHAM consists of three ciphers, CHAM-64/128 for 16-bit architectures, CHAM-128/128 and CHAM-128/256 for 32-bit architectures. This post will only focus on CHAM-128/128 that operates on four branches of 32 bits each. The only operations used are 32-bit modular addition, rotation and exclusive-OR (ARX). The design of CHAM draws inspiration from the SPECK and SIMON block ciphers published by the NSA.

The following are parameters for all three variants. N is the block length, K is the key length, R is the number of rounds, W is the width of a word and K/W is the number of words per key.

              N    K    R   W  K/W
CHAM-64/128   64   128  80  16  8
CHAM-128/128  128  128  80  32  4
CHAM-128/256  128  256  96  32  8

Key schedule

Eight round keys are generated from the 128-bit master key. Each round key is 32 bits in length.

#define KW 4  // K/W: number of words per 128-bit key
#define ROTL32(v,n) (((v)<<(n))|((v)>>(32-(n))))

void cham128_setkey(void *in, void *out)
{
    int i;
    uint32_t *k=(uint32_t*)in;
    uint32_t *rk=(uint32_t*)out;

    for (i=0; i<KW; i++) {
      rk[i]            = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 8);
      rk[(i + KW) ^ 1] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 11);
    }
}

x86 assembly

%define K 128   ; key length
%define N 128   ; block length
%define R 80    ; number of rounds
%define W 32    ; word length
%define KW K/W  ; number of words per key
     
cham128_setkeyx:
_cham128_setkeyx:
    pushad
    mov    esi, [esp+32+4]  ; k  = in
    mov    edi, [esp+32+8]  ; rk = out    
    xor    eax, eax         ; i  = 0
sk_l0:
    mov    ebx, [esi+eax*4] ; ebx = k[i]
    mov    ecx, ebx         ; ecx = k[i]
    mov    edx, ebx         ; edx = k[i]    
    rol    ebx, 1           ; ROTL32(k[i], 1)
    rol    ecx, 8           ; ROTL32(k[i], 8)    
    xor    edx, ebx         ; k[i] ^ ROTL32(k[i], 1) 
    xor    edx, ecx    
    mov    [edi+eax*4], edx ; rk[i] = edx
    xor    edx, ecx         ; reset edx
    rol    ecx, 3           ; k[i] ^ ROTL32(k[i], 11)
    xor    edx, ecx    
    lea    ebx, [eax+KW]
    xor    ebx, 1
    mov    [edi+ebx*4], edx ; rk[(i + KW) ^ 1] = edx
    inc    al
    cmp    al, KW
    jnz    sk_l0    
    popad
    ret

Encryption

There are 80 rounds in total; this is significantly more than the 32 used by SPECK-128/128.

void cham128_encrypt(void *keys, void *data)
{
    int i;
    uint32_t x0, x1, x2, x3;
    uint32_t t;
    uint32_t *rk=(uint32_t*)keys;
    uint32_t *x=(uint32_t*)data;
    
    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    for (i=0; i<R; i++)
    {
      if ((i & 1) == 0) {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 1) ^ rk[i & 7]), 8);
      } else {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 8) ^ rk[i & 7]), 1);
      }
      XCHG(x0, x1);
      XCHG(x1, x2);
      XCHG(x2, x3);
    }
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}

Compact code

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
    W rk[8],*w=p,*k=mk,i,t;

    F(4)
      t=k[i]^R(k[i],31),
      rk[i]=t^R(k[i],24),
      rk[(i+4)^1]=t^R(k[i],21);
    F(80)
      t=w[3],w[0]^=i,w[3]=rk[i&7],
      w[3]=w[0]+(w[3]^R(w[1],(i&1)?24:31)),
      w[3]=R(w[3],(i&1)?31:24),
      w[0]=w[1],w[1]=w[2],w[2]=t;
}

x86 assembly

Only encryption is implemented here, since decryption isn’t necessary when the cipher is used in CTR mode.

; -----------------------------------------------
; CHAM-128/128 block cipher in x86 assembly
;
; size: 124 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------

      bits 32
     
      %ifndef BIN
        global cham
        global _cham
      %endif
      
%define K 128   ; key length
%define N 128   ; block length
%define R 80    ; number of rounds
%define W 32    ; word length
%define KW K/W  ; number of words per key
     
%define x0 ebp
%define x1 ebx
%define x2 edx
%define x3 esi
%define rk edi

cham:
_cham:
    pushad
    mov    esi, [esp+32+4]  ; k = key
    mov    ebp, [esp+32+8]  ; x = data
    pushad                  ; allocate 2*KW
    mov    edi, esp         ; edi = rk    
    xor    eax, eax         ; i  = 0
sk_l0:
    mov    ebx, [esi+eax*4] ; ebx = k[i]
    mov    ecx, ebx         ; ecx = k[i]
    mov    edx, ebx         ; edx = k[i]    
    rol    ebx, 1           ; ROTL32(k[i], 1)
    rol    ecx, 8           ; ROTL32(k[i], 8)    
    xor    edx, ebx         ; k[i] ^ ROTL32(k[i], 1) 
    xor    edx, ecx  
    mov    [edi+eax*4], edx ; rk[i] = edx
    xor    edx, ecx         ; reset edx
    rol    ecx, 3           ; k[i] ^ ROTL32(k[i], 11)
    xor    edx, ecx    
    lea    ebx, [eax+KW]
    xor    ebx, 1
    mov    [edi+ebx*4], edx ; rk[(i + KW) ^ 1] = edx
    inc    eax
    cmp    al, KW
    jnz    sk_l0    
    
    xchg   esi, ebp
    push   esi
    lodsd
    xchg   eax, x0
    lodsd
    xchg   eax, x1
    lodsd
    xchg   eax, x2
    lodsd
    xchg   eax, x3
    xor    eax, eax ; i = 0
    ; reset round key pointer every 8 rounds
    ; (rk sits on the stack; [esp] holds the saved data pointer)
enc_l0: 
    lea    edi, [esp+4]     ; edi = rk
    jmp    enc_lx    
enc_l1:
    test   al, 7    ; i & 7
    jz     enc_l0
enc_lx:    
    push   eax      ; save i
    mov    cx, 0x0108
    test   al, 1    ; if ((i & 1)==0)
    jnz    enc_l2
    
    xchg   cl, ch
enc_l2:
    xor    x0, eax          ; x0 ^= i
    mov    eax, x1
    rol    eax, cl          ; 
    xor    eax, [edi]       ; ROTL32(x1, r0) ^ *rk++
    scasd
    add    x0, eax
    xchg   cl, ch
    rol    x0, cl
    
    xchg   x0, x1          ; XCHG(x0, x1);
    xchg   x1, x2          ; XCHG(x1, x2);
    xchg   x2, x3          ; XCHG(x2, x3);
    
    pop    eax      ; restore i
    inc    eax      ; i++
    cmp    al, R    ; i<R
    jnz    enc_l1
    
    pop    edi
    xchg   eax, x0
    stosd           ; x[0] = x0;
    xchg   eax, x1
    stosd           ; x[1] = x1;
    xchg   eax, x2
    stosd           ; x[2] = x2;
    xchg   eax, x3
    stosd           ; x[3] = x3;
    popad           ; discard stack round keys
    popad           ; restore registers
    ret

ARM / AArch32 assembly

  .arm
  .arch armv7-a  
  .text
  
  .global cham
  
k   .req r0
x   .req r1

// data
x0 .req r0
x1 .req r2
x2 .req r3
x3 .req r4  

// round keys  
rk .req sp

k0 .req r6
k1 .req r7
k2 .req r8

i  .req r10

cham:
  // save registers
  push   {r0-r12,lr}
  
  // allocate memory for round keys
  sub    sp, #32
  
  // derive round keys from 128-bit key
  mov    i, #0                 // i  = 0
cham_init:  
  ldr    k0, [k, i, lsl #2]    // k0 = k[i];  
  ror    k1, k0, #31           // k1 = ROTR32(k0, 31);
  ror    k2, k0, #24           // k2 = ROTR32(k0, 24);  
  eor    k0, k1                // k0^= k1;
  eor    k1, k0, k2            // rk[i] = k0 ^ k2; 
  str    k1, [rk, i, lsl #2]  
  eor    k0, k2, ror #29       // k0 ^= ROTR32(k2, 29);
  add    k1, i, #4             // k1 = (i+KW)
  eor    k1, #1                // k1 = (i+KW) ^ 1 
  str    k0, [rk, k1, lsl #2]  // rk[(i+KW)^1] = k0;  
  add    i, #1                 // i++
  cmp    i, #4                 // i<KW  
  bne    cham_init             //  
  
  // load 128-bit plain text
  ldm    x, {x0, x1, x2, x3}
  
  // perform encryption
  mov    i, #0                 // i = 0 
cham_enc:
  mov    k0, x3
  eor    x0, i                 // x0 ^= i
  tst    i, #1                 // if (i & 1)  
  
  // x3  = rk[i & 7];    
  and    k1, i, #7             // k1 = i & 7;
  ldr    x3, [rk, k1, lsl #2]  // x3 = rk[i & 7];  
  
  // execution depends on (i % 2)
  // x3 ^= (i & 1) ? ROTR32(x1, 24) : ROTR32(x1, 31);
  eorne  x3, x1, ror #24       // 
  eoreq  x3, x1, ror #31       // 
  
  add    x3, x0                // x3 += x0;
  
  // x3 = (i & 1) ? ROTR32(x3, 31) : ROTR32(x3, 24);
  rorne  x3, #31               // x3 = ROTR32(x3, 31); 
  roreq  x3, #24               // x3 = ROTR32(x3, 24);
  
  // swap
  mov    x0, x1                // x0 = x1; 
  mov    x1, x2                // x1 = x2;
  mov    x2, k0                // x2 = k0;

  add    i, #1                 // i++;  
  cmp    i, #80                // i<R 
  bne    cham_enc              // 
  
  // save 128-bit cipher text
  stm    x, {x0, x1, x2, x3}   // x[0] = x0; x[1] = x1; 
                               // x[2] = x2; x[3] = x3;
  // release memory for round keys
  add    sp, #32
                                                              
  // restore registers
  pop    {r0-r12, pc}

ARM64 / AArch64 assembly

// CHAM 128/128 in ARM64 assembly
// 160 bytes 

    .arch armv8-a
    .text
    .global cham

    // cham(void*mk,void*p);
cham:
    sub    sp, sp, 32
    mov    w2, wzr
    mov    x8, x1
L0:
    // t=k[i]^R(k[i],31),
    ldr    w5, [x0, x2, lsl 2]
    eor    w6, w5, w5, ror 31

    // rk[i]=t^R(k[i],24),
    eor    w7, w6, w5, ror 24
    str    w7, [sp, x2, lsl 2]

    // rk[(i+4)^1]=t^R(k[i],21);
    eor    w7, w6, w5, ror 21
    add    w5, w2, 4
    eor    w5, w5, 1
    str    w7, [sp, x5, lsl 2]

    // i++
    add    w2, w2, 1
    // i < 4
    cmp    w2, 4
    bne    L0

    ldp    w0, w1, [x8]
    ldp    w2, w3, [x8, 8]

    // i = 0
    mov    w4, wzr
L1:
    tst    w4, 1

    // t=w[3],w[0]^=i,w[3]=rk[i%8],
    mov    w5, w3
    eor    w0, w0, w4
    and    w6, w4, 7
    ldr    w3, [sp, x6, lsl 2]

    // w[3]^=R(w[1],(i & 1) ? 24 : 31),
    mov    w6, w1, ror 24
    mov    w7, w1, ror 31
    csel   w6, w6, w7, ne
    eor    w3, w3, w6

    // w[3]+=w[0],
    add    w3, w3, w0

    // w[3]=R(w[3],(i & 1) ? 31 : 24),
    mov    w6, w3, ror 31
    mov    w7, w3, ror 24
    csel   w3, w6, w7, ne

    // w[0]=w[1],w[1]=w[2],w[2]=t;
    mov    w0, w1
    mov    w1, w2
    mov    w2, w5

    // i++ 
    add    w4, w4, 1
    // i < 80
    cmp    w4, 80
    bne    L1

    stp    w0, w1, [x8]
    stp    w2, w3, [x8, 8]
    add    sp, sp, 32
    ret

Sources here.

Posted in assembly, cryptography, encryption, programming