## Introduction

In January 1997, the National Institute of Standards and Technology (NIST) initiated a process to replace the Data Encryption Standard (DES), published in 1977. Draft criteria for evaluating potential algorithms were published, and members of the public were invited to provide feedback. The finalized criteria, published in September 1997, outlined the minimum acceptable requirements for each submission. Four years later, in November 2001, Rijndael, designed by the Belgian cryptographers Vincent Rijmen and Joan Daemen and now referred to as the Advanced Encryption Standard (AES), was announced as the winner.

Since publication, implementations of AES have frequently been optimized for speed. Code that executes quickly has traditionally taken priority over how much ROM it uses. Developers typically use lookup tables to accelerate each step of the encryption process, so compact implementations are rarely, if ever, sought after. Our challenge here is to implement AES in the least amount of C and, more specifically, x86 assembly code. The result will obviously be slow, and it will not be resistant to side-channel analysis, although the latter problem can likely be resolved using conditional move instructions (CMOVcc) if necessary. There are also implementations in 32-bit and 64-bit ARM assembly. The 8-bit version of the C code has been successfully tested on an Arduino Uno.

## Parameters

There are three different sets of parameters available, with the main difference being key length. Our implementation will be AES-128, which fits perfectly onto a 32-bit architecture.

|         | Key Length (Nk words) | Block Size (Nb words) | Number of Rounds (Nr) |
|---------|-----------------------|-----------------------|-----------------------|
| AES-128 | 4                     | 4                     | 10                    |
| AES-192 | 6                     | 4                     | 12                    |
| AES-256 | 8                     | 4                     | 14                    |

## Structure

Two IF statements are introduced in order to perform the encryption in one loop. What isn't included in the illustration below is ExpandRoundKey and AddRoundConstant, which generate the round keys.

The first layout here is what we normally see used when describing AES. The second introduces two conditional statements that make the code more compact.

## Source in C

The optimizers built into C compilers can sometimes reveal more efficient ways to implement a piece of code. The following performs encryption and compiles to approximately 400 bytes of x86 assembly.

```
#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned W;
// Multiplication over GF(2**8)
W M(W x){
W t=x&0x80808080;
return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B x){
B i,y,c;
if(x){
for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
x=y;F(4)x^=y=(y<<1)|(y>>7);
}
return x^99;
}
void E(B *s){
W i,w,x[8],c=1,*k=(W*)&x[4];

// copy plain text + master key to x
F(8)x[i]=((W*)s)[i];

for(;;){
// AddRoundKey, 1st part of ExpandRoundKey
w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

// 2nd part of ExpandRoundKey
w=R(w,8)^c;F(4)w=k[i]^=w;

// if round 11, stop;
if(c==108)break;
// update round constant
c=M(c);

// SubBytes and ShiftRows
F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

// if not round 11, MixColumns
if(c!=108)
F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
}
}
```

## x86 Overview

Some x86 registers have special purposes, and it’s important to know this when writing compact code.

| Register | Description       | Used by                                          |
|----------|-------------------|--------------------------------------------------|
| eax      | Accumulator       | lods, stos, scas, xlat, mul, div                 |
| ebx      | Base              | xlat                                             |
| ecx      | Count             | loop, rep (conditional suffixes E/Z and NE/NZ)   |
| edx      | Data              | cdq, mul, div                                    |
| esi      | Source Index      | lods, movs, cmps                                 |
| edi      | Destination Index | stos, movs, scas, cmps                           |
| ebp      | Base Pointer      | enter, leave                                     |

Those of you familiar with the x86 architecture will know certain instructions have dependencies on, or affect the state of, other registers after execution. For example, LODSB loads a byte from the memory pointed to by SI into AL before incrementing SI by 1. STOSB stores the byte in AL to the memory pointed to by DI before incrementing DI by 1. MOVSB moves a byte from the memory pointed to by SI to the memory pointed to by DI, before adding 1 to both SI and DI. If the same instruction is preceded by REP (repeat), then the CX register is also affected, decreasing by 1 with each iteration.

## Initialization

The s parameter points to a 32-byte buffer containing the 16-byte plain text and 16-byte master key, which are copied to the local buffer x.

A copy of the data is required because both will be modified during the encryption process. ESI will point to s while EDI will point to x. EAX will hold the Rcon value declared as c. ECX will be used exclusively for loops, and EDX is a spare register for loops that require an index starting at zero. There's a reason to prefer EAX over other registers: byte comparisons are only 2 bytes for AL, while 3 for the others.

```
// 2 vs 3 bytes
/* 0001 */ "\x3c\x6c"             /* cmp al, 0x6c         */
/* 0003 */ "\x80\xfb\x6c"         /* cmp bl, 0x6c         */
/* 0006 */ "\x80\xf9\x6c"         /* cmp cl, 0x6c         */
/* 0009 */ "\x80\xfa\x6c"         /* cmp dl, 0x6c         */
```

In addition to this, one operation requires saving EAX in another register, which only requires 1 byte with XCHG. Other registers would require 2 bytes.

```
// 1 vs 2 bytes
/* 0001 */ "\x92"                 /* xchg edx, eax        */
/* 0002 */ "\x87\xd3"             /* xchg ebx, edx        */
```

Setting EAX to 1, our loop counter ECX to 4, and EDX to 0 can be accomplished in a variety of ways, each requiring only 7 bytes. The alternative for setting EAX here would be: XOR EAX, EAX; INC EAX.

```
// 7 bytes
/* 0001 */ "\x6a\x01"             /* push 0x1             */
/* 0003 */ "\x58"                 /* pop eax              */
/* 0004 */ "\x6a\x04"             /* push 0x4             */
/* 0006 */ "\x59"                 /* pop ecx              */
/* 0007 */ "\x99"                 /* cdq                  */
```

Another way …

```
// 7 bytes
/* 0001 */ "\x31\xc9"             /* xor ecx, ecx         */
/* 0003 */ "\xf7\xe1"             /* mul ecx              */
/* 0005 */ "\x40"                 /* inc eax              */
/* 0006 */ "\xb1\x04"             /* mov cl, 0x4          */
```

And another..

```
// 7 bytes
/* 0000 */ "\x6a\x01"             /* push 0x1             */
/* 0002 */ "\x58"                 /* pop eax              */
/* 0003 */ "\x99"                 /* cdq                  */
/* 0004 */ "\x6b\xc8\x04"         /* imul ecx, eax, 0x4   */
```

ESI will point to s, which contains our plain text and master key. ESI is normally reserved for read operations. We can load a byte with LODS into AL/EAX, and move values from ESI to EDI using MOVS. Typically we see stack allocation using ADD or SUB, and sometimes (very rarely) using ENTER. This implementation only requires 32 bytes of stack space, and PUSHAD, which saves 8 general-purpose registers on the stack, allocates exactly 32 bytes of memory with a 1-byte opcode. To illustrate why it makes more sense to use PUSHAD/POPAD instead of ADD/SUB or ENTER/LEAVE, the following are the x86 opcodes generated by an assembler.

```
// 5 bytes
/* 0000 */ "\xc8\x20\x00\x00" /* enter 0x20, 0x0 */
/* 0004 */ "\xc9"             /* leave           */

// 6 bytes
/* 0000 */ "\x83\xec\x20"     /* sub esp, 0x20   */
/* 0003 */ "\x83\xc4\x20"     /* add esp, 0x20   */

// 2 bytes
/* 0000 */ "\x60"             /* pushad          */
/* 0001 */ "\x61"             /* popad           */
```

Obviously the 2-byte example is better here, but once you require more than 96 bytes, ADD/SUB in combination with a register is usually the better option.

```
; *****************************
; void E(void *s);
; *****************************
_E:
xor    ecx, ecx           ; ecx = 0
mul    ecx                ; eax = 0, edx = 0
inc    eax                ; c = 1
mov    cl, 4
; F(8)x[i]=((W*)s)[i];
mov    esi, [esp+64+4]    ; esi = s
mov    edi, esp
add    ecx, ecx           ; copy state + master key to stack
rep    movsd
```

## Multiplication

A pointer to this function is stored in EBP, and there are three reasons to use EBP over other registers:

1. EBP has no addressable 8-bit registers, so we can't use it for any 8-bit operations.
2. Indirect memory access via EBP requires 1 extra byte, even for index zero.
3. The only instructions that use EBP implicitly are ENTER and LEAVE.

```
// 2 vs 3 bytes for indirect access
/* 0001 */ "\x8b\x5d\x00"         /* mov ebx, [ebp]       */
/* 0004 */ "\x8b\x1e"             /* mov ebx, [esi]       */
```

When writing compact code, EBP is useful only as a temporary register or pointer to some function.

```
; *****************************
; Multiplication over GF(2**8)
; *****************************
push   ecx                ; save ecx
mov    cl, 4              ; 4 bytes
add    al, al             ; al <<= 1
jnc    $+4                ;
xor    al, 27             ;
ror    eax, 8             ; rotate for next byte
loop   $-9                ;
pop    ecx                ; restore ecx
ret
pop    ebp
```

## SubByte

In the SubBytes step, each byte $a_{i,j}$ in the state matrix is replaced with $S(a_{i,j})$ using an 8-bit substitution box. The S-box is derived from the multiplicative inverse over $GF(2^8)$ combined with an affine transformation, and we can implement SubByte purely in code.

```
; *****************************
; B SubByte(B x)
; *****************************
sub_byte:
test   al, al            ; if(x){
jz     sb_l6
xchg   eax, edx
mov    cl, -1            ; i=255
; for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
sb_l0:
mov    al, 1             ; y=1
sb_l1:
test   ah, ah            ; !c
jnz    sb_l2
cmp    al, dl            ; y!=x
setz   ah
jz     sb_l0
sb_l2:
mov    dh, al            ; y^=M(y)
call   ebp               ;
xor    al, dh
loop   sb_l1             ; --i
; F(4)x^=y=(y<<1)|(y>>7);
mov    dl, al            ; dl=y
mov    cl, 4             ; i=4
sb_l5:
rol    dl, 1             ; y=R(y,1)
xor    al, dl            ; x^=y
loop   sb_l5             ; i--
sb_l6:
xor    al, 99            ; return x^99
mov    [esp+28], al
ret
```

The state matrix is combined with a subkey using the bitwise XOR operation.

```
; *****************************
; AddRoundKey
; *****************************
; F(4)s[i]=x[i]^k[i];
xchg   esi, edi           ; swap x and s
xor_key:
lodsd                     ; eax = x[i]
xor    eax, [edi+16]      ; eax ^= k[i]
stosd                     ; s[i] = eax
loop   xor_key
```

There are various cryptographic attacks possible against AES without this small but important step. Adding the round constant protects against the slide attack, first described in 1999 by David Wagner and Alex Biryukov. Without different round constants used to generate the round keys, all the round keys would be the same.

```
; *****************************
; AddRoundConstant
; *****************************
; *k^=c; c=M(c);
xor    [esi+16], al
call   ebp
```

## ExpandRoundKey

The operation to expand the master key into subkeys for each round of encryption isn't normally inlined. To boost performance, these round keys are precomputed before the encryption process, since repeating the same computation for every block would only waste CPU cycles. Compacting the AES code into a single call requires inlining the key expansion operation. The C code here is not directly translated into x86 assembly, but the assembly does produce the same result.

```
; ***************************
; ExpandRoundKey
; ***************************
; F(4)w<<=8,w|=S(((B*)k)[15-i]);w=R(w,8);F(4)w=k[i]^=w;
mov    eax, [esi+3*4]    ; w=k[3]
ror    eax, 8            ; w=R(w,8)
exp_l1:
call   S                 ; w=S(w)
ror    eax, 8            ; w=R(w,8);
loop   exp_l1
mov    cl, 4
exp_l2:
xor    [esi], eax        ; k[i]^=w
lodsd                    ; w=k[i]
loop   exp_l2
```

## Combining the steps

An earlier version of the code used separate AddRoundKey, AddRoundConstant and ExpandRoundKey steps, but since these all relate to using and updating the round key, they are combined to reduce the number of loops, thus shaving off a few bytes.

```
; *****************************
; AddRoundKey and ExpandRoundKey
; *****************************
; w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
; w=R(w,8)^c;F(4)w=k[i]^=w;
xchg   eax, edx
xchg   esi, edi
mov    eax, [esi+16+12]  ; w=R(k[3],8);
ror    eax, 8
xor_key:
mov    ebx, [esi+16]     ; t=k[i];
xor    [esi], ebx        ; x[i]^=t;
movsd                    ; s[i]=x[i];
; w=(w&-256)|S(w)
call   S                 ; al=S(al);
ror    eax, 8            ; w=R(w,8);
loop   xor_key
; w=R(w,8)^c;
xor    eax, edx          ; w^=c;
; F(4)w=k[i]^=w;
mov    cl, 4
exp_key:
xor    [esi], eax        ; k[i]^=w;
lodsd                    ; w=k[i];
loop   exp_key
```

## ShiftRows

ShiftRows cyclically shifts the bytes in each row of the state matrix by a certain offset. The first row is left unchanged. Each byte of the second row is shifted one to the left, with the third and fourth rows shifted by two and three respectively. Because the order of SubBytes and ShiftRows doesn't matter, they're combined in one loop.

```
; ***************************
; ShiftRows and SubBytes
; ***************************
; F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((B*)s)[i]);
mov    cl, 16
shift_rows:
lodsb                    ; al = S(s[i])
call   sub_byte
push   edx
mov    ebx, edx          ; ebx = i%4
and    ebx, 3            ;
shr    edx, 2            ; (i/4 - ebx) % 4
sub    edx, ebx          ;
and    edx, 3            ;
lea    ebx, [ebx+edx*4]  ; ebx = (ebx+edx*4)
mov    [edi+ebx], al     ; x[ebx] = al
pop    edx
inc    edx
loop   shift_rows
```

## MixColumns

The MixColumns transformation, along with ShiftRows, is the main source of diffusion. Each column is treated as a four-term polynomial $b(x)=b_{3}x^{3}+b_{2}x^{2}+b_{1}x+b_{0}$, where the coefficients are elements over $GF(2^{8})$, and is then multiplied modulo $x^{4}+1$ with a fixed polynomial $a(x)=3x^{3}+x^{2}+x+2$.

```
; *****************************
; MixColumns
; *****************************
; F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
mix_cols:
mov    eax, [edi]        ; w0 = x[i];
mov    ebx, eax          ; w1 = w0;
ror    eax, 8            ; w0 = R(w0,8);
mov    edx, eax          ; w2 = w0;
xor    eax, ebx          ; w0^= w1;
call   ebp               ; w0 = M(w0);
xor    eax, edx          ; w0^= w2;
ror    ebx, 16           ; w1 = R(w1,16);
xor    eax, ebx          ; w0^= w1;
ror    ebx, 8            ; w1 = R(w1,8);
xor    eax, ebx          ; w0^= w1;
stosd                    ; x[i] = w0;
loop   mix_cols
jmp    enc_main
```

## Counter Mode (CTR)

Block ciphers should never be used in Electronic Code Book (ECB) mode, and the ECB Penguin illustrates why.

As you can see, blocks of identical plaintext encrypted under the same key produce identical ciphertext blocks; this is why modes of encryption were invented. Galois/Counter Mode (GCM) is authenticated encryption that uses Counter (CTR) mode to provide confidentiality. The concept of CTR mode, which turns a block cipher into a stream cipher, was first proposed by Whitfield Diffie and Martin Hellman in their 1979 publication, Privacy and Authentication: An Introduction to Cryptography. CTR mode works by encrypting a nonce and counter, then using the result to encrypt our plain text with a simple XOR operation. Since AES encrypts 16-byte blocks, a counter can be 8 bytes and a nonce 8 bytes.

The following is a very simple implementation of this mode using the AES-128 implementation.

```
// encrypt using Counter (CTR) mode
void encrypt(W l, B*c, B*p, B*k){
W i,r;
B t[32];

// copy master key to local buffer
F(16)t[i+16]=k[i];

while(l){
// copy counter+nonce to local buffer
F(16)t[i]=c[i];

// encrypt t
E(t);

// XOR plaintext with ciphertext
r=l>16?16:l;
F(r)p[i]^=t[i];

// update length + position
l-=r;p+=r;

// update counter
for(i=16;i>0;i--)
if(++c[i-1])break;
}
}
```

In assembly.

```
; void encrypt(W len, B *ctr, B *in, B *key)
_encrypt:
lea    esi,[esp+32+4]
lodsd
xchg   eax, ecx          ; ecx = len
lodsd
xchg   eax, ebp          ; ebp = ctr
lodsd
xchg   eax, edx          ; edx = in
lodsd
xchg   esi, eax          ; esi = key
; copy master key to local buffer
; F(16)t[i+16]=key[i];
lea    edi, [esp+16]     ; edi = &t[16]
movsd
movsd
movsd
movsd
aes_l0:
xor    eax, eax
jecxz  aes_l3            ; while(len){
; copy counter+nonce to local buffer
; F(16)t[i]=ctr[i];
mov    edi, esp          ; edi = t
mov    esi, ebp          ; esi = ctr
push   edi
movsd
movsd
movsd
movsd
; encrypt t
call   _E                ; E(t)
pop    edi
aes_l1:
; xor plaintext with ciphertext
; r=len>16?16:len;
; F(r)in[i]^=t[i];
mov    bl, [edi+eax]     ;
xor    [edx], bl         ; *in++^=t[i];
inc    edx               ;
inc    eax               ; i++
cmp    al, 16            ;
loopne aes_l1            ; while(i!=16 && --ecx!=0)
; update counter
xchg   eax, ecx          ;
mov    cl, 16
aes_l2:
inc    byte[ebp+ecx-1]   ;
loopz  aes_l2            ; while(++c[i]==0 && --ecx!=0)
xchg   eax, ecx
jmp    aes_l0
aes_l3:
ret
```

## Full source code

The following supports AES-128 and AES-256 on 8-bit, 32-bit and 64-bit architectures (the 8-bit version is AES-128 only). A table lookup for the S-box is enabled by default, because the DYNAMIC method is incredibly slow.

```
// Multiplication over GF(2**8)
#if AES_INT_LEN == 1
#define M(x)(((x)<<1)^((-((x)>>7))&0x1b))
#else
u32 M(u32 x) {
u32 t=x&0x80808080;
return((x^t)<<1)^((t>>7)*0x1b);
}
#endif
// the sbox array is used by default for optimal speed
#ifndef DYNAMIC
u8 sbox[256]=
{0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0,
0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc,
0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a,
0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0,
0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b,
0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85,
0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5,
0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17,
0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88,
0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c,
0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9,
0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6,
0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e,
0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94,
0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68,
0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };
#define S(x) sbox[x]
#else
// SubByte
u8 S(u8 x) {
u8 i,y,c;
if(x) {
for(c=i=0,y=1;--i;y=(!c&&y==x)?c=1:y,y^=M(y));
x=y;
for(i=0;i<4;i++) {
x^=y=(y<<1)|(y>>7);
}
}
return x^99;
}
#endif

#if AES_INT_LEN == 1
// 128-bit version for 8-bit architectures

void aes_ecb(void *mk, void *data) {
u8 a,b,c,d,i,j,t,x[AES_BLK_LEN],
k[AES_KEY_LEN],rc=1,*s=(u8*)data;

// copy 128-bit plain text + 128-bit master key to x
for(i=0;i<AES_BLK_LEN;i++) {
x[i]=s[i], k[i]=((u8*)mk)[i];
}

for(;;) {
for(i=0;i<AES_BLK_LEN;i++) {
s[i]=x[i]^k[i];
}
// if round 11, stop
if(rc==108)break;
k[0]^=rc; rc=M(rc);
// ExpandKey
for(i=0;i<4;i++) {
k[i]^=S(k[12+((i-3)&3)]);
}
for(i=0;i<12;i++) {
k[i+4]^=k[i];
}
// SubBytes and ShiftRows
for(i=0;i<AES_BLK_LEN;i++) {
x[(i&3)+((((u32)(i>>2)-(i&3))&3)<<2)]=S(s[i]);
}
// if not round 11
if(rc!=108) {
// MixColumns
for(i=0;i<AES_BLK_LEN;i+=4) {
a=x[i],b=x[i+1],c=x[i+2],d=x[i+3];
for(j=0;j<4;j++) {
x[i+j]^=a^b^c^d^M(a^b);
t=a,a=b,b=c,c=d,d=t;
}
}
}
}
}
#else
// 32-bit or 64-bit versions

#if AES_KEY_LEN == 32

void aes_ecb(void *mk, void *data) {
u32 c=1,i,r=0,w,x[4],k[8], *s=(u32*)data;

// copy 128-bit plain text
for(i=0;i<4;i++) {
x[i] = s[i];
}
// copy 256-bit master key
for(i=0;i<8;i++) {
k[i] = ((u32*)mk)[i];
}

for(;;) {
// 1st part of ExpandKey
w=k[r?3:7];
for(i=0;i<4;i++) {
w=(w&-256) | S(w&255),w=R(w,8);
}
if(!r)w=R(w,8)^c,c=M(c);
// AddRoundKey, 2nd part of ExpandKey
for(i=0;i<4;i++) {
((u32*)s)[i]=x[i]^k[r*4+i], w=k[r*4+i]^=w;
}
// if round 15, stop
if(c==27) break;
r=(r+1)&1;
// SubBytes and ShiftRows
for(i=0;i<AES_BLK_LEN;i++) {
((u8*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((u8*)s)[i]);
}
// if not round 15, MixColumns
if((c!=128) | r) {
for(i=0;i<4;i++) {
w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
}
}
}
}
#else
void aes_ecb(void *mk, void *data) {
u32 c=1,i,w,x[4],k[4],*s=(u32*)data;

// copy 128-bit plain text + 128-bit master key to x
for(i=0;i<4;i++) {
x[i]=s[i], k[i]=((u32*)mk)[i];
}
for(;;) {
// 1st part of ExpandKey
w=k[3];
for(i=0;i<4;i++) {
w=(w&-256)|S(w&255), w=R(w,8);
}
w=R(w, 8)^c;
for(i=0;i<4;i++) {
((u32*)s)[i]=x[i]^k[i], w=k[i]^=w;
}
// if round 11, stop
if(c==108)break;
// update constant
c=M(c);
// SubBytes and ShiftRows
for(i=0;i<AES_BLK_LEN;i++) {
((u8*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(((u8*)s)[i]);
}
// if not round 11, MixColumns
if(c!=108) {
for(i=0;i<4;i++) {
w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
}
}
}
}
#endif
#endif

#ifdef CTR
// encrypt using Counter (CTR) mode
void aes_ctr(u32 len, void *ctr, void *data, void *mk) {
u8 i, r, t[AES_BLK_LEN], *p=data, *c=ctr;

while(len) {
// copy counter+nonce to local buffer
for(i=0;i<AES_BLK_LEN;i++)t[i] = c[i];
// encrypt t
aes_ecb(mk, t);
// XOR plaintext with ciphertext
r = len > AES_BLK_LEN ? AES_BLK_LEN : len;
for(i=0;i<r;i++) p[i] ^= t[i];
// update length + position
len -= r; p += r;
// update counter.
for(i=AES_BLK_LEN;i!=0;i--)
if(++c[i-1]) break;
}
}
#endif
```

## Summary

The final assembly code for ECB mode is 205 bytes, and 272 bytes for CTR mode. You can find the sources here.


## SHAKE-128 Stream Cipher

### Introduction

An Extendable-Output Function (XOF) absorbs input of variable length and generates output of variable length. SHAKE128 and SHAKE256 are two such XOFs designed by the Keccak team and included in the SHA-3 standard. They both use a sponge construction and the Keccak (pronounced "ketchak") permutation function to generate ciphertext streams. The suffixes "128" and "256" indicate the security strength of these functions when the output is of sufficient length. XOFs can be used as a message digest algorithm, a non-reseedable random number generator, for key derivation, or as a stream cipher. Here, you'll see SHAKE128 used as a simple stream cipher. It doesn't support an Initialization Vector (IV) or nonce, nor does it include authentication. It simply absorbs a 128-bit key and switches to XOF mode before using the output to encrypt data with an exclusive OR operation.

### Compact code

This function can be used to initialize a context and encrypt or decrypt a stream of bytes. P is the Keccak-f[1600] permutation function that is also used by SHA-3. This code is only provided as an example of how one might use SHAKE128 as a stream cipher, and shouldn't be used for any mission-critical application.

```
#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

typedef struct _shake_ctx {
int i;
union {
B b[200];
W q[25];
} s;
}shake_ctx;

void P(shake_ctx*ctx) {
W n,i,j,r,x,y,t,Y,b[5],*s=ctx->s.q;
B c=1;

F(n,24){
F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
F(i,5){
t=b[(i+4)%5]^R(b[(i+1)%5],63);
F(j,5)s[i+j*5]^=t;}
t=s[1],y=r=0,x=1;
F(j,24)
r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
Y=s[x+y*5],s[x+y*5]=R(t, -r),t=Y;
F(j,5){
F(i,5)b[i]=s[i+j*5];
F(i,5)
s[i+j*5]=b[i]^(b[(i+2)%5]&~b[(i+1)%5]);}
F(j,7)
if((c=(c<<1)^((c>>7)*113))&2)
*s^=1ULL<<((1<<j)-1);
}
}

#define RT ((1600 - 256) / 8)

void shake128(W len,void*in,shake_ctx*c) {
W i;

if(len) {
F(i,len) {
if(c->i == RT) {
P(c); c->i = 0;
}
((B*)in)[i] ^= c->s.b[c->i++];
}
} else {
F(i,25)c->s.q[i]=0; c->i=0;
F(i,16)c->s.b[i]^=((B*)in)[i];
c->s.b[16]  ^=0x1F;
c->s.b[RT-1]^=0x80;
P(c);
}
}
```

Sources here.

## SIMECK Block Cipher

### Introduction

The Simeck Family of Lightweight Block Ciphers, by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong, was published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK in order to devise more compact and efficient block ciphers. There are three variants:

• Simeck32/64
• Simeck48/96
• Simeck64/128

Only the 64/128 version of encryption is implemented here, because it has the most potential for use in applications. Reference implementations of all variants by Bo Zhu can be found here.

### Compact code

The following combines key scheduling and encryption in one function. mk should point to a 128-bit master key, while p should point to a 64-bit block of plaintext to encrypt. The value 0x938BCA3083F used for the key schedule is exactly 44 bits, and because there are 44 rounds of encryption, it also serves as the loop counter.

```
#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
unsigned long long s=0x938BCA3083F;

k0=k[0];k1=k[1];k2=k[2];k3=k[3];
r=x[0];l=x[1];

do{
r^=R(l,1)^(R(l,5)&l)^k0;
X(l,r);
t=(s&1)-4;
k0^=R(k1,1)^(R(k1,5)&k1)^t;
X(k0,k1);X(k1,k2);X(k2,k3);
} while(s>>=1);
x[0]=r; x[1]=l;
}
```

### x86 assembly

```
; -----------------------------------------------
; SIMECK64/128 Block Cipher in x86 assembly
;
; size: 97 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------
bits 32

%ifndef BIN
global _simeck
%endif

struc pushad_t ; layout of the registers saved by pushad
_edi resd 1
_esi resd 1
_ebp resd 1
_esp resd 1
_ebx resd 1
_edx resd 1
_ecx resd 1
_eax resd 1
.size:
endstruc

%define x0 ebx
%define x1 eax

%define k0 ecx
%define k1 edx
%define k2 edi
%define k3 ebp

%define t0 esi
%define s0 dword[esp+_edi+4]
%define s1 dword[esp+_esi+4]

_simeck:
pushad                  ; save registers
mov    edi, 0xBCA3083F  ; s = 0x938BCA3083F
mov    esi, 0x938
pushad                  ; store s on the stack as s0, s1
mov    esi, [esp+64+4]  ; esi=mk
lodsd
xchg   eax, k0
lodsd
xchg   eax, k1
lodsd
xchg   eax, k2
lodsd
xchg   eax, k3
mov    esi, [esp+64+8] ; esi=x
push   esi
lodsd
xchg   eax, x0
lodsd
sm_l0:
xor    x0, k0  ; x[0]^=k[0];
mov    t0, x1  ; x[0]^=R(x[1],1);
rol    t0, 1   ;
xor    x0, t0  ;
rol    t0, 4   ; x[0]^=(R(x[1],5)&x[1]);
and    t0, x1  ;
xor    x0, t0  ;

xchg   x0, x1  ; X(x[0],x[1]);

; t0 = (s & 1) - 4;
xor    t0, t0
shr    s1, 1
rcr    s0, 1
adc    t0, -4  ; t0 = (s & 1) - 4

xor    k0, t0  ; k[0]^=t0;
mov    t0, k1  ; k[0]^=R(k[1],1);
rol    t0, 1   ;
xor    k0, t0  ;
rol    t0, 4   ; k[0]^=(R(k[1],5)&k[1]);
and    t0, k1  ;
xor    k0, t0  ;

xchg   k0, k1  ; X(k[0],k[1]);
xchg   k1, k2  ; X(k[1],k[2]);
xchg   k2, k3  ; X(k[2],k[3]);

cmp    s0, 0
jnz    sm_l0

pop    edi
xchg   eax, x0
stosd          ; x[0]=x0;
xchg   eax, x0
stosd          ; x[1]=x1;
popad          ; remove s from the stack
popad          ; restore registers
ret
```

### ARM assembly

```
.arm
.arch armv6
.text
.align  2

.global simeck

k  .req r0
p  .req r1

r  .req r2
l  .req r3

k0 .req r4
k1 .req r5
k2 .req r6
k3 .req r7

t0 .req r8
t1 .req r9

sx .req r10
sy .req r11

simeck:
// save registers
push   {r0-r12, lr}

// unsigned long long s=0x938BCA3083F;
ldr    sx, =0xBCA3083F
ldr    sy, =0x938

// k0=k[0]; k1=k[1]; k2=k[2]; k3=k[3];
ldm    k, {k0,k1,k2,k3}

// r=x[0]; l=x[1];
ldm    p, {r,l}
sm_l0:
// r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
eor    t0, k0, l,ror #31
and    t1, l, l, ror #27
eor    t0, t1
mov    t1, l
eor    l, r, t0
mov    r, t1

// t1 = (s & 1) - 4;
and    t1, sx, #1
sub    t1, #4

// k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
// X(k0,k1); X(k1,k2); X(k2,k3);
eor    k0, k0, k1, ror #31
and    t0, k1, k1, ror #27
eor    t0, k0
mov    k0, k1
mov    k1, k2
mov    k2, k3
eor    k3, t0, t1

// s >>= 1
movs   sy, sy, lsr #1
movs   sx, sx, rrx
bne    sm_l0

// x[0]=r; x[1]=l;
stm    p, {r,l}
// restore registers, and return
pop    {r0-r12, pc}
```

### ARM64 / AArch64 assembly

```
// SIMECK in ARM64 assembly
// 100 bytes

.arch armv8-a
.text
.global simeck

simeck:
// unsigned long long s = 0x938BCA3083F;
movz    x2, 0x083F
movk    x2, 0xBCA3, lsl 16
movk    x2, 0x0938, lsl 32

ldp     w3, w4, [x0]
ldp     w5, w6, [x0, 8]

ldp     w8, w7, [x1]
L0:
// r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
eor     w9, w3, w7, ror 31
and     w10, w7, w7, ror 27
eor     w9, w9, w10
mov     w10, w7
eor     w7, w8, w9
mov     w8, w10

// t1 = (s & 1) - 4;
// k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
// X(k0,k1); X(k1,k2); X(k2,k3);
eor     w3, w3, w4, ror 31
and     w9, w4, w4, ror 27
eor     w9, w9, w3
mov     w3, w4
mov     w4, w5
mov     w5, w6
and     x10, x2, 1
sub     x10, x10, 4
eor     w6, w9, w10

// s >>= 1
lsr     x2, x2, 1
cbnz    x2, L0

// save 64-bit ciphertext
stp     w8, w7, [x1]
ret
```

Sources here.


## Xoodoo Permutation Function

### Introduction

Xoodoo is a cryptographic permutation function designed by the Keccak team along with Seth Hoffert and Johan De Meulder. It was described by Joan Daemen at the 2017 Workshop on Elliptic Curve Cryptography in a talk, Innovations in permutation-based crypto. The design is very similar to the Keccak permutation function, but obviously draws inspiration from the design of Gimli, which also uses a 384-bit state and works efficiently on many platforms. The C code shown here is derived from a Python implementation. The Keccak team have posted their own implementations here, with further documentation in the Xoodoo cookbook.

### C code

When calling this function, the p parameter should point to a 48-byte buffer.

```
#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
W e[4],a,b,c,t,r,i,*s=p;
W x[12]={
0x058,0x038,0x3c0,0x0d0,
0x120,0x014,0x060,0x02c,
0x380,0x0f0,0x1a0,0x012};

for(r=0;r<12;r++){
F(4)
e[i]=R(s[i]^s[i+4]^s[i+8],18),
e[i]^=R(e[i],9);
F(12)
s[i]^=e[(i-1)&3];
X(7,4);X(7,5);X(7,6);
s[0]^=x[r];
F(4)
a=s[i],
b=s[i+4],
c=R(s[i+8],21),
s[i+8]=R((b&~a)^c,24),
s[i+4]=R((a&~c)^b,31),
s[i]^=c&~b;
X(8,10);X(9,11);
}
}
```

### x86 Assembly

```
%define x0 eax
%define x1 ebx
%define x2 edx

xoodoo:
_xoodoo:
pushad                   ; save registers
mov    esi, [esp+32+4]   ; esi = state
xor    ecx, ecx          ; ecx = 0
mul    ecx               ; eax = 0, edx = 0
pushad                   ; allocate 32 bytes of memory
mov    edi, esp          ; edi = e
call   ld_rc
dw     0x58,  0x38, 0x3c0, 0xd0
dw     0x120, 0x14,  0x60, 0x2c
dw     0x380, 0xf0, 0x1a0, 0x12
ld_rc:
pop    ebx                  ; ebx = rc
xoo_main:
movzx  ebp, word[ebx+eax*2] ; ebp = rc[r]
mov    cl, 4                ; i = 4
;
; Theta
;
; e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
; e[i]^= ROTR32(e[i], 9);
xd_l1:
lodsd                    ; eax  = x[i]
xor    eax, [esi+16-4]   ; eax ^= x[i+4]
xor    eax, [esi+32-4]   ; eax ^= x[i+8]
ror    eax, 18
mov    ebx, eax
ror    ebx, 9
xor    eax, ebx
stosd                    ; e[i] = eax
loop   xd_l1
; x[i]^= e[(i - 1) & 3];
dec    edx               ; edx = -1
mov    cl, 12
xd_l2:
mov    eax, edx          ; eax = edx & 3
and    eax, 3
mov    eax, [edi+eax*4]  ; eax = e[(i - 1) & 3]
inc    edx               ; i++
xor    [esi+edx*4], eax  ; x[i] ^= eax
loop   xd_l2

; XCHG(x[7], x[4]);
mov    eax, [esi+7*4]
xchg   eax, [esi+4*4]

; XCHG(x[7], x[5]);
xchg   eax, [esi+5*4]

; XCHG(x[7], x[6]);
xchg   eax, [esi+6*4]
mov    [esi+7*4], eax

; x[0] ^= rc[r];
xor    [esi], ebp
mov    cl, 4
xd_l6:
; x0 = x[i+0];
lodsd

; x1 = x[i+4];
mov    x1, [esi+16-4]

; x2 = ROTR32(x[i+8], 21);
mov    x2, [esi+32-4]
ror    x2, 21

; x[i+8] = ROTR32((~x0 & x1) ^ x2, 24);
not    x0
and    x0, x1
xor    x0, x2
ror    x0, 24
mov    [esi+32-4], x0

; x[i+4] = ROTR32((~x2 & x0) ^ x1, 31);
push   x2
not    x2
and    x2, [esi-4]
xor    x2, x1
ror    x2, 31
mov    [esi+16-4], x2
pop    x2

; x[i+0] ^= ~x1 & x2;
not    x1
and    x1, x2
xor    [esi-4], x1
loop   xd_l6
sub    esi, 16           ; esi = x

; XCHG(x[8], x[10]);
; XCHG(x[9], x[11]);
mov    eax, [esi+8*4]
mov    ebp, [esi+9*4]

xchg   eax, [esi+10*4]
xchg   ebp, [esi+11*4]

mov    [esi+8*4], eax
mov    [esi+9*4], ebp

; r++
inc    eax
cmp    al, 12
jnz    xoo_main

; release memory
popad
; restore registers
popad
ret
```

### ARM assembly

ARM offers a handy instruction, BIC, which is the equivalent of (x & ~y) in C.

This instruction is used where the Chi and Rho east steps are combined.

```.arm
.arch armv6
.text
.align  2

.global xoodoo

state .req r0
x     .req r1

r     .req r2
i     .req r3

x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7

rc    .req r8
xt    .req r9

e     .req sp

xoodoo:
// save registers
push   {r0-r12, lr}

mov    r, #12              // 12 rounds
sub    sp, #16             // allocate 16 bytes
xoodoo_main:
mov    i, #0               // i = 0
mov    x, state
theta_l0:
ldr    x2, [x, #32]        // x2 = x[i+8]
ldr    x1, [x, #16]        // x1 = x[i+4]
ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

// e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
eor    x0, x1
eor    x0, x2
ror    x0, #18

// e[i]^= ROTR32(e[i], 9);
eor    x0, x0, ror #9
str    x0, [sp, i, lsl #2]  // store in e

add    i, #1                // i++
cmp    i, #4                // i < 4
bne    theta_l0             //

// x[i]^= e[(i - 1) & 3];
mov    i, #0               // i = 0
mov    x, state            // x = state
theta_l1:
sub    xt, i, #1
and    xt, #3                // xt = (i - 1) & 3
ldr    xt, [sp, xt, lsl #2]  // xt = e[(i - 1) & 3]
ldr    x0, [x, i, lsl #2]    // x0 = x[i]
eor    x0, xt                // x0 ^= xt
str    x0, [x, i, lsl #2]    // x[i] = x0
add    i, #1                 // i++
cmp    i, #12                // i < 12
bne    theta_l1

// Rho west
// XCHG(x[7], x[4]);
// XCHG(x[7], x[5]);
// XCHG(x[7], x[6]);
add    x, state, #16       // x = &state[4]
ldm    x, {x0, x1, x2, x3}
mov    xt, x0              // xt = x[4]
mov    x0, x3              // x[4] = x[7]
mov    x3, x2              // x[7] = x[6]
mov    x2, x1              // x[6] = x[5]
mov    x1, xt              // x[5] = xt
stm    x, {x0, x1, x2, x3}

mov    x, state

// Iota
// x[0] ^= rc[12 - r];
adr    rc, rc_tab           // rc = round constants
rsb    xt, r, #12           // xt = 12 - r (round index)
add    rc, rc, xt, lsl #1
ldrh   xt, [rc]             // xt = rc_tab[12 - r]
ldr    x0, [x]              // load word
eor    x0, xt               // xor
str    x0, [x]              // store word

mov    i, #4
chi:
// Chi and Rho east
// x0 = x[i+0];
ldr    x0, [x]

// x1 = x[i+4];
ldr    x1, [x, #16]

// x2 = ROTR32(x[i+8], 21);
ldr    x2, [x, #32]
ror    x2, #21

// x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
bic    xt, x1, x0
eor    xt, x2
ror    xt, #24
str    xt, [x, #32]

// x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
bic    xt, x0, x2
eor    xt, x1
ror    xt, #31
str    xt, [x, #16]

// x[i+0]^= x2 & ~x1;
bic    xt, x2, x1
eor    xt, x0
str    xt, [x], #4

// i--
subs   i, #1
bne    chi

add    x, state, #32       // x = &state[8]

// XCHG(x[8], x[10]);
ldm    x, {x0, x1, x2, x3}
push   {x0}
mov    x0, x2
pop    {x2}

// XCHG(x[9], x[11]);
push   {x1}
mov    x1, x3
pop    {x3}
stm    x, {x0, x1, x2, x3}

subs   r, #1               // r--
bne    xoodoo_main         // r>0

// release stack
add    sp, #16

// restore registers, and return
pop    {r0-r12, pc}

// round constants
rc_tab:
.hword 0x058, 0x038, 0x3c0, 0x0d0
.hword 0x120, 0x014, 0x060, 0x02c
.hword 0x380, 0x0f0, 0x1a0, 0x012
```

### ARM64 assembly

```// Xoodoo in ARM64 assembly
// 268 bytes

.arch armv8-a
.text

.global xoodoo

xoodoo:
sub    sp, sp, 16          // allocate 16 bytes
mov    w9, 12               // 12 rounds
L0:
mov    w7, 0                // i = 0
mov    x1, x0
L1:
ldr    w4, [x1, 32]         // w4 = s[i+8]
ldr    w3, [x1, 16]         // w3 = s[i+4]
ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

// e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
eor    w2, w2, w3
eor    w2, w2, w4
ror    w2, w2, 18

// e[i] ^= R(e[i], 9);
eor    w2, w2, w2, ror 9
str    w2, [sp, x7, lsl 2]  // store in e

add    w7, w7, 1            // i++
cmp    w7, 4                // i < 4
bne    L1                   //

// s[i]^= e[(i - 1) & 3];
mov    w7, 0                // i = 0
L2:
sub    w2, w7, 1
and    w2, w2, 3            // w2 = (i - 1) & 3
ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
eor    w3, w3, w2           // w3 ^= w2
str    w3, [x0, x7, lsl 2]  // s[i] = w3
add    w7, w7, 1            // i++
cmp    w7, 12               // i < 12
bne    L2

// Rho west
// X(s[7], s[4]);
// X(s[7], s[5]);
// X(s[7], s[6]);
ldp    w2, w3, [x0, 16]
ldp    w4, w5, [x0, 24]
stp    w5, w2, [x0, 16]
stp    w3, w4, [x0, 24]

// Iota
// s[0] ^= rc[12 - r];
adr    x2, rc               // x2 = round constants
mov    w5, 12
sub    w5, w5, w9           // w5 = 12 - r (round index)
ldrh   w2, [x2, x5, lsl 1]  // w2 = rc[12 - r]
ldr    w3, [x0]             // load word
eor    w3, w3, w2           // xor
str    w3, [x0]             // store word

mov    w7, 4
mov    x1, x0
L3:
// Chi and Rho east
// a = s[i+0];
ldr    w2, [x1]

// b = s[i+4];
ldr    w3, [x1, 16]

// c = R(s[i+8], 21);
ldr    w4, [x1, 32]
ror    w4, w4, 21

// s[i+8] = R((b & ~a) ^ c, 24);
bic    w5, w3, w2
eor    w5, w5, w4
ror    w5, w5, 24
str    w5, [x1, 32]

// s[i+4] = R((a & ~c) ^ b, 31);
bic    w5, w2, w4
eor    w5, w5, w3
ror    w5, w5, 31
str    w5, [x1, 16]

// s[i+0]^= c & ~b;
bic    w5, w4, w3
eor    w5, w5, w2
str    w5, [x1], 4

// i--
subs   w7, w7, 1
bne    L3

// X(s[8], s[10]);
// X(s[9], s[11]);
ldp    w2, w3, [x0, 32] // 8, 9
ldp    w4, w5, [x0, 40] // 10, 11
stp    w2, w3, [x0, 40]
stp    w4, w5, [x0, 32]

subs   w9, w9, 1           // r--
bne    L0                  // r != 0

// release stack
add    sp, sp, 16
ret
// round constants
rc:
.hword 0x058, 0x038, 0x3c0, 0x0d0
.hword 0x120, 0x014, 0x060, 0x02c
.hword 0x380, 0x0f0, 0x1a0, 0x012
```

### Summary

Xoodoo with 12 rounds and a 384-bit state is much more compact than Keccak with 22 rounds and an 800-bit state. Presumably Keccak is more secure, but it cannot be vectorized to perform as well as Xoodoo or Gimli. Sources here.


## SHACAL-2 Block Cipher

### Introduction

SHACAL-1 and SHACAL-2 are block ciphers based on the Secure Hash Standard. They were proposed by Helena Handschuh and David Naccache while both worked for the now defunct smartcard company Gemplus International. SHACAL-1 is an 80-round block cipher built atop the SHA-1 compression function. It supports a 160-bit block with keys of between 128 and 512 bits. SHACAL-2 is a 64-round block cipher built atop the SHA-256 compression function and supports a larger 256-bit block. Both were submitted to the New European Schemes for Signatures, Integrity and Encryption (NESSIE) project. SHACAL-1 was dropped from the portfolio in 2003 due to concerns about its key schedule, but SHACAL-2 was selected. The best known attack against SHACAL-2 is a related-key rectangle attack on 44 of its 64 rounds.

### Macros and data types

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)

#define rev32(x) __builtin_bswap32(x)

typedef unsigned long long Q;
typedef unsigned int W;
typedef unsigned char B;

typedef struct _sha256_ctx {
W s[8];
union {
B b[64];
W w[16];
Q q[8];
}x;
Q len;
}sha256_ctx;
```

### Compact code

The following is not optimized for performance. It’s a simplified version intended to be used as a reference. The mk parameter should point to a key of between 128 and 512 bits (shorter keys are zero-padded to 512 bits) while data should point to a 256-bit block of plaintext to encrypt.

```#define CH(x,y,z)(((x)&(y))^(~(x)&(z)))
#define MAJ(x,y,z)(((x)&(y))^((x)&(z))^((y)&(z)))
#define EP0(x)(R(x,2)^R(x,13)^R(x,22))
#define EP1(x)(R(x,6)^R(x,11)^R(x,25))
#define SIG0(x)(R(x,7)^R(x,18)^((x)>>3))
#define SIG1(x)(R(x,17)^R(x,19)^((x)>>10))

void shacal2(void*mk,void*data) {
W t1,t2,i,w[64],s[8],*k=mk,*x=data;

W K[64]=
{ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 };

F(16)w[i]=rev32(k[i]);
// expand key
for(i=16;i<64;i++)
w[i]=SIG1(w[i-2])+w[i-7]+SIG0(w[i-15])+w[i-16];
F(8)s[i]=rev32(x[i]);
// encrypt
F(64) {
t1=s[7]+EP1(s[4])+CH(s[4],s[5],s[6])+w[i]+K[i];
t2=EP0(s[0])+MAJ(s[0],s[1],s[2]);
s[7]=s[6],s[6]=s[5],s[5]=s[4],s[4]=s[3]+t1;
s[3]=s[2],s[2]=s[1],s[1]=s[0],s[0]=t1+t2;
}
// store 256-bit ciphertext
F(8)x[i]=rev32(s[i]);
}
```

Sources here.

## XTEA Block Cipher

### Introduction

TEA Extensions (XTEA) is a 64-bit block cipher with support for 128-bit keys. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA), which was discussed previously in this post. Compared to its predecessor, XTEA has a more complex key schedule and a rearrangement of shifts, XORs, and additions. Only encryption is implemented here, which can then be used in counter (CTR) mode to create a stream cipher. XTEA is not considered a secure block cipher and shouldn’t be considered for any project that requires a high level of security. Moreover, the following code is not optimized for performance. It’s a simplified version intended to be used only as a reference.

### Encryption

This version uses 32 rounds by default. i is initialized to 64 because each iteration performs only half a round: one 32-bit half of the block is updated and then swapped with the other, while an if statement updates sum on every second iteration. mk should point to a 128-bit key while data should point to a 64-bit block of plaintext to encrypt.

```// uses 32 rounds by default
void xtea(void *key, void *data) {
int      i;
uint32_t x0, x1, t, sum=0;

uint32_t *k=(uint32_t*)key;
uint32_t *x=(uint32_t*)data;

x0 = x[0]; x1 = x[1];

for (i=64; i>0; i--) {
t = sum;
// add constant every 2nd round
if (i & 1) {
sum += 0x9E3779B9;
t = sum >> 11;
}
x0 += ((((x1 << 4) ^
(x1 >> 5)) + x1) ^
(sum + k[t & 3]));

XCHG(x0, x1);
}
// save 64-bit cipher text
x[0] = x0; x[1] = x1;
}
```

### x86 assembly

```bits 32

%define x0  eax
%define x1  ebx

%define t0  ebp
%define t1  esi

%define k   edi

%define sum edx

xtea:
_xtea:
pushad
mov    edi, [esp+32+4]   ; edi = key
mov    esi, [esp+32+8]   ; esi = data
push   64
pop    ecx
xor    edx, edx          ; sum = 0
push   esi
lodsd
xchg   eax, x1
lodsd
xchg   eax, x1
xtea_enc:
mov    t0, x1            ; t0   = x1 << 4
shl    t0, 4

mov    t1, x1            ; t1   = x1 >> 5
shr    t1, 5

xor    t0, t1            ; t0  ^= t1
add    t0, x1            ; t0  += x1;

mov    t1, sum           ; t1   = sum
test   cl, 1
jz     xtea_l1

add    sum, 0x9E3779B9   ; sum += 0x9E3779B9
mov    t1, sum
shr    t1, 11            ; t1 = sum >> 11
xtea_l1:
and    t1, 3             ; t1  &= 3
mov    t1, [k+4*t1]      ; t1 = k[t1]
add    t1, sum           ; t1 += sum

xor    t0, t1            ; t0 ^= t1

add    x0, t0            ; x0 += t0
xchg   x0, x1            ; XCHG(x0, x1);
loop   xtea_enc

pop    edi               ; edi = x
stosd                    ; x[0] = x0
xchg   eax, x1
stosd                    ; x[1] = x1
popad
ret
```

### ARM / AArch32

Most ARM instructions support conditional execution, which allows us to eliminate the branch used in the x86 version.

```    .arch armv7-a
.text

.equ ROUNDS, 32

.global xtea

// xtea(void*mk, void*data);
xtea:
// save registers
push   {r2-r8, lr}

mov    r7, #(ROUNDS * 2)

ldm    r1, {r2, r4}         // x0  = x[0], x1 = x[1];
mov    r3, #0               // sum = 0;
ldr    r5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
mov    r6, r3               // t0 = sum;
tst    r7, #1               // if (i & 1)

addne  r3, r5               // sum += 0x9E3779B9;
lsrne  r6, r3, #11          // t0 = sum >> 11

and    r6, #3               // t0 %= 4
ldr    r6, [r0, r6, lsl #2] // t0 = k[t0];
add    r8, r3, r6           // t1 = sum + t0
mov    r6, r4, lsl #4       // t0 = (x1 << 4)
eor    r6, r4, lsr #5       // t0^= (x1 >> 5)
add    r6, r4               // t0+= x1
eor    r6, r8               // t0^= t1
mov    r8, r4               // backup x1
add    r4, r6, r2           // x1 = t0 + x0

// XCHG(x0, x1)
mov    r2, r8               // x0 = x1
subs   r7, r7, #1           // i--
bne    L0                   // i>0

// store 64-bit ciphertext
stm    r1, {r2, r4}
pop    {r2-r8, pc}
```

Because the link register (LR) is saved on entry and popped into the program counter (PC) at the end, execution returns directly to the caller.

### ARM64 / AArch64

```// XTEA in ARM64 assembly
// 92 bytes

.arch armv8-a
.text

.equ ROUNDS, 32

.global xtea

// xtea(void*mk, void*data);
xtea:
mov    w7, ROUNDS * 2

ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
mov    w3, wzr              // sum = 0;
ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
mov    w6, w3               // t0 = sum;
tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

// the next 2 only execute if (i % 2) is not zero
add    w3, w3, w5           // sum += 0x9E3779B9;
lsr    w6, w3, 11           // t0 = sum >> 11
L1:
and    w6, w6, 3            // t0 %= 4
ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
add    w8, w3, w6           // t1 = sum + t0
lsl    w6, w4, 4             // t0 = (x1 << 4)
eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
add    w6, w6, w4           // t0+= x1
eor    w6, w6, w8           // t0^= t1
mov    w8, w4               // backup x1
add    w4, w6, w2           // x1 = t0 + x0

// XCHG(x0, x1)
mov    w2, w8               // x0 = x1
subs   w7, w7, 1
bne    L0                   // i > 0
stp    w2, w4, [x1]
ret
```

### Summary

It’s safe to assume people searching the internet for a “Tiny Encryption Algorithm” will come across TEA/XTEA, but for a variety of reasons, it is not suitable to use anymore. The Chaskey block cipher is a good alternative with support for 128-bit blocks.

Sources here.


## BlaBla Stream Cipher

### Introduction

BlaBla is a 256-bit stream cipher designed by Jean-Philippe Aumasson. It uses the same permutation function as the cryptographic hash algorithm BLAKE2b, which is similar to the permutation function used in the ChaCha stream cipher, hence the name. Frank Denis described it as “…yet another evil experiment from Jean-Philippe Aumasson”. Frank informed me the permutation function used in BlaBla is also used in this project here, which is an Authenticated Encryption with Associated Data (AEAD) algorithm using a Masked Even-Mansour (MEM) construction. More information can be found in Improved Masking for Tweakable Blockciphers with Applications to Authenticated Encryption.

### Initialization

ChaCha uses an internal state of 512 bits, or 64 bytes, and works efficiently on 32-bit CPUs. BlaBla uses an internal state of 1024 bits, or 128 bytes, and works efficiently on 64-bit CPUs. There are nine 64-bit constants used to initialize the internal state. The first four of these include the same 32-bit integers used to initialize the 256-bit key variant of ChaCha, which is the string “expand 32-byte k”. The other values were possibly generated randomly, but there’s currently no explanation for their origin.

```// setup the key/internal state
void blabla_setkey(blabla_ctx *c, const void *key, const void *nonce) {
int i;

c->q[ 0] = 0x6170786593810fab;
c->q[ 1] = 0x3320646ec7398aee;
c->q[ 2] = 0x79622d3217318274;

// copy 256-bit key
F(4) c->q[i+4] = ((W*)key)[i];

c->q[ 9] = 0xb68f143029225fc9;
c->q[10] = 0x8da1e08468303aa6;
c->q[11] = 0xa48a209acd50a4a7;
c->q[12] = 0x7fdc12f23f90778c;

// set 64-bit counter
c->q[13] = 1;

// copy 128-bit nonce
F(2)c->q[i+14]=((W*)nonce)[i];
}
```

### AMD64 assembly

```blabla_iv1:
dq     0x6170786593810fab
dq     0x3320646ec7398aee
dq     0x79622d3217318274
blabla_iv2:
dq     0xb68f143029225fc9
dq     0x8da1e08468303aa6
dq     0xa48a209acd50a4a7
dq     0x7fdc12f23f90778c

; void blabla_setkey(blabla_ctx *c, const void *key, const void *nonce)
blabla_setkey:
push   rsi                  ; save key
; copy 4 IV
lea    rsi, [rel blabla_iv1]
push   (4*8)/4
pop    rcx
rep    movsd
pop    rsi                  ; restore key
; copy 256-bit key to internal state
; F(4) c->q[i+4] = k[i];
mov    cl, 32/4
rep    movsd
; copy 5 more IV
lea    rsi, [rel blabla_iv2]
mov    cl, (5*8)/4
rep    movsd
; set 64-bit counter
; c->q[13] = 1;
push   1
pop    rax
stosq
; copy 128-bit nonce to internal state
; F(2)c->q[i+14]=n[i];
push   rdx
pop    rsi
movsq
movsq
ret
```

### Stream Generator

```void blabla_stream(blabla_ctx *s, void *out) {
W a,b,c,d,i,t,r,*x=(W*)out;
uint16_t v[8]={0xC840,0xD951,0xEA62,0xFB73,
0xFA50,0xCB61,0xD872,0xE943};

// store internal state in buffer
F(16)x[i] = s->q[i];

// permute buffer
F(80) {
d=v[i%8];
a=(d&15);b=(d>>4&15);
c=(d>>8&15);d>>=12;

for(r=0x3F101820;r;r>>=8)
x[a]+=x[b],
x[d]=R(x[d]^x[a],(r&255)),
X(a,c),X(b,d);
}
// add internal state to buffer
F(16)x[i] += s->q[i];
// increase counter of internal state
s->q[13]++;
}
```

### AMD64 assembly

```blabla_v:
dw     0c840H, 0d951H
dw     0ea62H, 0fb73H
dw     0fa50H, 0cb61H
dw     0d872H, 0e943H

; void blabla_stream(blabla_ctx *s, void *out)
blabla_stream:
push    rbx
push    rcx
push    rsi
push    rdi
push    rbp

; store internal state in buffer
; F(16)x[i] = s->q[i];
push    128/4
pop     rcx
xchg    rsi, rdi
push    rsi               ; save s
push    rdi               ; save out
rep     movsd
pop     rdi               ; restore out

; permute buffer
; F(80) {
xor     eax, eax
bb_sx0:
push    rax               ; save i
; d=v[i%8];
and     al, 7
lea     rsi, [rel blabla_v]
movzx   edx, word[rsi+rax*2]
; a=(d&15);b=(d>>4&15);
mov     eax, edx          ; a = d & 15
and     eax, 15
mov     ebx, edx          ; b = d >> 4 & 15
shr     ebx, 4
and     ebx, 15
; c=(d>>8&15);d>>=12;
mov     ebp, edx          ; c = d >> 8 & 15
shr     ebp, 8
and     ebp, 15
shr     edx, 12           ; d >>= 12

; for (r=0x3F101820;r;r>>=8)
mov    ecx, 0x3F101820 ; load rotation values
bb_sx1:
; x[a] += x[b]
mov    rsi, [rdi+rbx*8]
add    [rdi+rax*8], rsi

; x[d] = R(x[d] ^ x[a], (r & 255))
mov    rsi, [rdi+rdx*8]
xor    rsi, [rdi+rax*8]
ror    rsi, cl
mov    [rdi+rdx*8], rsi

; X(a, c); X(b, d);
xchg   rax, rbp
xchg   rbx, rdx

; r >>= 8
shr    ecx, 8       ; shift until done
jnz    bb_sx1

pop    rax
inc    al
cmp    al, 80
jnz    bb_sx0

; add internal state to buffer
; F(16)x[i] += s->q[i];
pop     rsi         ; restore state
push    16
pop     rcx
bb_sx5:
lodsq
add     rax, [rdi]
stosq
loop    bb_sx5

; update 64-bit counter
; c->q[13]++;
inc     qword[rsi+13*8-128]

pop     rbp
pop     rdi
pop     rsi
pop     rcx
pop     rbx
ret
```

### Encryption/Decryption

```// encrypt or decrypt stream of len-bytes
void blabla_encrypt(blabla_ctx *ctx, void *buf, size_t len) {
W i, r;
B c[128], *p=(B*)buf;

while(len) {
// generate 128 bytes of keystream
blabla_stream(ctx, c);
r = len>128?128:len;
// xor plaintext with ciphertext
F(r) p[i] ^= c[i];
// decrease total length, advance buffer
len -= r; p+= r;
}
}
```

### AMD64 assembly

```; void blabla_encrypt(blabla_ctx *ctx, void *buf, size_t len)
blabla_encrypt:
push    rbx               ; save rbx

push    rsi               ; rbx = buf
pop     rbx

push    rdx               ; rcx = len
pop     rcx

sub     rsp, 124
push    rax
push    rsp               ; rsi = c[128]
pop     rsi
bb_e0:
jrcxz   bb_e3             ; exit if len==0
; blabla_stream(ctx, c);
call    blabla_stream
xor     eax, eax          ; i = 0
bb_e1:
mov     dl, byte[rsi+rax] ;
xor     byte[rbx], dl     ; *p ^= c[i]
inc     rbx               ; p++
inc     al                ; i++
cmp     al, 128           ; i<128
loopne  bb_e1             ; --len
jmp     bb_e0
bb_e3:
pop     rax
add     rsp, 124          ; release buffer
pop     rbx               ; restore rbx
ret
```

Sources here.

## HIGHT Block Cipher

### Introduction

HIGHT, which stands for HIGh security and light weigHT, is a 64-bit block cipher with support for 128-bit keys. It was first proposed at the 2006 Cryptographic Hardware and Embedded Systems (CHES) conference held in Japan. HIGHT attracted a lot of attention upon its release because it only uses add-rotate-xor (ARX) operations on 8-bit words. It was selected in 2010 by the Telecommunications Technology Association (TTA) of Korea to become part of the ISO/IEC 18033-3 portfolio, which also includes the following 64-bit block ciphers: TDEA, MISTY1 and CAST-128. Some points to note:

• 64-bit input/output data block size
• 128-bit key length
• 32 rounds using addition modulo 2**8, rotation, and XOR (ARX)
• No S-Box (non-linear operations using modular addition)
• Designed for low-resource devices (data storage, power, etc.)

### Constant Generation

“Random” constants are used in the generation of subkeys. They protect the cipher against simple slide attacks. I’ve included the option of using a lookup table or a function that calculates the constants at runtime.

```// use lookup table for constants
#ifdef LUT
uint8_t rc[128]=
{ 0x5a, 0x6d, 0x36, 0x1b, 0x0d, 0x06, 0x03, 0x41,
0x60, 0x30, 0x18, 0x4c, 0x66, 0x33, 0x59, 0x2c,
0x56, 0x2b, 0x15, 0x4a, 0x65, 0x72, 0x39, 0x1c,
0x4e, 0x67, 0x73, 0x79, 0x3c, 0x5e, 0x6f, 0x37,
0x5b, 0x2d, 0x16, 0x0b, 0x05, 0x42, 0x21, 0x50,
0x28, 0x54, 0x2a, 0x55, 0x6a, 0x75, 0x7a, 0x7d,
0x3e, 0x5f, 0x2f, 0x17, 0x4b, 0x25, 0x52, 0x29,
0x14, 0x0a, 0x45, 0x62, 0x31, 0x58, 0x6c, 0x76,
0x3b, 0x1d, 0x0e, 0x47, 0x63, 0x71, 0x78, 0x7c,
0x7e, 0x7f, 0x3f, 0x1f, 0x0f, 0x07, 0x43, 0x61,
0x70, 0x38, 0x5c, 0x6e, 0x77, 0x7b, 0x3d, 0x1e,
0x4f, 0x27, 0x53, 0x69, 0x34, 0x1a, 0x4d, 0x26,
0x13, 0x49, 0x24, 0x12, 0x09, 0x04, 0x02, 0x01,
0x40, 0x20, 0x10, 0x08, 0x44, 0x22, 0x11, 0x48,
0x64, 0x32, 0x19, 0x0c, 0x46, 0x23, 0x51, 0x68,
0x74, 0x3a, 0x5d, 0x2e, 0x57, 0x6b, 0x35, 0x5a };
#else

void gen_const(uint8_t *ci)
{
int     i, j;
uint8_t c;

union {
uint8_t  b[128+8];
uint32_t w[(128+8)/4];
} s;

// zero initialize s
memset(&s, 0, sizeof(s));

// set initial bits
s.w[1] = 65537;
s.w[0] = s.w[1] << 8;

// set first constant
// precalculated from bits of s array
ci[0] = 0x5A;

for(i=1; i<128; i++) {
c = s.b[i + 2] ^
s.b[i - 1];

s.b[i + 6] = c;

for(j=1; j<7; j++) {
c += c;
c ^= s.b[i + 6 - j];
}
ci[i] = c;
}
}
#endif
```

The assembly code is less than 128 bytes, so it’s used in place of the 128-byte precomputed array.

```gen_constx:
mov    esi, edi ; esi = ci
xor    eax, eax
mov    cl, 80h + 8 ; allocate 136 bytes
sub    esp, ecx
mov    edi, esp
; memset(&s, 0, sizeof(s));
rep    stosb
; s.w[1] = 65537;
mov    dword[esp+4], 65537
; s.w[0] = s.w[1] << 8;
mov    dword[esp], 65537 << 8
; ci[0] = 0x5A;
mov    byte[esi], 0x5A
; i = 1;
inc    eax
gen_l1:
; c = s.b[i+2] ^ s.b[i-1];
mov    cl, [esp+eax+2]
xor    cl, [esp+eax-1]
; s.b[i+6] = c
mov    [esp+eax+6], cl
; j = 1
cdq
inc    edx
gen_l2:
; c += c
add    cl, cl
; c ^= s.b[(i + 6) - j];
lea    ebx, [eax+6]
sub    ebx, edx
xor    cl, [esp+ebx]
; j++
inc    edx
cmp    dl, 7
jnz    gen_l2
; ci[i] = c;
mov    [esi+eax], cl
; i++
inc    al
; i<128
jns    gen_l1
lea    esp, [esp+eax+8]
ret
```

### Key Schedule

The master key is expanded from 128 bits to 1024 bits. Each byte of the master key, indexed using an unexplained calculation, is added to one of the generated constants.

```void hight128_setkey(void *in, void *out)
{
uint8_t i, j, idx;
w128_t  *wk, *mk;
uint8_t *sk;

mk=(w128_t*)in;
wk=(w128_t*)out;
sk=(uint8_t*)out;

// apply key whitening
wk->w[0] = mk->w[3];
wk->w[1] = mk->w[0];

#ifdef LUT
memcpy(&sk[8], rc, sizeof(rc));
#else
// generate constants
gen_const(&sk[8]);
#endif

// generate subkeys
for(i=0; i<8; i++) {
sk += 8;
for(j=0; j<8; j++) {
idx = (j - i + 8) & 7;

sk[0] += mk->b[idx  ];
sk[8] += mk->b[idx+8];
sk++;
}
}
}
```

There may be a way to implement generation of the subkeys in one loop, but the calculation of idx currently requires two loops.

hight128_setkeyx:
_hight128_setkeyx:
pushad
mov    esi, [esp+32+4] ; esi = in
mov    edi, [esp+32+8] ; edi = out
mov    eax, [esi+3*4]
stosd
mov    eax, [esi]
stosd
; i=0
xor    ecx, ecx
pushad                 ; gen_constx clobbers registers
call   gen_constx
popad
sk_l1:
; j=0
xor    edx, edx
sk_l2:
; idx = ((j + 8) - i) & 7;
lea    ebx, [edx+8]
sub    ebx, ecx
and    ebx, 7
; sk[0] += mk->b[idx  ];
mov    al, [esi+ebx]
add    [edi], al
; sk[8] += mk->b[idx+8];
mov    al, [esi+ebx+8]
add    [edi+8], al
; sk++;
inc    edi
; j++
inc    edx
; j<8
cmp    dl, 8
jnz    sk_l2
; sk += 8
add    edi, 8
; i++
inc    ecx
; i<8
cmp    cl, 8
jnz    sk_l1
popad
ret
```

### Round functions

Two linear functions provide diffusion. Although they’re shown separately here, a different version inlines them in the main encryption loop.

```uint8_t F0(uint8_t x) {
return ROTL8(x, 1) ^ ROTL8(x, 2) ^ ROTL8(x, 7);
}

uint8_t F1(uint8_t x) {
return ROTL8(x, 3) ^ ROTL8(x, 4) ^ ROTL8(x, 6);
}
```

### Encryption

The only nonlinear operation here is modular addition. Perhaps it would have made more sense to use an S-box layer; a 4 x 4 bit-sliced S-box would work well here.

```void hight128_encrypt(void *data, void *keys)
{
int      i;
w64_t    *x;
uint8_t  *sk, *wk=(uint8_t*)keys;

x  = (w64_t*)data;
sk = &wk[8];

// mix key with 1st 4 bytes
x->b[0] += wk[0]; x->b[2] ^= wk[1];
x->b[4] += wk[2]; x->b[6] ^= wk[3];

for(i=0; i<32; i++) {
// circular shift left by 8 bits
x->q = ROTL64(x->q, 8);
// apply linear/non-linear operations
x->b[2] += (F1(x->b[1]) ^ *sk++);
x->b[4] ^= (F0(x->b[3]) + *sk++);
x->b[6] += (F1(x->b[5]) ^ *sk++);
x->b[0] ^= (F0(x->b[7]) + *sk++);
}
// circular shift right by 8 bits
x->q = ROTL64(x->q, 56);

// mix key with 2nd 4 bytes
x->b[0] += wk[4]; x->b[2] ^= wk[5];
x->b[4] += wk[6]; x->b[6] ^= wk[7];
}
```

The assembly code for this is slightly different from the C code, and wasn’t enjoyable to implement 😉 For each round, the inner loop runs twice, with each iteration performing one addition and one XOR, and the linear functions inlined. The 64-bit rotation uses ADD/ADC.

```; rotate ciphertext left by CL bits
rotl64:
mov    eax, [edi]
mov    ebx, [edi+4]
rt_l0:
add    eax, eax          ; rotate ebx:eax left by 1
adc    ebx, ebx
adc    eax, 0
loop   rt_l0
stosd
xchg   eax, ebx
stosd
sub    edi, 8            ; rewind to data
ret

hight128_encryptx:
_hight128_encryptx:
pushad
mov    edi, [esp+32+4] ; data
mov    esi, [esp+32+8] ; wk and sk
push   2
pop    ecx
push   edi
hi_l0:
lodsw
; x->b[0] += wk[0];
add    [edi], al
; x->b[2] ^= wk[1];
xor    [edi+2], ah
scasd
loop   hi_l0
pop    edi
lodsd
; save wk[4]
push   eax
; 32 rounds
mov    cl, 32
hi_enc:
push   ecx
; x->q = ROTL64(x->q, 8);
mov    cl, 8
call   rotl64
mov    cl, 2
movzx  edx, cl
hi_l1:
; c = x->b[j-1];
mov    al, [edi+edx-1]
; c = ROTL8(c, 3) ^ ROTL8(c, 4) ^ ROTL8(c, 6);
mov    ah, al
rol    ah, 3
rol    al, 4
xor    ah, al
rol    al, 6-4
xor    ah, al
; x->b[j] += (c ^ *sk++);
lodsb
xor    al, ah
add    [edi+edx], al
; c = x->b[j+1];
mov    al, [edi+edx+1]
; c = ROTL8(c, 1) ^ ROTL8(c, 2) ^ ROTL8(c, 7);
mov    ah, al
rol    ah, 1
rol    al, 2
xor    ah, al
rol    al, 7-2
xor    ah, al
; x->b[(j+2) & 7] ^= (c + *sk++);
lodsb
add    al, ah
lea    ebx, [edx+2]
and    bl, 7
xor    [edi+ebx], al
add    dl, 4             ; j += 4
loop   hi_l1
pop    ecx
loop   hi_enc
; x->q = ROTL64(x->q, 56);
mov    cl, 56
call   rotl64
; restore wk[4]
pop    eax
; x->b[0] += wk[4];
add    [edi], al
; x->b[2] ^= wk[5];
xor    [edi+2], ah
bswap  eax                ; al = wk[7], ah = wk[6]
; x->b[4] += wk[6];
add    [edi+4], ah
; x->b[6] ^= wk[7];
xor    [edi+6], al
popad
ret
```

I’ve no doubt the above code can be optimized further, but it would require removing the dependency on EDI in the main loop. The load/save operations using EDI are only necessary at the beginning and end of the code. If the plaintext were loaded into two 32-bit registers, the 64-bit rotation in the main loop could potentially be removed by rotating registers and swapping bytes.

### Summary

Because of the linear functions, it’s difficult to see how one could optimize HIGHT for a 32-bit or 64-bit CPU. I’ve speculated that if the bytes of plaintext were rearranged, one could optimize it further for a 16-bit CPU, but it would still need 8-bit operations for the linear functions. The assembly code generated by MSVC is approx. 440 bytes, while the handwritten assembly is currently 283 bytes.

Sources here.

### Introduction

RoadRunneR is a 64-bit block cipher with support for 80 and 128-bit keys. It was designed by Adnan Baysal and Suhap Sahin and published in 2015. The 80-bit variant uses 10 rounds while the 128-bit variant uses 12. In this post, I’ll show x86 assembly and C code for the 128-bit version. The authors of the cipher state the main objectives of RoadRunneR in their proposal:

• Implementation efficiency in 8-bit CPUs
• No table and SRAM usage
• Provable security, as in the wide trail design strategy

According to the authors, RoadRunneR is a perfect choice for devices with very restricted memory resources and for applications requiring reasonable throughput.

Although the cipher is specifically designed for 8-bit CPUs, the code shown here has a mixture of 8-bit and 32-bit code.

### Compact code

The following implementation combines both key scheduling and encryption in one function. The mk parameter should point to a 128-bit master key and the data parameter should point to a 64-bit block of plaintext to be encrypted.

```#define R(v,n)(((v)<<(n))|((v)>>(8-(n))))
#define X(x,y)t=x,x=y,y=t;

typedef unsigned char B;
typedef unsigned int W;

// S-Layer
void S(void *p) {
B t, *x=(B*)p;

t = x[3];
x[3] &= x[2];
x[3] ^= x[1];
x[1] |= x[2];

x[1] ^= x[0];
x[0] &= x[3];
x[0] ^=  t;
t    &= x[1];
x[2] ^=  t;
}

// encrypt 64-bits of data using 128-bit key
void roadrunner(void *mk, void *data) {
int i, j, r;
W   t, *x=(W*)data, *k=(W*)mk;
B   s, *p, k_idx=4;

// apply K-Layer
x[0] ^= k[0];

// apply 12 rounds of encryption
for(r=12; r>0; r--) {
// F round
t=x[0]; // save original 32-bits
p=(B*)x;
for(i=0; i<3; i++) {
if(i==2) p[3] ^= r;
// apply S-Layer
S(p);
for (j=0; j<4; j++) {
// apply L-Layer
s = R(p[j], 1) ^ p[j];
p[j] ^= R(s,  1);
// apply K-Layer
p[j] ^= ((B*)k)[(k_idx%16)+j];
}
k_idx+=4;
}
// apply S-Layer
S(p);
x[0]^= x[1]; x[1] = t;
}
// permute
X(x[0], x[1]);
// apply K-Layer
x[0] ^= k[1];
}
```

### x86 assembly

```; -----------------------------------------------
; RoadRunneR-64/128 block cipher in x86 assembly
;
; size: 135 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------

bits 32

%ifndef BIN
global roadrunner
global _roadrunner
%endif

; stack layout after pushad
struc pushad_t
_edi resd 1
_esi resd 1
_ebp resd 1
_esp resd 1
_ebx resd 1
_edx resd 1
_ecx resd 1
_eax resd 1
.size:
endstruc

roadrunner:
_roadrunner:
pushad

mov    edi, [esp+32+4]    ; k = mk
mov    esi, [esp+32+8]    ; x = data
xor    ebx, ebx           ; k_idx = 0
; apply K-Layer
; x[0] ^= ((W*)mk)[0];
mov    ebp, [edi]         ; t = ((W*)mk)[0]
xor    [esi], ebp         ; x[0] ^= t
; apply 12 rounds of encryption
; for(r=12; r>0; r--) {
push   12
pop    ecx
rr_L0:
; F round
pushad                    ; save all registers: for(r=12
mov    ebp, [esi]         ; t = x[0]
mov    edx, ecx           ; edx = r
mov    cl, 3              ; i = 3
rr_L1:
inc    ebx                ; k_idx = (k_idx + 4) % 16
and    ebx, 3
pushad                    ; save all registers: for(i=3
cmp    cl, 1              ; if (i == 1)
jne    rr_L2
xor    [esi+3], dl        ; p[3] ^= r
rr_L2:
lea    edi, [edi+ebx*4]   ; edi = &k[k_idx]
; apply S-Layer
; S(p);
call   S
; for (j=4; j>0; j--) {
mov    cl, 4              ; j = 4
; apply L-Layer
rr_L3:
; s = R(p[j], 1) ^ p[j];
mov    al, [esi]          ; s = p[j]
rol    al, 1              ; s = R(s, 1)
xor    al, [esi]          ; s^= p[j]
; s = R(s, 1) ^ p[j];
rol    al, 1              ; s = R(s, 1)
xor    al, [esi]          ; s^= p[j]

; apply K-Layer
; p[j] = s ^ ((B*)k)[(k_idx%16)+j];
xor    al, byte[edi]
mov    [esi], al
cmpsb
loop   rr_L3              ; j>0; j--

popad                     ; restore registers: for(i=3
loop   rr_L1              ; i>0; i--

; apply S-Layer
; S(p);
call   S
; x[0]^= x[1];
lodsd
xor    eax, [esi]
mov    [esi-4], eax

; x[1] = t;
mov    [esi], ebp
mov    [esp+_ebx], ebx    ; save k_idx
popad                     ; restore registers: for(r=12
loop   rr_L0              ; r>0; r--

; permute
; X(x[0], x[1]);
lodsd
xchg   eax, [esi]

; apply K-layer
; x[0] ^= ((W*)mk)[1];
xor    eax, [edi+4]
mov    [esi-4], eax

popad
ret

; S-Layer
S:
mov    eax, [esi]    ; al = x[0], ah = x[1]
mov    edx, eax      ; t = ROTR32(x[0], 16);
ror    edx, 16       ; dl = x[2], dh = x[3]
and    [esi+3], dl   ; x[3] &= x[2];
xor    [esi+3], ah   ; x[3] ^= x[1];
or     [esi+1], dl   ; x[1] |= x[2];
xor    [esi+1], al   ; x[1] ^= x[0];
mov    dl, [esi+3]
and    [esi+0], dl   ; x[0] &= x[3];
xor    [esi+0], dh   ; x[0] ^= t;
and    dh, [esi+1]   ; t &= x[1];
xor    [esi+2], dh   ; x[2] ^= t;
ret
```

### S Layer

Using bit-slice techniques for an S-box layer has become popular as research into lightweight cryptography for resource-constrained environments progresses. Block ciphers such as NOEKEON, PRIDE, RECTANGLE and Fantomas/Robin all use a bit-sliced S-box layer. Advantages of a bit-sliced S-box layer in both hardware and software implementations include protection against side-channel analysis and elimination of the S-box lookup table that would otherwise require ROM/RAM. The authors state that the S-box for RoadRunneR was found through a brute-force search of assembly code combinations. The search space was restricted to 4 input words, a single temporary word, and the following instructions: AND, OR, XOR, NOT, and MOV.

Finding Optimal Bitsliced Implementations of 4×4-bit S-boxes describes this process in more detail. The following values for the 4×4 S-Layer are used.

```
// S-Layer
void sbox(uint8_t *x)
{
    uint8_t t;

    t = x[3];

    x[3] &= x[2];
    x[3] ^= x[1];
    x[1] |= x[2];
    x[1] ^= x[0];
    x[0] &= x[3];
    x[0] ^=  t;
    t    &= x[1];
    x[2] ^=  t;
}
```

Cryptographic Analysis of All 4 × 4-Bit S-Boxes

### L Layer

A linear function is applied to each byte of the state to provide diffusion. Such a function is normally implemented as an XOR of rotated or shifted values, and is commonly seen in block ciphers and hash algorithms such as Serpent, SM3 and SM4.

### K Layer

This is not referenced in the original paper, but is essentially key whitening, also known as key addition, which was first used by Ron Rivest in DES-X. The idea is that XORing key bits into the state during the encryption process strengthens the final ciphertext against exhaustive brute-force attacks. It's a technique used in many SPN block ciphers; in the Advanced Encryption Standard (AES), for example, it's known as AddRoundKey.

### SLK Function

Combining the S, L and K layers gives the following function.

```
// SLK function
void SLK(w32_t *x, uint8_t *sk)
{
    int     i;
    uint8_t t;
    uint8_t *p=x->b;

    // apply S-Layer
    sbox(p);

    for (i=3; i>=0; i--) {
        // apply L-Layer
        t   = ROTL8(*p, 1) ^ *p;
        *p ^= ROTL8(t,  1);

        // apply K-Layer
        *p++ ^= *sk++;
    }
}
```

### The F-Function

The main function consists of a 4-round substitution–permutation network (SPN) structure. The first three rounds use the same function, SLK, which applies the S, L and K layers already described. The final round applies only the S layer. In addition to these layers, a round constant is added after the second call to SLK. This is simply the round number XORed against the least significant byte of the state, which for the 128-bit key variant counts down from 12 to 1.

It is, as the authors describe, intended to protect against simple Slide Attacks. Biryukov and Wagner state the following in Advanced Slide Attacks.

> The most obvious type of slide attack is usually easy to prevent by destroying self-similarity in iterative ciphers, for example by adding iteration counters or random constants.

```
// F round
void F(w64_t *blk, void *key,
       uint8_t *key_idx, uint8_t ci)
{
    int      i;
    uint32_t t;
    uint8_t  *rk=(uint8_t*)key;
    w32_t    *x=(w32_t*)blk;

    // save 32-bits
    t = x->w;

    for (i=3; i>0; i--) {
        if (i==1) x->b[3] ^= ci;
        // apply S,L,K layers
        SLK(x, rk + *key_idx);
        *key_idx = (*key_idx + 4) & 15;
    }

    // apply S-Layer
    sbox(x->b);

    blk->w[0]^= blk->w[1];
    blk->w[1] = t;
}
```

### Encryption

The variable key_idx, which acts as a master key index, is initialized to 4, skipping the first 4 bytes that are XORed against the plaintext as part of the K-Layer before 12 rounds of F. Note how rnd is initialized to 12 and passed to F; this is the round constant used to protect against simple slide attacks.

```
// encrypt 64-bits of data using 128-bit key
// (signature assumed from the x86 version's calling convention)
void rr_encrypt(void *key, void *data)
{
    int      rnd;
    uint8_t  key_idx;
    w64_t    *x=(w64_t*)data;
    uint32_t *rk=(uint32_t*)key;

    // initialize master key index
    key_idx = 4;

    // apply K-Layer
    x->w[0] ^= rk[0];

    // apply rounds
    for (rnd=RR_ROUNDS; rnd>0; rnd--) {
        F(x, rk, &key_idx, rnd);
    }
    // P-Layer?
    XCHG(x->w[0], x->w[1]);
    // apply K-Layer
    x->w[0] ^= rk[1];
}
```

Sources here.

## CHAM Block Cipher

### Introduction

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices was published in December 2017 at the 20th Annual International Conference on Information Security and Cryptology (ICISC) held in South Korea. CHAM consists of three ciphers: CHAM-64/128 for 16-bit architectures, and CHAM-128/128 and CHAM-128/256 for 32-bit architectures. This post focuses only on CHAM-128/128, which operates on four branches of 32 bits each. The only operations used are 32-bit modular addition, rotation and exclusive-OR (ARX). The design of CHAM draws inspiration from the SIMON and SPECK block ciphers published by the NSA.

The following are parameters for all three variants. N is the block length, K is the key length, R is the number of rounds, W is the width of a word and K/W is the number of words per key.

Cipher
N K R W K/W
CHAM-64/128 64 128 80 16 8
CHAM-128/128 128 128 80 32 4
CHAM-128/256 128 256 96 32 8

### Key schedule

Eight round keys are generated from the 128-bit master key. Each round key is 32 bits in length.

```
void cham128_setkey(void *in, void *out)
{
    int i;
    uint32_t *k=(uint32_t*)in;
    uint32_t *rk=(uint32_t*)out;

    for (i=0; i<KW; i++) {
        rk[i]            = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 8);
        rk[(i + KW) ^ 1] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 11);
    }
}
```

### x86 assembly

```
%define K 128   ; key length
%define N 128   ; block length
%define R 80    ; number of rounds
%define W 32    ; word length
%define KW K/W  ; number of words per key

cham128_setkeyx:
_cham128_setkeyx:
mov    esi, [esp+32+4]  ; k  = in
mov    edi, [esp+32+8]  ; rk = out
xor    eax, eax         ; i  = 0
sk_l0:
mov    ebx, [esi+eax*4] ; ebx = k[i]
mov    ecx, ebx         ; ecx = k[i]
mov    edx, ebx         ; edx = k[i]
rol    ebx, 1           ; ROTL32(k[i], 1)
rol    ecx, 8           ; ROTL32(k[i], 8)
xor    edx, ebx         ; k[i] ^ ROTL32(k[i], 1)
xor    edx, ecx
mov    [edi+eax*4], edx ; rk[i] = edx
xor    edx, ecx         ; reset edx
rol    ecx, 3           ; k[i] ^ ROTL32(k[i], 11)
xor    edx, ecx
lea    ebx, [eax+KW]
xor    ebx, 1
mov    [edi+ebx*4], edx ; rk[(i + KW) ^ 1] = edx
inc    al
cmp    al, KW
jnz    sk_l0
ret
```

### Encryption

There are 80 rounds in total; this is significantly more than the 32 used for SPECK-128/128.

```
void cham128_encrypt(void *keys, void *data)
{
    int i;
    uint32_t x0, x1, x2, x3;
    uint32_t *rk=(uint32_t*)keys;
    uint32_t *x=(uint32_t*)data;

    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    for (i=0; i<R; i++) {
        if ((i & 1) == 0) {
            x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 1) ^ rk[i & 7]), 8);
        } else {
            x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 8) ^ rk[i & 7]), 1);
        }
        XCHG(x0, x1);
        XCHG(x1, x2);
        XCHG(x2, x3);
    }
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}
```

### Compact code

```
#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
    W rk[8],*w=p,*k=mk,i,t;

    F(4)
      t=k[i]^R(k[i],31),
      rk[i]=t^R(k[i],24),
      rk[(i+4)^1]=t^R(k[i],21);
    F(80)
      t=w[3],w[0]^=i,w[3]=rk[i&7],
      w[3]=w[0]+(w[3]^R(w[1],(i&1)?24:31)),
      w[3]=R(w[3],(i&1)?31:24),
      w[0]=w[1],w[1]=w[2],w[2]=t;
}
```

### x86 assembly

Only encryption is implemented here; if the cipher is used in CTR mode, decryption isn't necessary.

```
; -----------------------------------------------
; CHAM-128/128 block cipher in x86 assembly
;
; size: 124 bytes
;
; global calls use cdecl convention
;
; -----------------------------------------------

bits 32

%ifndef BIN
global cham
global _cham
%endif

%define K 128   ; key length
%define N 128   ; block length
%define R 80    ; number of rounds
%define W 32    ; word length
%define KW K/W  ; number of words per key

%define x0 ebp
%define x1 ebx
%define x2 edx
%define x3 esi
%define rk edi

cham:
_cham:
mov    esi, [esp+32+4]  ; k = key
mov    ebp, [esp+32+8]  ; x = data
mov    edi, esp         ; edi = rk
xor    eax, eax         ; i  = 0
sk_l0:
mov    ebx, [esi+eax*4] ; ebx = k[i]
mov    ecx, ebx         ; ecx = k[i]
mov    edx, ebx         ; edx = k[i]
rol    ebx, 1           ; ROTL32(k[i], 1)
rol    ecx, 8           ; ROTL32(k[i], 8)
xor    edx, ebx         ; k[i] ^ ROTL32(k[i], 1)
xor    edx, ecx
mov    [edi+eax*4], edx ; rk[i] = edx
xor    edx, ecx         ; reset edx
rol    ecx, 3           ; k[i] ^ ROTL32(k[i], 11)
xor    edx, ecx
lea    ebx, [eax+KW]
xor    ebx, 1
mov    [edi+ebx*4], edx ; rk[(i + KW) ^ 1] = edx
inc    eax
cmp    al, KW
jnz    sk_l0

xchg   esi, ebp
push   esi
lodsd
xchg   eax, x0
lodsd
xchg   eax, x1
lodsd
xchg   eax, x2
lodsd
xchg   eax, x3
xor    eax, eax ; i = 0
; initialize sub keys
enc_l0:
mov    edi, ebp  ; k = keys
jmp    enc_lx
enc_l1:
test   al, 7    ; i & 7
jz     enc_l0
enc_lx:
push   eax      ; save i
mov    cx, 0x0108
test   al, 1    ; if ((i & 1)==0)
jnz    enc_l2

xchg   cl, ch
enc_l2:
xor    x0, eax          ; x0 ^= i
mov    eax, x1
rol    eax, cl          ;
xor    eax, [edi]       ; ROTL32(x1, r0) ^ *rk++
scasd
xchg   cl, ch
rol    x0, cl

xchg   x0, x1          ; XCHG(x0, x1);
xchg   x1, x2          ; XCHG(x1, x2);
xchg   x2, x3          ; XCHG(x2, x3);

pop    eax      ; restore i
inc    eax      ; i++
cmp    al, R    ; i<R
jnz    enc_l1

pop    edi
xchg   eax, x0
stosd           ; x[0] = x0;
xchg   eax, x1
stosd           ; x[1] = x1;
xchg   eax, x2
stosd           ; x[2] = x2;
xchg   eax, x3
stosd           ; x[3] = x3;
ret
```

### ARM / AArch32 assembly

```
.arm
.arch armv7-a
.text

.global cham

k   .req r0
x   .req r1

// data
x0 .req r0
x1 .req r2
x2 .req r3
x3 .req r4

// round keys
rk .req sp

k0 .req r6
k1 .req r7
k2 .req r8

i  .req r10

cham:
// save registers
push   {r0-r12,lr}

// allocate memory for round keys
sub    sp, #32

// derive round keys from 128-bit key
mov    i, #0                 // i  = 0
cham_init:
ldr    k0, [k, i, lsl #2]    // k0 = k[i];
ror    k1, k0, #31           // k1 = ROTR32(k0, 31);
ror    k2, k0, #24           // k2 = ROTR32(k0, 24);
eor    k0, k1                // k0^= k1;
eor    k1, k0, k2            // rk[i] = k0 ^ k2;
str    k1, [rk, i, lsl #2]
eor    k0, k2, ror #29       // k0 ^= ROTR32(k2, 29);
add    k1, i, #4             // k1 = (i+KW)
eor    k1, #1                // k1 = (i+KW) ^ 1
str    k0, [rk, k1, lsl #2]  // rk[(i+KW)^1] = k0;
add    i, #1                 // i++
cmp    i, #4                 // i<KW
bne    cham_init             //

ldm    x, {x0, x1, x2, x3}

// perform encryption
mov    i, #0                 // i = 0
cham_enc:
mov    k0, x3
eor    x0, i                 // x0 ^= i
tst    i, #1                 // if (i & 1)

// x3  = rk[i & 7];
and    k1, i, #7             // k1 = i & 7;
ldr    x3, [rk, k1, lsl #2]  // x3 = rk[i & 7];

// execution depends on (i % 2)
// x3 ^= (i & 1) ? ROTR32(x1, 24) : ROTR32(x1, 31);
eorne  x3, x1, ror #24       //
eoreq  x3, x1, ror #31       //

add    x3, x0                // x3 += x0;

// x3 = (i & 1) ? ROTR32(x3, 31) : ROTR32(x3, 24);
rorne  x3, #31               // x3 = ROTR32(x3, 31);
roreq  x3, #24               // x3 = ROTR32(x3, 24);

// swap
mov    x0, x1                // x0 = x1;
mov    x1, x2                // x1 = x2;
mov    x2, k0                // x2 = k0;

add    i, #1                 // i++
cmp    i, #80                // i<R
bne    cham_enc              //

// save 128-bit cipher text
stm    x, {x0, x1, x2, x3}   // x[0] = x0; x[1] = x1;
// x[2] = x2; x[3] = x3;
// release memory for round keys
add    sp, #32

// restore registers
pop    {r0-r12, pc}
```

### ARM64 / AArch64 assembly

```
// CHAM 128/128 in ARM64 assembly
// 160 bytes

.arch armv8-a
.text
.global cham

// cham(void*mk,void*p);
cham:
sub    sp, sp, 32
mov    w2, wzr
mov    x8, x1
L0:
// t=k[i]^R(k[i],31),
ldr    w5, [x0, x2, lsl 2]
eor    w6, w5, w5, ror 31

// rk[i]=t^R(k[i],24),
eor    w7, w6, w5, ror 24
str    w7, [sp, x2, lsl 2]

// rk[(i+4)^1]=t^R(k[i],21);
eor    w7, w6, w5, ror 21
add    w5, w2, 4             // w5 = i+4
eor    w5, w5, 1             // w5 = (i+4)^1
str    w7, [sp, x5, lsl 2]

// i++
// i < 4
add    w2, w2, 1
cmp    w2, 4
bne    L0

ldp    w0, w1, [x8]
ldp    w2, w3, [x8, 8]

// i = 0
mov    w4, wzr
L1:
tst    w4, 1

// t=w[3],w[0]^=i,w[3]=rk[i%8],
mov    w5, w3
eor    w0, w0, w4
and    w6, w4, 7
ldr    w3, [sp, x6, lsl 2]

// w[3]^=R(w[1],(i & 1) ? 24 : 31),
ror    w6, w1, 24
ror    w7, w1, 31
csel   w6, w6, w7, ne
eor    w3, w3, w6

// w[3]+=w[0],
add    w3, w3, w0

// w[3]=R(w[3],(i & 1) ? 31 : 24),
ror    w6, w3, 31
ror    w7, w3, 24
csel   w3, w6, w7, ne

// w[0]=w[1],w[1]=w[2],w[2]=t;
mov    w0, w1
mov    w1, w2
mov    w2, w5

// i++
// i < 80
add    w4, w4, 1
cmp    w4, 80
bne    L1

stp    w0, w1, [x8]
stp    w2, w3, [x8, 8]

// release memory for round keys
add    sp, sp, 32
ret
```

Sources here.
