Light Message Authentication Code (LightMAC)

Introduction

A Message Authentication Code or MAC is a cryptographic primitive that enables two parties using a shared secret key to authenticate messages exchanged between them.

MACs are used daily to secure online communications and will inevitably become a critical component of embedded systems in the future that have wireless functionality; specifically passive, low-power and IoT devices.

Some MACs are constructed from cryptographic hash algorithms. The Hash-based message authentication code (HMAC) published in 1996 for example recommended using MD5 or SHA-1. However, due to resources required for these hash functions, they were never suitable for microcontrollers.

Other MACs are constructed from block ciphers and this is more ideal for a developer where ROM and RAM are in short supply since there’s already a number of lightweight block ciphers available.

There’s no shortage of cryptographic solutions for desktop computers and mobile devices, however there still remains a significant problem with some of the current standardized algorithms when attempting to implement for resource constrained software or hardware devices.

A draft Report on Lightweight Cryptography published by NIST in August 2016 states that none of the NIST approved hash functions are suitable for resource constrained environments based on the findings in A Study on RAM Requirements of Various SHA-3 Candidates on Low-cost 8-bit CPUs

There’s currently an effort on part of engineers, cryptographers and organizations to design, evaluate and standardize lightweight cryptographic primitives for the embedded industry.

Whether we agree or disagree on the need to embed wireless devices in everyday objects, there will be many that have networking or wireless functionality in future which will require robust encryption solutions.

LightMAC

A MAC Mode for Lightweight Block Ciphers was designed by Atul Luykx, Bart Preneel, Elmar Tischhauser, Kan Yasuda and published in 2016.

\begin{array}{l}  \\\hline   \textbf{Algorithm 1: }\text{LightMAC} _{K_1,K_2}(M)  \\\hline  \quad\textbf{Input: } K_1, K_2\in \{0, 1\}^k,\: M\in \{0,1\}^{\leq 2^s(n-s)}\\  \quad\textbf{Output: } T\in\{0, 1\}^t\\  \textbf{\scriptsize{1}}\;\; V\leftarrow 0^n\in\{0, 1\}^n\\  \textbf{\scriptsize{2}}\;\; M[1]M[2]\cdots M[\ell]\xleftarrow{n-s}M\\  \textbf{\scriptsize{3}}\;\; \textbf{for } i = 1\textbf{ to } \ell - 1\textbf{ do}\\  \textbf{\scriptsize{4}}\;\; \mid V\leftarrow V\oplus E_{K_1}(i_s M[i])\\  \textbf{\scriptsize{5}}\;\; \textbf{end}\\  \textbf{\scriptsize{6}}\;\; V\leftarrow V\oplus (M[\ell]10^\ast)\\  \textbf{\scriptsize{7}}\;\; T\leftarrow \lfloor E_{K_2}(V)\rfloor_t\\  \textbf{\scriptsize{8}}\;\; \textbf{return } T    \\\hline  \end{array}

The other lightweight MAC algorithms under evaluation by NIST are Chaskey and TuLp. I’ve already covered Chaskey in a previous post.

All code and details of the algorithm shown here are derived from reference implementations provided by Atul here on github

Any developers that require a reference implementation should use those source codes instead of what’s shown here.

Instead of using PRESENT block cipher, I’ll use SPECK64/128 due to the ARX design which is much easier to implement on x86 architecture.

Block Cipher Parameters

LightMAC is a mode of operation and depends directly on an external block cipher.

The only other block ciphers that come close to the size of Speck would be XTEA or Chaskey. The following is a list of parameters recommended for use with Speck.

LightMAC Parameters

The authors define the following in reference material which are based on the block cipher.

  • COUNTER_LENGTH
  • Length of the protected counter sum in bytes. Not greater than N/2.

  • TAG_LENGTH
  • Length of tag in bytes. Should be at least 64-bits but not greater than N.

  • BLOCK_LENGTH
  • Length of block in bytes. Same as N.

  • BC_KEY_LENGTH
  • Length of block cipher key. The MAC key is twice this length.

The following table shows some example parameters using existing lightweight block ciphers.

Cipher (E) Block Length (N) Cipher Key Length MAC Key Length (K) Counter Length (S) Tag Length (T)
PRESENT 64-bits 128-bits 256-bits 32-bits 64-bits
SPECK 64-bits 128-bits 256-bits 32-bits 64-bits
AES 128-bits 128-bits 256-bits 32-bits 128-bits

Initialization

In the specification, V denotes an intermediate/local variable of cipher block size N. It is initialized to zero and updated after every encryption using an XOR operation with ciphertext before returning the result in T (truncated if required)

But in my own implementation, I assume T to be of cipher block size N and initialize it to zero. I then update T instead which is why I prefer readers use the reference implementation instead of what’s shown here. 🙂

My reason for not allocating space for V and using T directly is simply to reduce the amount of code required for the assembly code.

Updating

The update process is very similar to what you see used in cryptographic hash algorithms. I was gonna have a more detailed description here but I think comments should be clear enough.

Finalization

An end bit (0x80) is appended to M buffer along with any data remaining or none if the input length was a multiple of the block cipher length.

This is then XORed with any previous cipher block state before being encrypted with the 2nd key before returning.

x86 Assembly code

First, here’s the SPECK block cipher using 64-bit block and 128-bit key.

%define SPECK_RNDS    27
; *****************************************
; Light MAC parameters based on SPECK64-128
;
; N = 64-bits
; K = 128-bits
;
%define COUNTER_LENGTH 4  ; should be <= N/2
%define BLOCK_LENGTH   8  ; equal to N
%define TAG_LENGTH     8  ; >= 64 && <= N
%define BC_KEY_LENGTH 16  ; K

%define ENCRYPT speck64_encryptx
%define LIGHTMAC_KEY_LENGTH BC_KEY_LENGTH*2 ; K*2
    
%define k0 edi    
%define k1 ebp    
%define k2 ecx    
%define k3 esi

%define x0 ebx    
%define x1 edx

; esi = input
; ebp = key

speck64_encryptx:   
      pushad

      push    esi            ; save M
      
      lodsd                  ; x0 = x->w[0]
      xchg    eax, x0
      lodsd                  ; x1 = x->w[1]
      xchg    eax, x1
      
      mov     esi, ebp       ; esi = key
      lodsd
      xchg    eax, k0        ; k0 = key[0] 
      lodsd
      xchg    eax, k1        ; k1 = key[1]
      lodsd
      xchg    eax, k2        ; k2 = key[2]
      lodsd 
      xchg    eax, k3        ; k3 = key[3]    
      xor     eax, eax       ; i = 0
spk_el:
      ; x0 = (ROTR32(x0, 8) + x1) ^ k0;
      ror     x0, 8
      add     x0, x1
      xor     x0, k0
      ; x1 = ROTL32(x1, 3) ^ x0;
      rol     x1, 3
      xor     x1, x0
      ; k1 = (ROTR32(k1, 8) + k0) ^ i;
      ror     k1, 8
      add     k1, k0
      xor     k1, eax
      ; k0 = ROTL32(k0, 3) ^ k1;
      rol     k0, 3
      xor     k0, k1    
      xchg    k3, k2
      xchg    k3, k1
      ; i++
      inc     eax
      cmp     al, SPECK_RNDS    
      jnz     spk_el
      
      pop     edi    
      xchg    eax, x0        ; x->w[0] = x0
      stosd
      xchg    eax, x1        ; x->w[1] = x1
      stosd
      popad
      ret

Now LightMAC.

You might notice how ctr and idx variables are initialized to zero at the same time using CDQ instruction. Once PUSHAD is executed, it preserves EDX on the stack and is then used as the protected counter sum S.

Although we convert the counter to big endian format before saving in block buffer, it wouldn’t affect the security to skip this. I’ve retained it for compatibility with reference but might remove it later.

; void lightmac_tag(const void *msg, uint32_t msglen, 
;     void *tag, void* mkey) 
lightmac_tagx:
_lightmac_tagx:
      pushad
      lea     esi, [esp+32+4]; esi = argv
      lodsd
      xchg    eax, ebx       ; ebx = msg
      lodsd
      cdq                    ; ctr = 0, idx = 0, 
      xchg    eax, ecx       ; ecx = msglen
      lodsd
      xchg    eax, edi       ; edi = tag
      lodsd
      xchg    eax, ebp       ; ebp = mkey      
      pushad                 ; allocate N-bytes for M
      ; zero initialize T
      mov     [edi+0], edx   ; t->w[0] = 0;
      mov     [edi+4], edx   ; t->w[1] = 0;
      ; while we have msg data
lmx_l0:
      mov     esi, esp       ; esi = M
      jecxz   lmx_l2         ; exit loop if msglen == 0
lmx_l1:
      ; add byte to M
      mov     al, [ebx]      ; al = *data++
      inc     ebx
      mov     [esi+edx+COUNTER_LENGTH], al           
      inc     edx            ; idx++
      ; M filled?
      cmp     dl, BLOCK_LENGTH - COUNTER_LENGTH
      ; --msglen
      loopne  lmx_l1
      jne     lmx_l2
      ; add S counter in big endian format
      inc     dword[esp+_edx]; ctr++
      mov     eax, [esp+_edx]
      ; reset index
      cdq                    ; idx = 0
      bswap   eax            ; m.ctr = SWAP32(ctr)
      mov     [esi], eax
      ; encrypt M with E using K1
      call    ENCRYPT
      ; update T
      lodsd                  ; t->w[0] ^= m.w[0];
      xor     [edi+0], eax      
      lodsd                  ; t->w[1] ^= m.w[1];
      xor     [edi+4], eax         
      jmp     lmx_l0         ; keep going
lmx_l2:
      ; add the end bit
      mov     byte[esi+edx+COUNTER_LENGTH], 0x80
      xchg    esi, edi       ; swap T and M
lmx_l3:
      ; update T with any msg data remaining    
      mov     al, [edi+edx+COUNTER_LENGTH]
      xor     [esi+edx], al
      dec     edx
      jns     lmx_l3
      ; advance key to K2
      add     ebp, BC_KEY_LENGTH
      ; encrypt T with E using K2
      call    ENCRYPT
      popad                  ; release memory for M
      popad                  ; restore registers
      ret

Summary

The x86 assembly generated by MSVC using /O2 /Os is 238 bytes. The x86 assembly written by hand is 152 bytes.

In order for developers to benefit from LightMAC on microcontrollers, they should choose a lightweight block cipher but not necessarily SPECK. It’s only used here for illustration.

See lmx32.asm and lightmac.c for any future updates.

Posted in assembly, cryptography, encryption, programming | Tagged , , , , , , , , , | Leave a comment

PRESENT Block Cipher

Introduction

PRESENT is a 64-bit block cipher published in 2007 which provides support for key lengths of 80 and 128-bits. It was designed specifically for hardware implementation and uses a Substitution Permutation Network (SPN) structure which is similar to AES but at the same time completely different.

The difference is that PRESENT is bit oriented whereas AES is byte oriented. For that reason, PRESENT is not suitable for software implementation although I found the bit-permutation layer interesting and was just curious to see results using x86 assembly.

It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.

PRESENT is intended to be used in situations where low-power consumption and high chip efficiency is desired. It has been standardized in ISO-29192 along with CLEFIA which was designed by Sony Corporation.

Today, we’ll focus on the 128-bit key version since it has more potential for future applications than something supporting only 80-bit keys.

I’ve no doubt you can implement PRESENT in 32-bit x86 assembly but because it’s a hardware design, I decided to stick with x86-64 assembly and only implement encryption.

In future blog entries, depending on the design of an algorithm, it will be either 32 or 64-bit assembly and maybe both if suitable but I’ll avoid using MMX which was used in some past entries.

You can find a number of implementations in C and python here.

Key whitening

A linear operation seen in pretty much every block cipher designed since being used by Ron Rivest in 1984 to strengthen the DES block cipher.

It uses a simple eXclusive OR operation with the key and data.

\textbf{addRoundKey. } \text{Given round key } K_i=\kappa_{63}^i\dots \kappa_0^i \text{ for } 1\leq i\leq 32 \text{ and current }\\  \textsc{state } b_{63}\dots b_0, \text{ addRoundKey consists of the operation for } 0\leq j\leq 63,
b_j\rightarrow b_j\oplus\kappa_j^i.

Byte substitution

The non-linear layer is based on a single 4-bit S-box which was designed with hardware optimizations in mind. It replaces each nibble of 64-bits with 4-bit value from predefined table.

\textbf{sBoxlayer. }\text{The S-box used in }\textsc{present }\text{is a 4-bit to 4-bit S-box } S:\mathbb{F}^4_2 \rightarrow\mathbb{F}_2^4.\\  \text{The action of this box in hexadecimal notation is given by the following table.}

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & A & B & C & D & E & F \\\hline    P(i)  & C & 5 & 6 & B & 9 & 0 & A & D & 3 & E & F & 8 & 4 & 7 & 1 & 2  \\\hline  \end{array}

Here’s code generated by MSVC

; 62   :     hi.b[0] = sub_byte(hi.b[0]);               

  movzx edx, cl
  mov QWORD PTR hi$[rbp-48], rcx
  mov eax, edx
  and edx, 15
  shr rax, 4
  mov cl, BYTE PTR sbox$3697[rbp+rax-48]
  shl cl, 4
  or  cl, BYTE PTR sbox$3697[rbp+rdx-48]

The assembly code loads a memory pointer for sbox into the RBX register and expects the AL register to hold the value we’re replacing with XLATB.

For those of you who don’t know, the one-byte instruction XLATB locates a byte entry from a table in memory which is pointed to by the base register (BX).

Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register.

lsb:
    pop     rbx              ; rbx = sbox
    mov     cl, al           ; cl = x
    shr     al, 4            ; al = sbox[x >> 4] << 4
    xlatb                    ; 
    shl     al, 4    
    
    xchg    al, cl
    
    and     al, 15           ; al |= sbox[x & 0x0F]  
    xlatb
    or      al, cl    
    pop     rcx
    pop     rbx    
    ret
    
sub_byte:
    push    rbx 
    push    rcx 
    call    lsb
    ; sbox
    db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

Bit Permutation

A linear operation that permutates 64-bits of the ciphertext state. Each bit is repositioned according to a predefined table.

\textbf{pLayer. } \text{The bit permutation used in }\textsc{present} \text{ is given by the following table.}\\  \text{Bit i of }\textsc{state } \text{is moved to bit position } P(i).

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\    P(i)  & 0 & 16 & 32 & 48 & 1 & 17 & 33 & 49 & 2 & 18 & 34 & 50 & 3 & 19 & 35 & 51  \\\hline  \hline i & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31\\    P(i)  & 4 & 20 & 36 & 52 & 5 & 21 & 37 & 53 & 6 & 22 & 38 & 54 & 7 & 23 & 39 & 55  \\\hline  \hline i & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47\\    P(i)  & 8 & 24 & 40 & 56 & 9 & 25 & 41 & 57 & 10 & 26 & 42 & 58 & 11 & 27 & 43 & 59  \\\hline  \hline i & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63\\    P(i)  & 12 & 28 & 44 & 60 & 13 & 29 & 45 & 61 & 14 & 30 & 46 & 62 & 15 & 31 & 47 & 63  \\\hline  \end{array}

As you might already tell from the table, each bit position increases sequentially by 16 modulo 63.

To obtain the correct bit position during permutation without the use of a table, we can multiply the loop counter by 16 and modulo the result by 63.

The approach I use is to initialize a 32-bit variable R to 0x30201000. Imagine this to be a 4 byte array initialized to 0,16,32,48 which are used for shifting the first 4 bits of ciphertext state.

By incrementing each byte of the R variable every round and rotating by 8 bits, this will generate every shift value required for the permutation operation.

Here’s assembly output from MSVC for the C code above.

; 83   :     t.q = 0;    

  xor eax, eax

; 84   :     r   = 0x30201000;    

  mov r9d, 807407616        ; 30201000H

; 85   :     for (j=0; j<64; j++) {

  xor ebx, ebx
$LL3@present128@2:

; 86   :       // test bit at j'th position and shift left
; 87   :       t.q |= ((p.q >> j) & 1) << (r & 255);      

  mov rdx, QWORD PTR p$[rbp-16]
  mov ecx, ebx
  inc ebx
  shr rdx, cl
  mov ecx, r9d
  and edx, 1
  shl rdx, cl
  or  rax, rdx

; 88   :       r = ROTR32(r+1, 8);

  inc r9d
  ror r9d, 8
  cmp ebx, 64         ; 00000040H
  jl  SHORT $LL3@present128@2

One thing C can’t directly do is take advantage of the x86 status flags so the assembly code is much different.

If bit 0 of P variable is 1, a right shift by 1 will set the carry flag. Then we just avoid setting bits of t variable with Jump If Not Carry (JNC)

Bit Test and Set (BTS) will set bit in RBP based on value in CL.

Here’s handwritten assembly taking advantage of status flags and bit manipulation instructions.

;
    mov     ecx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp         ; t.q = 0;
    xor     ebx, ebx         ; j   = 0;
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rcx         ; set bit in rbp at position in cl
pse_l3:    
    inc     cl               ; r = ROTR32(r+1, 8);
    ror     ecx, 8
    
    add     bl, 4            ; j++, j < 64
    jne     pse_l2

The only other thing to mention is how the j counter represented by the BL register is updated. Notice how it’s being increased by 4 instead of 1.

64 x 4 = 256 so once we’ve permutated 64-bits of ciphertext, BL register will be zero again, hence the use of Jump If Not Equal (JNE) which is testing the Zero Flag (ZF)

While describing the loop above, I rewrote the code to use the following instead which takes advantage of ECX being zero after the loop you can see in full routine.

;
    mov     ebx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp
    mov     cl, 64
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:    
    inc     bl               ; r = ROTR32(r+1, 8);
    ror     ebx, 8
    loop    pse_l2

Key scheduling

Function creates 32 64-bit subkeys using linear/non-linear operations from a 128-bit key.

; rcx = present_ctx
; rdx = key    
present128_setkeyx:
    push    rsi
    push    rdi
    push    rbp
    push    rbx
    
    push    rcx              
    pop     rdi              ; rdi = ctx
    
    mov     rax, [rdx+8]     ; hi = key[1]
    mov     rdx, [rdx  ]     ; lo = key[0]
    
    xor     ecx, ecx         ; rnd = 0
psk:
    stosq                    ; ctx->key[rnd] = hi.q;  
    
    lea     ebx, [rcx*2+2]   ; lo.q  ^= (rnd + rnd) + 2;
    xor     rdx, rbx
    
    push    rax              ; t.q    = hi.q;
    pop     rbx
    push    rdx
    pop     rbp
    
    shl     rax, 61          ; hi.q   = (hi.q & 7) << 61;
    shr     rbp, 3           ; hi.q  |= (lo.q >> 3);
    or      rax, rbp
    
    shl     rdx, 61          ; lo.q    = (lo.q & 7) << 61;
    shr     rbx, 3           ; lo.q   |= ( t.q >> 3);
    or      rdx, rbx
    
    rol     rax, 8           ; hi.q    = ROTL64(hi.q, 8);
    call    sub_byte
    ror     rax, 8
    inc     cl
    cmp     cl, PRESENT_RNDS
    jne     psk
    jmp     pse_l4

Encryption

Assembly

; rcx = present_ctx
; rdx = buf    
present128_encryptx:
    push    rsi
    push    rdi
    push    rbp
    push    rbx
    
    push    rcx              ; rdi = present_ctx
    pop     rdi
    
    mov     rax, [rdx]       ; p.q = buf[0]
    
    push    PRESENT_RNDS-1
    pop     rcx
pse_l0:    
    xor     rax, [rdi]       ; p.q ^= ctx->key[i]
    scasq                    ; advance key position
    push    rcx              ; save i
    mov     cl, 8            ; j = 0
pse_l1:
    call    sub_byte         ; p.b[j] = sub_byte(p.b[j]); 
    ror     rax, 8           ; p.q    = ROTR64(s.q, 8);
    loop    pse_l1           ; j++ < 8
    ;
    mov     ebx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:    
    inc     bl               ; r = ROTR32(r+1, 8);
    ror     ebx, 8

    add     cl, 4            ; j++
    jne     pse_l2

    xchg    rax, rbp         ; p.q = t.q 
    pop     rcx
    loop    pse_l0 
    
    ; post whitening
    xor     rax, [rdi]  ; p.q ^= ctx->key[PRESENT_RNDS-1];
    mov     [rdx], rax  ; buf[0] = p.q
pse_l4:    
    pop     rbx    
    pop     rbp    
    pop     rdi    
    pop     rsi    
    ret

Summary

PRESENT is clearly intended for hardware and not terribly efficient in software but has some interesting features/ideas that could be used elsewhere.

The size of x86-64 assembly is approx. 188 bytes.

Thanks to 0x4d_ for \LaTeX formulas.

Posted in assembly, cryptography, programming, security | Tagged , , | Leave a comment

LEA-128 Block Cipher

Introduction

LEA is a 128-bit block cipher with support for 128, 192 and 256-bit keys published in 2014. It was designed by Deukjo Hong, Jung-Keun Lee, Dong-Chan Kim, Daesung Kwon, Kwon Ho Ryu, and Dong-Geon Lee.

The only operations used for encryption and the key schedule are 32-bit Addition, eXclusive OR and Rotation (ARX structure): the designers state “the usage of 32-bit and 64-bit processors will grow rapidly compared to 8-bit and 16-bit ones”

Today I’ll just focus on an implementation using 128-bit keys referred to as LEA-128. This just about fits onto the 32-bit x86 architecture. The 256-bit version requires additional registers and is probably better suited for 64-bit mode.

Key Schedule

During generation of subkeys, a number of predefined constants are used.

\textbf{Constants. } \text{The key schedule uses several constants for generating round keys, which are defined as}

\delta[0] = 0xc3efe9db, \quad \delta[1] = 0x44626b02,
\delta[2] = 0x79e27c8a, \quad \delta[3] = 0x78df30ec,
\delta[4] = 0x715ea49e, \quad \delta[5] = 0xc785da0a,
\delta[6] = 0xe04ef22a, \quad \delta[7] = 0xe5c40957.

\text{They are obtained from hexadecimal expression of } \sqrt{766995}, \text{ where 76, 69 and 95 are ASCII codes of 'L', 'E' and 'A'.}

There are 3 different key schedule functions but I only focus on the 128-bit variant for now.

\textbf{Key Schedule with a 128-Bit Key. } \text{Let } K = (K[0], K[1], K[2], K[3]) \text{ be a 128-bit key. We set } T[i] = K[i]\: for\: 0\leq i < 4.\\   \text{ Round key} RK_i = (RK_i[0], RK_i[1],\dots , RK_i[5])\: for\: 0\leq i < 24 \text{ are produced through the following relations:}

T[0]\leftarrow ROL_1(T[0]\oplus ROL_i(\delta[i\:mod\: 4])),\\  T[1]\leftarrow ROL_3(T[1]\oplus ROL_{i+1}(\delta[i\:mod\: 4])),\\  T[2]\leftarrow ROL_6(T[2]\oplus ROL_{i+2}(\delta[i\:mod\: 4])),\\  T[3]\leftarrow ROL_{11}(T[3]\oplus ROL_{i+3}(\delta[i\:mod\: 4])),\\  RK_i\leftarrow (T[0], T[1], T[2], T[3], T[1]).

Encryption

This is a really simple algorithm to implement and there’s not much to say that can’t be found in the specification published by the authors.

So here’s the function in C to perform encryption on 128-bits of plaintext using 128-bit key in one single call. I’d suggest using something like this with counter (CTR) mode so it doesn’t require the decryption procedure.

Full x86 assembly

You might notice the constants are different from C sources. For whatever reason, the last 3 are rotated a number of bits left before entering the encryption loop as you can see above.

Obviously a compiler will be smart enough to see this and automatically optimize but for assembly code, we must rotate them manually. They’re stored on the stack using PUSHAD. So, EDI, ESI, EBP and ESP are used for TD array.

ESP has to be initialized after the PUSHAD for obvious reasons. We don’t want to cause an exception.

bits 32
    
    %ifndef BIN
      global lea128_encryptx
      global _lea128_encryptx
    %endif
    
struc pushad_t
  _edi resd 1
  _esi resd 1
  _ebp resd 1
  _esp resd 1
  _ebx resd 1
  _edx resd 1
  _ecx resd 1
  _eax resd 1
  .size:
endstruc
    
%define k0 ebx
%define k1 edx
%define k2 edi
%define k3 ebp
     
%define p0 [esi+4*0]     
%define p1 [esi+4*1]     
%define p2 [esi+4*2]     
%define p3 [esi+4*3]
     
%define t ecx
%define x eax
%define y ecx
     
%define LEA128_RNDS 24
     
lea128_encryptx:
_lea128_encryptx:
    pushad   
    ; initialize 4 constants
    mov    edi, 0xc3efe9db   ; td[0]
    mov    esi, 0x88c4d604   ; td[1]
    mov    ebp, 0xe789f229   ; td[2]
    pushad
    mov    dword[esp+_esp], 0xc6f98763   ; td[3] 
    mov    esi, [esp+64+4]   ; esi = key
    ; load key
    lodsd
    xchg   eax, k0
    lodsd
    xchg   eax, k1
    lodsd
    xchg   eax, k2
    lodsd
    xchg   eax, k3
    mov    esi, [esp+64+8]   ; esi = block
    xor    eax, eax          ; i = 0    
lea_l0:
    push   eax
    and    al, 3
    mov    t, [esp+eax*4+4]  ; t = td[i % 4]
    rol    t, 4
    mov    [esp+eax*4+4], t  ; td[i & 3] = ROTL32(t, 4);
    ror    t, 4
    
    ; **************************************
    ; create subkey
    ; **************************************
    ; k0 = ROTL32(k0 + t, 1);
    add    k0, t             
    rol    k0, 1
    
    ; k1 = ROTL32(k1 + ROTL32(t, 1),  3);
    rol    t, 1              
    add    k1, t
    rol    k1, 3
    
    ; k2 = ROTL32(k2 + ROTL32(t, 2),  6);
    rol    t, 1              
    add    k2, t
    rol    k2, 6
    
    ; k3 = ROTL32(k3 + ROTL32(t, 3), 11);
    rol    t, 1              
    add    k3, t
    rol    k3, 11
    
    ; **************************************
    ; encrypt block
    ; **************************************
    ; p3 = ROTR32((p2 ^ k3) + (p3 ^ k1), 3);
    mov    x, p2             
    mov    y, p3
    xor    x, k3
    xor    y, k1
    add    x, y
    ror    x, 3
    mov    p3, x
    
    ; p2 = ROTR32((p1 ^ k2) + (p2 ^ k1), 5);
    mov    x, p1             
    mov    y, p2
    xor    x, k2
    xor    y, k1
    add    x, y
    ror    x, 5
    mov    p2, x
    
    ; p1 = ROTL32((p0 ^ k0) + (p1 ^ k1), 9);
    mov    x, p0             
    mov    y, p1
    xor    x, k0
    xor    y, k1
    add    x, y
    rol    x, 9
    mov    p1, x
    
    ; rotate block 32-bits
    mov    t, p3
    xchg   t, p2             ; XCHG(p0, p1); 
    xchg   t, p1             ; XCHG(p1, p2);
    xchg   t, p0             ; XCHG(p2, p3);
    mov    p3, t
    
    pop    eax
    inc    eax               ; i++
    cmp    al, LEA128_RNDS
    jnz    lea_l0
    
    popad
    popad
    ret

Summary

The assembly generated by MSVC using /O2 /Os is 232 bytes for the single call (encryption only) and 266 for 2 separate functions. The hand written x86 assembly for just the single call is currently 161 bytes.

See lx.asm for x86 assembly or lea.c for C source

Thanks to 0x4d_ for \LaTeX formulas.

Posted in assembly, cryptography, encryption, programming | Tagged , , , | Leave a comment

SM3 Cryptographic Hash Algorithm (Chinese Standard)

Introduction

In December of 2007, the Chinese National Cryptographic Administration Bureau released the specification of a Trusted Cryptography Module, detailing a cryptoprocessor to be used within the Trusted Computing framework in China.

The module specifies a set of cryptographic algorithms that include the SM4 block cipher, the SM2 asymmetric algorithm and SM3 cryptographic hash algorithm.

I’ve discussed SM4 in an earlier post here.

SM3 uses a Merkle-Damgard construction similar to MD5 or SHA-2 that processes 512-bit input message blocks and returns a 256-bit hash value.

It was designed by Mathematician and cryptographer Xiaoyun Wang who is responsible for discovering attacks against many cryptographic hash functions, most notably MD5 and SHA-1.

At the CRYPTO 2004 conference, she and co-authors demonstrated collision attacks against MD5 and SHA-0. Attacks against SHA-1 were published later in 2005.

Unfortunately there’s no explanation for how the changes made to SHA-2 strengthen it against attacks so I’ll briefly cover some of the minor differences between SM3 and SHA-2

Initialization

In the original SHA-2 specification, the 8 initialization values are the first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19). See here for more detailed description of how to create these.

H_0^{(0)}\: =\: 6a09e667
H_1^{(0)}\: =\: bb67ae85
H_2^{(0)}\: =\: 3c6ef372
H_3^{(0)}\: =\: a54ff53a
H_4^{(0)}\: =\: 510e527f
H_5^{(0)}\: =\: 9b05688c
H_6^{(0)}\: =\: 1f83d9ab
H_7^{(0)}\: =\: 5be0cd19

In SM3, the following 8 values are used:

H_0^{(0)}\: =\: 7380166f
H_1^{(0)}\: =\: 4914b2b9
H_2^{(0)}\: =\: 172442d7
H_3^{(0)}\: =\: da8a0600
H_4^{(0)}\: =\: a96f30bc
H_5^{(0)}\: =\: 163138aa
H_6^{(0)}\: =\: e38dee4d
H_7^{(0)}\: =\: b0fb0e4e

Unfortunately, there’s no explanation for why or how these values were selected.

init

Updating and Finalization

The update and final processing are exactly the same as SHA-2 which is based on the original design for MD4 by Ron Rivest. I will not explain here since the main changes take place inside the compression function.

Message Expansion

The 512-bit message is expanded into 68 32-bit words. The following 2 linear functions are defined:

P_1 for expansion, P_0 for compression

P_0(X) = X\oplus (X\lll 9)\oplus (X\lll 17)
P_1(X) = X\oplus (X\lll 15)\oplus (X\lll 23)

p_aux

\text{for}\: i=16\: \text{to}\: 67\\  \hphantom{---}W_i\leftarrow P_1(W_{i-16}\oplus W_{i-9}\oplus(W_{i-3}\lll 15))\oplus (W_{i-13}\lll 7)\oplus   W_{i-6}\\  \text{endfor}

\text{for}\: i = 0\: \text{to}\: 63\\  \hphantom{---}W_i'=W_i\oplus W_{i+4}\\  \text{endfor}

exp_code

The additional 4 words in SM3 are required for a separate expansion that I’ve inlined with the compression function to try minimize code usage.

W_i'=W_i\oplus W_{i+4}\: 0\leq i < 64.

You can see in main compression loop while creating TT1 and TT2.

exp_code2

Compression

For SHA-2, 64 words are used during compression which are described here in detail.

Both use the same sequence of sixty-four constant 32-bit words, K0{256}, K1{256}, …,K63{256} These words represent the first thirty-two bits of the fractional parts of the cube roots of the first sixty-four prime numbers. In hex, these constant words are (from left to right)

For SM3, only 2 constants are used. One is selected depending on the position of loop counter, then a circular shift by loop counter is applied before adding into the state. This certainly helps make SM3 more compact than SHA-2 although I can’t say if it strengthens in some way.

T_i =  \begin{cases}  79cc4519 & 0\leq i \leq 15 \\[1ex]  7a879d8a & 16\leq i \leq 63  \end{cases}

t_code

There are 3 boolean operations used during the compression which are similar to bitwise majority MAJ and bitwise choice CH operations in SHA-2.

FF_i(X,Y,Z) =  \begin{cases}  X\oplus Y\oplus Z  & 0\leq i \leq 15 \\[1ex]  (X\wedge Y)\vee (X\wedge Z)\vee (Y\wedge Z) & 16\leq i \leq 63  \end{cases}

GG_i(X,Y,Z) =  \begin{cases}  X\oplus Y\oplus Z  & 0\leq i \leq 15 \\[1ex]  (X\wedge Y) \vee (\neg X \wedge Z) & 16\leq i \leq 63  \end{cases}

The GG operation is changed to use less instructions, the original is.

orig_bool

The new one along with others.

bool_code

ABCDEFGH\leftarrow V^{(i)}
\begin{aligned}  \text{for}\; & i= 0\; \text{to}\: 63\\  & SS_1\leftarrow ((A\lll 12)+E+(T_i\lll i))\lll 7\\  & SS_2\leftarrow SS_1\oplus(A\lll 12)\\  & TT_1\leftarrow FF_i(A,B,C)+D+SS_2+W_i'\\  & TT_2\leftarrow GG_i(E,F,G)+H+SS_1+W_i\\  & D\leftarrow C\\  & C\leftarrow B\lll 9\\  & B\leftarrow A\\  & A\leftarrow TT_1\\  & H\leftarrow G\\  & G\leftarrow F\lll 19\\  & F\leftarrow E\\  & E\leftarrow P_0(TT_2)\\  \text{endfor}&\;  \end{aligned}

Updating state

At the end of compression function in SHA-2, the input hash is updated by adding the output from compression.

H_0^{(i)}=a+H_0^{(i-1)}
H_1^{(i)}=b+H_1^{(i-1)}
H_2^{(i)}=c+H_2^{(i-1)}
H_3^{(i)}=d+H_3^{(i-1)}
H_4^{(i)}=e+H_4^{(i-1)}
H_5^{(i)}=f+H_5^{(i-1)}
H_6^{(i)}=g+H_6^{(i-1)}
H_7^{(i)}=h+H_7^{(i-1)}

In SM3, it is instead XOR’d which follows a Davies-Meyer idea for compression functions.

H_0^{(i)}=a\oplus H_0^{(i-1)}
H_1^{(i)}=b\oplus H_1^{(i-1)}
H_2^{(i)}=c\oplus H_2^{(i-1)}
H_3^{(i)}=d\oplus H_3^{(i-1)}
H_4^{(i)}=e\oplus H_4^{(i-1)}
H_5^{(i)}=f\oplus H_5^{(i-1)}
H_6^{(i)}=g\oplus H_6^{(i-1)}
H_7^{(i)}=h\oplus H_7^{(i-1)}

The output state from compression is XORed with the previous hash value to produce the next hash value. In the first round when there is no previous hash value it uses a constant pre-specified initial values.

H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}.

Here’s the full compression function.

void SM3_Transform (SM3_CTX *ctx) 
{
    uint32_t tt1, tt2, i, t, ss1, ss2, x, y;
    uint32_t w[68], s[8];

    #define a s[0]
    #define b s[1]
    #define c s[2]
    #define d s[3]
    #define e s[4]
    #define f s[5]
    #define g s[6]
    #define h s[7]
    
    // load state into local buffer
    memcpy((uint8_t*)&s[0], (uint8_t*)&ctx->s.w[0], 8*4);
    
    // load data in big endian format
    for (i=0; i<16; i++) {
      w[i] = SWAP32(ctx->buf.w[i]);
    }

    // expand message
    for (i=16; i<68; i++) {
      x = ROTL32(w[i- 3], 15);
      y = ROTL32(w[i-13],  7);
      
      x ^= w[i-16];
      x ^= w[i- 9];
      y ^= w[i- 6];
      
      w[i] = P1(x) ^ y; 
    }

    // compression function
    for (i=0; i<64; i++) {
      t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;
      
      ss2 = ROTL32(a, 12);      
      ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
      ss2 ^= ss1;
      
      tt1 = d + ss2 + (w[i] ^ w[i+4]);
      tt2 = h + ss1 + w[i];
      
      if (i < 16) {
        tt1 += F(a, b, c);
        tt2 += F(e, f, g);
      } else {
        tt1 += FF(a, b, c);
        tt2 += GG(e, f, g);       
      }
      d = c;
      c = ROTL32(b, 9);
      b = a;
      a = tt1;
      h = g;
      g = ROTL32(f, 19);
      f = e;
      e = P0(tt2); 
    }
    
    // Davies–Meyer idea for compression function
    for (i=0; i<8; i++) {
      ctx->s.w[i] ^= s[i];
    }    
    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
    #undef g
    #undef h
}

And assembly just to illustrate

%define _a dword[edi+0*4]
%define _b dword[edi+1*4]
%define _c dword[edi+2*4]
%define _d dword[edi+3*4]
%define _e dword[edi+4*4]
%define _f dword[edi+5*4]
%define _g dword[edi+6*4]
%define _h dword[edi+7*4]

_SM3_Transformx:
    pushad

    ; load state into esi
    lea     esi, [ebx+state]
    push    esi  ; save for later

    ; allocate 512 bytes
    push    64
    pop     ecx
    shl     ecx, 3
    sub     esp, ecx

    ; load state into local buffer
    mov     edi, esp
    push    8
    pop     ecx
    rep     movsd

    ; store message in big endian format
    mov     cl, 16
s3t_l0:
    lodsd
    bswap   eax
    stosd
    loop    s3t_l0

    ; expand message
    mov     cl, 68-16
s3t_l1:
    ; x = ROTL32(w[i-3], 15);
    mov     eax, [edi- 3*4]
    rol     eax, 15
    ; y = ROTL32(w[i-13], 7);
    mov     ebx, [edi-13*4]
    rol     ebx, 7
    ; x ^= w[i-16];
    xor     eax, [edi-16*4]
    ; x ^= w[i-9];
    xor     eax, [edi- 9*4]
    ; y ^= w[i-6];
    xor     ebx, [edi- 6*4]
    ; x ^= ROTL32(x, 15) ^ ROTL32(x, 23);
    mov     edx, eax
    rol     edx, 15
    xor     eax, edx
    rol     edx, 23-15
    xor     eax, edx
    ; x ^= y;
    xor     eax, ebx
    ; w[i] = x;
    stosd
    loop    s3t_l1

    ; permute message
    mov     edi, esp
s3t_l2:
    ; t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;
    cmp     ecx, 16
    sbb     eax, eax
    and     eax, 0x79cc4519 - 0x7a879d8a
    add     eax, 0x7a879d8a
    ; ss2 = ROTL32(a, 12);
    mov     ebx, _a
    rol     ebx, 12
    ; ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
    rol     eax, cl
    add     eax, ebx
    add     eax, _e
    rol     eax, 7
    ; ss2 ^= ss1;
    xor     ebx, eax
    ; tt1 = d + ss2 + (w[i] ^ w[i+4]);
    mov     ebp, eax         ; save ss1
    mov     esi, [edi+4*ecx+32]       ; esi = w[i]
    mov     edx, esi         ; save w[i]
    xor     esi, [edi+4*ecx+32+16]    ; esi ^= w[i+4]
    add     esi, _d
    lea     eax, [esi+ebx]   ; set tt1
    ; tt2 = h + ss1 + w[i];
    lea     ebx, [edx+ebp]
    add     ebx, _h
    ; if (i < 16) {
    cmp     ecx, 16
    jae     s3t_l3
    ; tt1 += F(a, b, c);
    mov     edx, _c
    xor     edx, _b
    xor     edx, _a
    add     eax, edx
    ; tt2 += F(e, f, g);
    mov     edx, _g
    xor     edx, _f
    xor     edx, _e
    add     ebx, edx
    jmp     s3t_l4
s3t_l3:
    ; tt1 += FF(a, b, c);
    mov     edx, _b
    or      edx, _a
    mov     ebp, _b
    and     edx, _c
    and     ebp, _a
    or      edx, ebp
    add     eax, edx
    ; tt2 += GG(e, f, g);
    mov     edx, _g
    xor     edx, _f
    and     edx, _e
    xor     edx, _g
    add     ebx, edx
s3t_l4:
    ; d = c;
    mov     edx, _c
    mov     _d, edx
    ; c = ROTL32(b, 9);
    mov     edx, _b
    rol     edx, 9
    mov     _c, edx
    ; b = a;
    mov     edx, _a
    mov     _b, edx
    ; a = tt1;
    mov     _a, eax
    ; h = g;
    mov     edx, _g
    mov     _h, edx
    ; g = ROTL32(f, 19);
    mov     edx, _f
    rol     edx, 19
    mov     _g, edx
    ; f = e;
    mov     edx, _e
    mov     _f, edx
    ; e = P0(tt2);
    ; e = x ^ ROTL32(x,  9) ^ ROTL32(x, 17)
    mov     edx, ebx
    rol     edx, 9
    xor     ebx, edx
    rol     edx, 17-9
    xor     ebx, edx
    mov     _e, ebx

    inc     ecx
    cmp     ecx, 64
    jnz     s3t_l2

    mov     esi, esp
    lea     esp, [esi+ecx*8]

    pop     edi
    mov     cl, 8
s3t_l5:
    lodsd
    xor     eax, [edi]
    stosd
    loop    s3t_l5
    popad
    ret

Summary

Size of C generated assembly using /O2 /Os switches is approx. 609 bytes. The x86 assembly is approx. 470 bytes. See sources here.

Thanks to 0x4d_ for \LaTeX equations.

Posted in assembly, cryptography, programming, security | Tagged , , , , | Leave a comment

Chaskey-LTS Block Cipher

Introduction

The Chaskey cipher is a 128-bit block, 128-bit key symmetric encryption algorithm which is the underlying function used for the Chaskey Message Authentication Code (MAC).

It’s based on an Even-Mansour construction which makes it very simple to implement and because of its permutation function derived from SipHash, using only ARX instructions makes it highly suitable for many architectures.

Shimon Even and Yishay Mansour published a paper in 1997 titled A Construction of a Cipher From a Single Pseudorandom Permutation which suggested an incredibly simple but provably secure design for a cryptographic algorithm.

even_mansour

Key Whitening, a technique invented by Ron Rivest in 1984 is performed before and after the application of F which is a publicly known permutation function.

Key Whitening consists of using a simple XOR operation of the key with data which is intended to increase resistance to brute force attack and is used in many modern ciphers today.

F function

The permutation function uses 16 rounds of ADD/ROL/XOR (ARX) instructions for encryption and is derived from the SipHash algorithm.

SipHash is a fast short-input Pseudo-Random-Function(PRF) designed and published in 2012 by Jean-Philippe Aumasson and Daniel J. Bernstein.

The decryption of ciphertext is simply reversing the permutation function using SUB/ROR/XOR.

perm

void chas_encrypt(int enc, void *key, void *buf) 
{
   int      i;
   uint32_t *v=(uint32_t*)buf;
   uint32_t *k=(uint32_t*)key;
   
   // pre-whiten
   for (i=0; i<4; i++) {
     v[i] ^= k[i];
   }

   // apply permutation function
   for (i=0; i<16; i++) {
     if (enc==CHASKEY_ENCRYPT)
     {
       v[0] += v[1]; 
       v[1]=ROTL32(v[1], 5); 
       v[1] ^= v[0]; 
       v[0]=ROTL32(v[0],16);       
       v[2] += v[3]; 
       v[3]=ROTL32(v[3], 8); 
       v[3] ^= v[2];
       v[0] += v[3]; 
       v[3]=ROTL32(v[3],13); 
       v[3] ^= v[0];
       v[2] += v[1]; 
       v[1]=ROTL32(v[1], 7); 
       v[1] ^= v[2]; 
       v[2]=ROTL32(v[2],16);
     } else {     
       v[2]=ROTR32(v[2],16);
       v[1] ^= v[2];
       v[1]=ROTR32(v[1], 7);
       v[2] -= v[1];
       v[3] ^= v[0];
       v[3]=ROTR32(v[3],13);
       v[0] -= v[3];
       v[3] ^= v[2];
       v[3]=ROTR32(v[3], 8);
       v[2] -= v[3];
       v[0]=ROTR32(v[0],16);
       v[1] ^= v[0];
       v[1]=ROTR32(v[1], 5);
       v[0] -= v[1];
     }
   }
   // post-whiten
   for (i=0; i<4; i++) {
     v[i] ^= k[i];
   }
}

The assembly is straight forward. We load buffer into ESI, key into EDI and enc into ECX. Load 4 32-bit registers with 128-bit data, apply pre-whitening with 128-bit key. Test ECX for zero, then save flag status with PUSHFD. This then frees ECX to use as a loop counter which is set to 16 (for LTS).

After each round of permutation, restore the flag status with POPFD and keep looping until ECX is zero. Finally apply post-whitening using 128-bit key, save and return.

%define v0 eax
%define v1 ebx
%define v2 edx
%define v3 ebp

chas_encryptx:
_chas_encryptx:
    pushad
    lea     esi, [esp+32+4]
    lodsd
    xchg    ecx, eax          ; ecx = enc
    lodsd
    xchg    edi, eax          ; edi = key
    lodsd
    xchg    eax, esi          ; esi = buf
    push    esi
    ; load buf
    lodsd
    xchg    eax, v3
    lodsd
    xchg    eax, v1
    lodsd
    xchg    eax, v2
    lodsd
    xchg    eax, v3
    ; pre-whiten
    xor     v0, [edi   ]
    xor     v1, [edi+ 4]
    xor     v2, [edi+ 8]
    xor     v3, [edi+12]
    test    ecx, ecx
    mov     cl, 16
ck_l0:
    pushfd
    jz      ck_l1
    ; encrypt
    add     v0, v1
    rol     v1, 5
    xor     v1, v0
    rol     v0, 16
    add     v2, v3
    rol     v3, 8
    xor     v3, v2
    add     v0, v3
    rol     v3, 13
    xor     v3, v0
    add     v2, v1
    rol     v1, 7
    xor     v1, v2
    rol     v2, 16
    jmp     ck_l2
ck_l1:
    ; decrypt
    ror     v2, 16
    xor     v1, v2
    ror     v1, 7
    sub     v2, v1
    xor     v3, v0
    ror     v3, 13
    sub     v0, v3
    xor     v3, v2
    ror     v3, 8
    sub     v2, v3
    ror     v0, 16
    xor     v1, v0
    ror     v1, 5
    sub     v0, v1
ck_l2:
    popfd
    loop    ck_l0
ck_l3:
    ; post-whiten
    xor     v0, [edi   ]
    xor     v1, [edi+ 4]
    xor     v2, [edi+ 8]
    xor     v3, [edi+12]
    pop     edi
    ; save buf
    stosd
    xchg    eax, v1
    stosd
    xchg    eax, v2
    stosd
    xchg    eax, v3
    stosd
    popad
    ret

Summary

C code is approx. 185 bytes using /O2 /Os flags. Assembly is currently 132 bytes although It’s difficult to say if that could be reduced any further.

See sources here

Posted in assembly, cryptography, encryption, programming, security | Tagged , , | Leave a comment

SM4 block cipher (Chinese Standard for WAPI)

Introduction

SM4 (formerly SMS4) is a 128-bit block cipher with a 128-bit user key and 32 rounds. It’s used in the WLAN Authentication and Privacy Infrastructure (WAPI), a Chinese WLAN national standard.

It was published or rather declassified in 2006 and standardized in 2012.

I’m not clear what the SM stands for. There are other specifications like SM2 which describes Elliptic Curve algorithms for digital signatures and key exchange. SM3 which is a cryptographic hash algorithm and of course, SM4 a symmetric block cipher which I’ll implement here in C and x86 assembly optimized for size.

The only other specification to mention is SM9 which documents a set of identity-based cryptographic schemes from pairings.

English translations of the specifications for SM2, SM3 have been provided by Sean Shen and Xiaodong Lee at China Internet Network Information Center (CNNIC) a non-profit organization based in China.

But the only english translation for SM4 was by Whitfield Diffie and Dr. George Ledin.

About the C code

If you want high performance implementations of SM4 in C and x86/x86-x64 assembly, please look at GMSSL which appears to be a fork of OpenSSL but includes SM2, SM3 SM4 and SM9.

Mixer-substitution T

T is a substitution that generates 32 bits of output from 32 bits of input. Source code includes the full 256-byte array which obviously increases the size of code considerably.

  • Non-Linear function

(b_0, b_1, b_2, b_3) = \tau (A) = (Sbox(a_0),Sbox(a_1),Sbox(a_2),Sbox(a_3))

The Sbox is a 256-byte array and there’s no description of how these elements were chosen. If anyone knows, please leave a comment.

sbox

  • Linear function (for key setup)

L'(B) = B \oplus (B \lll 13)\oplus (B \lll 23)

  • Linear function (for encryption)

C=L(B)=B\oplus(B\lll 2)\oplus (B\lll 10)\oplus (B\lll 18)\oplus(B\lll 24)

t_function

t_l0:
    pop    ebx
    xor    eax, x1
    xor    eax, x2
    xor    eax, x3
    ; apply non-linear substitution
    mov    cl, 4
t_l1:    
    xlatb
    ror    eax, 8
    loop   t_l1
    mov    ebx, eax
    mov    ecx, eax
    mov    edx, eax
    mov    ebp, eax
    ; apply linear substitution
    popfd
    jc     t_l2
    ; for key setup
    rol    ebx, 13
    rol    ecx, 23
    xor    eax, ebx
    xor    eax, ecx
    jmp    t_l3
t_l2:    
    ; for encryption
    rol    ebx, 2
    rol    ecx, 10
    rol    edx, 18
    rol    ebp, 24
    
    xor    eax, ebx
    xor    eax, ecx
    xor    eax, edx
    xor    eax, ebp
t_l3:
    mov    [esp+_eax], eax    
    popad
    ret
    
; in:  eax
; out: eax  
T_function:
    pushad
    pushfd
    call   t_l0  ; pushes address of sbox on stack
    ; sbox for SM4 goes here

The round function F

F(X_0,X_1,X_2,X_3,rk)=X_0\oplus T(X_1\oplus X_2\oplus X_3\oplus rk)

The value of 1 in last parameter to T function tells it to use the linear function for encryption. In x86 assembly, I use the Carry Flag (CF) setting or clearing with STC and CLC instructions.

f_code

The constant parameter CK

(ck_{i,0},ck_{i,1},ck_{i,2},ck_{i,3}) \in \Big(Z_2^8\Big)^4,\quad then\quad ck_{ij}=(4i+j)\times 7\: (mod \: 256)

You can include the precomputed array.

ck_constants

Or generate at runtime using some code.

ck_code

; expects ecx to hold index
; returns constant in eax
CK:
    pushad
    xor    eax, eax          ; ck = 0
    cdq                      ; j  = 0
ck_l0: 
    shl    eax, 8            ; ck <<= 8
    lea    ebx, [ecx*4+edx]  ; ebx = (i*4) + j
    imul   ebx, ebx, 7       ; ebx *= 7
    or     al, bl            ; ck |= ebx %= 256
    inc    edx               ; j++
    cmp    edx, 4            ; j<4
    jnz    ck_l0
    mov    [esp+_eax], eax   ; return ck
    popad
    ret

Key setup

setkey

sm4_setkeyx:
_sm4_setkeyx:
    pushad
    mov    edi, [esp+32+4]  ; edi = ctx
    mov    esi, [esp+32+8]  ; esi = 128-bit key
    ; load the key
    lodsd
    bswap  eax
    xchg   eax, rk0
    lodsd
    bswap  eax
    xchg   eax, rk1
    lodsd
    bswap  eax
    xchg   eax, rk2
    lodsd
    bswap  eax
    xchg   eax, rk3
    
    ; xor FK values
    xor    rk0, 0xa3b1bac6    
    xor    rk1, 0x56aa3350    
    xor    rk2, 0x677d9197    
    xor    rk3, 0xb27022dc
    xor    ecx, ecx
sk_l1:    
    call   CK
    clc
    call   T_function 
    xor    rk0, eax
    mov    eax, rk0
    stosd                ; rk[i] = rk0
    xchg   rk0, rk1
    xchg   rk1, rk2
    xchg   rk2, rk3
    inc    ecx           ; i++
    cmp    ecx, 32
    jnz    sk_l1       
    popad
    ret

Encryption

encrypt

sm4_encryptx:
_sm4_encryptx:
    pushad
    mov    edi, [esp+32+4] ; edi = ctx
    mov    esi, [esp+32+8] ; esi = buf
    push   esi ; save buffer for later
    
    ; load data
    lodsd
    bswap  eax
    xchg   eax, x0
    lodsd
    bswap  eax    
    xchg   eax, x1
    lodsd
    bswap  eax    
    xchg   eax, x2
    lodsd
    bswap  eax    
    xchg   eax, x3
    
    ; do 32 rounds
    push   32
    pop    ecx
e_l0:
    ; apply F round
    mov    eax, [edi] ; rk[i]
    scasd
    stc
    call   T_function     
    xor    x0, eax
    xchg   x0, x1
    xchg   x1, x2
    xchg   x2, x3
    loop   e_l0
    
    ; save data
    pop    edi
    xchg   eax, x3
    bswap  eax    
    stosd
    xchg   eax, x2
    bswap  eax    
    stosd
    xchg   eax, x1
    bswap  eax    
    stosd
    xchg   eax, x0
    bswap  eax    
    stosd    
    popad
    ret

Summary

If there was a way to generate the sbox on the fly, this could be a good cipher for resource constrained devices. The size of code using /O2 /Os switches resulted in 690 bytes. The assembly for just encryption is approx. 500 bytes.

If you want to contribute or just access full source code, see here

Thanks to 0x4d_ for \LaTeX formulas.

Posted in assembly, cryptography, encryption, programming | Tagged , , , , , , | 1 Comment

CubeMAC128 Message Authentication Code

Introduction

CubeMAC128 is a cryptographic Message Authentication Code (MAC) designed for packet authentication that was proposed in 2010 by mathematician and cryptographer Daniel J. Bernstein.

The CubeMAC proposal was in response to NIST concerns about using CubeHash as a MAC function for small messages. The parameters for MAC were 16 initial rounds, a 32-byte block and 32 final rounds. The recommended key length is 512-bits or 64-bytes.

You can find a detailed discussion about it in CubeHash parameter tweak: 10 times smaller MAC overhead

Permutation Function

Like other designs by Dan, the pseudo-random-function used to provide properties of confusion and diffusion only uses Add-Rotate-Xor operations which makes it suitable for many different architectures.

The following is taken from reference code.

static void transform(hashState *state)
{
  int i;
  int r;
  crypto_uint32 y[16];

  for (r = 0;r < CUBEHASH_ROUNDS;++r) {
    for (i = 0;i < 16;++i) state->x[i + 16] += state->x[i];
    for (i = 0;i < 16;++i) y[i ^ 8] = state->x[i];
    for (i = 0;i < 16;++i) state->x[i] = ROTATE(y[i],7);
    for (i = 0;i < 16;++i) state->x[i] ^= state->x[i + 16];
    for (i = 0;i < 16;++i) y[i ^ 2] = state->x[i + 16];
    for (i = 0;i < 16;++i) state->x[i + 16] = y[i];
    for (i = 0;i < 16;++i) state->x[i + 16] += state->x[i];
    for (i = 0;i < 16;++i) y[i ^ 4] = state->x[i];
    for (i = 0;i < 16;++i) state->x[i] = ROTATE(y[i],11);
    for (i = 0;i < 16;++i) state->x[i] ^= state->x[i + 16];
    for (i = 0;i < 16;++i) y[i ^ 1] = state->x[i + 16];
    for (i = 0;i < 16;++i) state->x[i + 16] = y[i];
  }
}

The changes made are to obviously reduce code at the expense of performance but not intentionally. Rather than move upwards in steps of one for variable i, it’s set to 15 and moved down until less than zero.

This should eliminate a CMP opcode and depends on Sign status flag (SF) to indicate when to end loop.

// permutation function
void permute(cube_state *s)
{
    int      i, j, k, n;
    uint32_t y[16];

    for (n=16; n>0; n--) 
    {
      for (k=7, j=2; j>0; k+=4, j--)
      {
        for (i=15; i>=0; --i) s->w[i + 16] += s->w[i];
        for (i=15; i>=0; --i) y[i ^ (j*4)] = s->w[i];
        for (i=15; i>=0; --i) s->w[i] = ROTL32(y[i], k);
      
        for (i=15; i>=0; --i) s->w[i] ^= s->w[i + 16];
        for (i=15; i>=0; --i) y[i ^ j] = s->w[i + 16];
        for (i=15; i>=0; --i) s->w[i + 16] = y[i];
      }
    }
}

The assembly works similar except we’re using the LOOP instruction quite a lot with PUSHAD/POPAD.

; ==================================
      ; permutation function
      ;
      ; edi = s
      pushad                   ; save registers
      pushad                   ; allocate 32-bytes
      pushad                   ; allocate 32-bytes
      mov    esi, esp          ; esi = y[16]
      ; for (n=16; n>0; n--)
      push   16
      pop    ecx
      ; for (k=7, j=2; j>0; k+=4, j--)
pm_l0:
      push   ecx               ; save n
      mov    cl, 16
      push   2
      pop    ebp               ; j=2
      push   7
      pop    edx               ; k=7
pm_l1:
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   s->w[i + 16] += s->w[i];
      ; **************************
      pushad
pm_l2:
      mov    eax, [edi]
      add    [edi+64], eax
      scasd
      loop   pm_l2
      popad
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   y[i ^ (j*4)] = s->w[i];
      ; **************************
      pushad
      shl    ebp, 2
pm_l3:
      lea    ebx, [ecx-1]
      mov    eax, [edi+ebx*4]
      xor    ebx, ebp
      mov    [esi+ebx*4], eax
      loop   pm_l3
      popad
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   s->w[i] = ROTL32(y[i], k);
      ; **************************
      pushad
      xchg   ecx, edx
pm_l4:
      lodsd
      rol    eax, cl
      stosd
      dec    edx
      jnz    pm_l4
      popad
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   s->w[i] ^= s->w[i + 16];
      ; **************************
      pushad
pm_l5:
      mov    eax, [edi]
      xor    eax, [edi+64]
      stosd
      loop   pm_l5
      popad
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   y[i ^ j] = s->w[i + 16];
      ; **************************
      pushad
pm_l6:
      lea    ebx, [ecx-1]
      mov    eax, [edi+ebx*4+64]
      xor    ebx, ebp
      mov    [esi+ebx*4], eax
      loop   pm_l6
      popad
      ; **************************
      ; for (i=15; i>=0; --i)
      ;   s->w[i + 16] = y[i];
      ; **************************
      pushad
      add    edi, 64
      rep    movsd
      popad

      add    edx, 4          ; k += 4
      dec    ebp             ; j--
      jnz    pm_l1

      pop    ecx
      loop   pm_l0           ; will set CF to 0

      popad                  ; release 32-bytes
      popad                  ; release 32-bytes
      popad                  ; restore registers
      ret

Sponge Function

We absorb 32-byte blocks of data into the state before applying the permutation function.

// absorb data into the state
uint32_t absorb(cube_state *s, 
    const void *msg, uint32_t len)
{
    uint32_t i, idx=0;
    uint8_t  *p=(uint8_t*)msg;
    
    for (i=0; i<len; i++) {
      s->b[idx++] ^= p[i];
      if (idx == 32) {
        permute(s);
        idx = 0;
      }
    }  
    return idx;
}

We inline this to minimize the overall size of code.

;
      ; ==================================
      ; absorb data into state
      ;
      ; ebx = data
      ; ecx = len
      ; edi = s
absorb:
      xor    eax, eax      ; idx = 0
      jecxz  abs_l1        ; exit if len == 0
abs_l0:
      mov    dl, [ebx]
      xor    [edi+eax], dl ; s->b[idx] ^= *data
      inc    eax           ; idx++
      inc    ebx           ; data++
      cmp    al, 32        ; absorbed block?
      loopne abs_l0        ; while (al != 32 && ecx != 0)
      jne    abs_l1        ; if (al != 32 && ecx == 0) goto abs_l1
      call   ebp           ; permute(s)
      jmp    absorb        ; keep going
abs_l1:
      popfd
      cmc                  ; CF = !CF 
      jc     cm_l0         ; loop twice

MAC Function

This is purely intended for small messages; authenticating network packets for example. It takes a 512-bit key, a message and returns 128-bit MAC.

The length of message shouldn’t exceed 4GB but that should be obvious. I was naturally thinking about packets of 512-bytes or less 🙂

// cube message authentication code
void cube_macx(const void *mkey, uint32_t keylen,
    const void *msg, uint32_t msglen, void *tag)
{
    uint32_t   idx;  
    cube_state s;
    
    // 1. initialize state
    memset(&s, 0, sizeof(cube_state));

    s.w[0] = 16; // 16-byte output
    s.w[1] = 32; // 32-byte block size
    s.w[2] = 16; // 16 rounds per block
    
    permute(&s);
    
    // 2. absorb key
    absorb (&s, mkey, 64);
    
    // 3. absorb message
    idx = absorb(&s, msg, msglen);

    // 4. absorb end bit
    s.b[idx] ^= 0x80;
    permute(&s);
    
    // 5. absorb final bit
    s.w[31] ^= 1;
    
    permute(&s);
    permute(&s);
    
    // 6. return 128-bit tag
    memcpy(tag, s.b, 16);  
}

So here’s the rest of assembly code with permutation function stripped out.

cube_macx:
_cube_macx:
      pushad
      call   ld_pm
      ; permutation function goes here
ld_pm:
      pop    ebp           ; ebp = permute
      lea    esi, [esp+32+4]
      xor    eax, eax
      xor    ecx, ecx
      mov    cl, 128
      sub    esp, ecx
      ; 1. initialize local state
      ; memset(&s, 0, sizeof(cube_state));
      mov    edi, esp
      rep    stosb

      mov    edi, esp
      push   edi
      ; s.w[0] = 16;
      mov    al, 16
      stosd
      ; s.w[1] = 32;
      mov    al, 32
      stosd
      ; s.w[2] = 16;
      mov    al, 16
      stosd
      pop    edi
      ; permute(&s);
      call   ebp
      ; 2. absorb key
      ; 3. absorb message
cm_l0:
      pushfd      
      lodsd
      xchg   ebx, eax        ; ebx = data
      lodsd
      xchg   ecx, eax        ; ecx = len
      ; ==================================
      ; absorb data into state
      ;
      ; ebx = data
      ; ecx = len
      ; edi = s
absorb:
      xor    eax, eax      ; idx = 0
      jecxz  abs_l1        ; exit if len == 0
abs_l0:
      mov    dl, [ebx]
      xor    [edi+eax], dl ; s->b[idx] ^= *data
      inc    eax           ; idx++
      inc    ebx           ; data++
      cmp    al, 32        ; absorbed block?
      loopne abs_l0        ; while (al != 32 && ecx != 0)
      jne    abs_l1        ; if (al != 32 && ecx == 0) goto abs_l1
      call   ebp           ; permute(s)
      jmp    absorb        ; keep going
abs_l1:
      popfd
      cmc                  ; CF = !CF 
      jc     cm_l0         ; loop twice
      
      ; 4. absorb end bit
      xor    byte[edi+eax], 0x80
      call   ebp

      ; 5. absorb final bit
      xor    byte[edi+31*4], 1
      call   ebp
      call   ebp

      ; 6. return 128-bit tag
      lodsd
      xchg   eax, edi        ; edi = tag
      xchg   eax, esi        ; esi = s
      mov    cl, 16
      rep    movsb

      ; release stack
      pop    eax
      add    esp, 124
      popad
      ret

Summary

The size of C code using /Os /O2 switches was approx. 380 bytes. The assembly code is currently 197 bytes which I doubt can be reduced much more.

I’ve documented Poly1305 here which is another MAC function designed by the same author.

If I had to choose a compact function for authentication of small packets, I’d probably pick CubeMAC instead although it hasn’t received as much scrutiny by the cryptographic community as Poly1305.

Posted in assembly, cryptography, encryption, programming, security | Tagged , , , , , | Leave a comment

Speck Block Cipher

Introduction

Speck is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June 2013. It’s an ARX (add-rotate-xor) design optimized for performance in software implementations and has been suggested for use on resource constrained devices.

Speck supports a variety of block and key sizes. A block is always two words, but the words may be 16, 24, 32, 48 or 64 bits in size. The corresponding key is 2, 3 or 4 words. The round function consists of two rotations, adding the right word to the left word, xoring the key into the left word, then and xoring the left word to the right word. The number of rounds depends on the parameters selected.

There’s some interesting speculation on the reasons for publishing Simon and Speck in post to a crypto mailing list Speculation on the origin of Speck and Simon

There are 2 implementations provided here. 1 will use 64-bit block size, 128-bit key and 27 rounds since this all fits easily onto 32-bit mode of x86 architecture. The other uses 128-bit block size, 256-bit key and 34 rounds since this all fits easily onto x86-64 mode of x86 architecture.

Key schedule

void speck_setkey(void *in, void *out)
{
  uint32_t i, t, k0, k1, k2, k3;

  // copy 128-bit key to local space
  k0 = ((uint32_t*)in)[0];
  k1 = ((uint32_t*)in)[1];
  k2 = ((uint32_t*)in)[2];
  k3 = ((uint32_t*)in)[3];
    
  // expand 128-bit key into round keys
  for (i=0; i<SPECK_RNDS; i++)
  {
    ((uint32_t*)out)[i] = k0;
    
    k1 = (ROTR32(k1, 8) + k0) ^ i;
    k0 = ROTL32(k0, 3) ^ k1;
    
    // rotate left 32-bits
    XCHG(k3, k2, t);
    XCHG(k3, k1, t);
  }
}
%define SPECK_RNDS 27
    
%define k0 eax    
%define k1 ebx    
%define k2 ebp    
%define k3 edx
    
speck_setkeyx:
_speck_setkeyx:
    pushad
    mov    esi, [esp+32+4]   ; esi = in
    mov    edi, [esp+32+8]   ; edi = ks
    lodsd
    xchg   eax, k3
    lodsd
    xchg   eax, k1
    lodsd
    xchg   eax, k2
    lodsd
    xchg   eax, k3
    xor    ecx, ecx
spk_sk:
    ; ((uint32_t*)ks)[i] = k0;
    stosd
    ; k1 = (ROTR32(k1, 8) + k0) ^ i;
    ror    k1, 8
    add    k1, k0
    xor    k1, ecx
    ; k0 = ROTL32(k0, 3) ^ k1;
    rol    k0, 3
    xor    k0, k1
    ; rotate left 32-bits
    xchg   k3, k2
    xchg   k3, k1
    ; i++
    inc    ecx
    cmp    cl, SPECK_RNDS    
    jnz    spk_sk   
    popad
    ret

Encryption/Decryption

void speck_encrypt(void *in, void *keys, int enc)
{
  uint8_t i;
  uint32_t *ks=(uint32_t*)keys;
  
  // copy input to local space
  uint32_t x0=((uint32_t*)in)[0];
  uint32_t x1=((uint32_t*)in)[1];
  
  for (i=0; i<SPECK_RNDS; i++)
  {
    if (enc==SPECK_DECRYPT)
    {
      x1 = ROTR32(x0 ^ x1, 3);
      x0 = ROTL32((x0 ^ ks[SPECK_RNDS-1-i]) - x1, 8);        
    } else {
      x0 = (ROTR32(x0, 8) + x1) ^ ks[i];
      x1 = ROTL32(x1, 3) ^ x0;
    }
  }
  // save result
  ((uint32_t*)in)[0] = x0;
  ((uint32_t*)in)[1] = x1;
}
%define x0 eax    
%define x1 ebx
    
speck_encryptx:
_speck_encryptx:
    pushad
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax          ; edi = ks
    lodsd
    xchg   eax, ecx          ; ecx = enc
    lodsd
    xchg   eax, esi          ; esi = in
    push   esi
    lodsd    
    xchg   eax, x1
    lodsd
    xchg   eax, x1
    test   ecx, ecx
    mov    cl, SPECK_RNDS
    jz     spk_e0
spk_d0:
    ; x1 = ROTR32(x1 ^ x0, 3);
    xor    x1, x0
    ror    x1, 3
    ; x0 = ROTL32((x0 ^ ks[SPECK_RNDS-1-i]) - x1, 8);
    xor    x0, [edi+4*ecx-4]
    sub    x0, x1
    rol    x0, 8
    loop   spk_d0
    jmp    spk_end    
spk_e0:
    ; x0 = (ROTR32(x0, 8) + x1) ^ ks[i];
    ror    x0, 8
    add    x0, x1
    xor    x0, [edi]
    scasd
    ; x1 = ROTL32(x1, 3) ^ x0;
    rol    x1, 3
    xor    x1, x0
    loop   spk_e0
spk_end:
    pop    edi
    ; ((uint32_t*)in)[0] = x0;
    stosd
    xchg   eax, x1
    ; ((uint32_t*)in)[1] = x1;
    stosd    
    popad
    ret

Just encryption

Most block ciphers are usually insecure by themselves. Counter Mode (CTR) is recommended for turning a block cipher into a stream cipher which then only requires encryption. Here’s the function with key scheduling and encryption integrated.

speck64_encryptx:
_speck64_encryptx:    
    pushad    
    mov    esi, [esp+32+8]   ; esi = in
    push   esi               ; save
    
    lodsd
    xchg   eax, x0           ; x0 = in[0]
    lodsd
    xchg   eax, x1           ; x1 = in[1]
    
    mov    esi, [esp+32+8]   ; esi = key
    lodsd
    xchg   eax, k0           ; k0 = key[0] 
    lodsd
    xchg   eax, k1           ; k1 = key[1]
    lodsd
    xchg   eax, k2           ; k2 = key[2]
    lodsd 
    xchg   eax, k3           ; k3 = key[3]    
    xor    eax, eax          ; i = 0
spk_el:
    ; x0 = (ROTR32(x0, 8) + x1) ^ k0;
    ror    x0, 8
    add    x0, x1
    xor    x0, k0
    ; x1 = ROTL32(x1, 3) ^ x0;
    rol    x1, 3
    xor    x1, x0
    ; k1 = (ROTR32(k1, 8) + k0) ^ i;
    ror    k1, 8
    add    k1, k0
    xor    k1, eax
    ; k0 = ROTL32(k0, 3) ^ k1;
    rol    k0, 3
    xor    k0, k1    
    xchg   k3, k2
    xchg   k3, k1
    ; i++
    inc    eax
    cmp    al, SPECK_RNDS    
    jnz    spk_el
    
    pop    edi    
    xchg   eax, x0
    stosd
    xchg   eax, x1
    stosd
    popad
    ret

The x86-64 version to support 256-bit keys and 128-bit blocks is only 24 bytes more.

;
; speck128/256 encryption in 88 bytes
;
%ifndef BIN
    global speck128_encryptx
%endif

%define k0 rdi    
%define k1 rbp    
%define k2 rsi    
%define k3 rcx

%define x0 rbx    
%define x1 rdx

speck128_encryptx:   
    push   rbp
    push   rbx
    push   rdi
    push   rsi   

    mov    k0, [rcx   ]      ; k0 = key[0]
    mov    k1, [rcx+ 8]      ; k1 = key[1]
    mov    k2, [rcx+16]      ; k2 = key[2]
    mov    k3, [rcx+24]      ; k3 = key[3]
    
    push   rdx
    mov    x0, [rdx  ]       ; x0 = in[0]
    mov    x1, [rdx+8]       ; x1 = in[1] 
    
    xor    eax, eax          ; i = 0
spk_el:
    ; x1 = (ROTR64(x1, 8) + x0) ^ k0;
    ror    x1, 8
    add    x1, x0
    xor    x1, k0
    ; x0 =  ROTL64(x0, 3) ^ x1;
    rol    x0, 3
    xor    x0, x1
    ; k1 = (ROTR64(k1, 8) + k0) ^ i;
    ror    k1, 8
    add    k1, k0
    xor    k1, rax
    ; k0 = ROTL64(k0, 3) ^ k1;
    rol    k0, 3
    xor    k0, k1
    
    xchg   k3, k2
    xchg   k3, k1
    ; i++
    add    al, 1
    cmp    al, SPECK_RNDS    
    jnz    spk_el
    
    pop    rax
    mov    [rax  ], x0
    mov    [rax+8], x1
    
    pop    rsi
    pop    rdi
    pop    rbx
    pop    rbp
    ret

Summary

instruction set setkey + encrypt + decrypt (bytes) setkey + encrypt (bytes)
x86 105 64
x86-64 132 88

MSVC generated code is 139 bytes using /O2 /Os flags. x86 assembly is 105 bytes for enc+dec or 64 bytes for just enc. Nice algorithm that would be fun for those new to cryptography.

See sources here

Posted in assembly, cryptography, encryption, programming, security | Tagged , , , , | Leave a comment

NOEKEON Block cipher

Introduction

NOEKEON is a 128-bit block cipher designed by Joan Daemen, Michaël Peeters, Gilles Van Assche, Vincent Rijmen and submitted to the NESSIE project in September 2000.

The two ciphers are “direct mode” NOEKEON, to be used for maximum efficiency where related-key attacks are not possible, and “indirect mode” NOEKEON where they are.

Cryptanalysis by Lars Knudsen and Håvard Raddum in April 2001 showed that “indirect mode” NOEKEON was still vulnerable to certain peculiar kinds of related-key cryptanalysis, and showed weaknesses in NOEKEON-variant ciphers which cast doubt on the design strategy behind NOEKEON and thus on its security. As a result, it was not a NESSIE selected algorithm.

The authors of NOEKEON contend in On NOEKEON, no! that the related-key attacks required to break “indirect mode” NOEKEON are not a practical concern, and that it is as a result of deliberate design that NOEKEON is not vulnerable to the attacks that break the variant ciphers; they assert that NOEKEON is still a good and useful cipher.

Noekeon, The Return presented in Jan 2010 argues hardened versions are still suitable for resource constrained environments.

About the code

The C code is derived from the reference sources submitted to NESSIE by the authors. For the initial snippets of code shown here, It should be clear that the endianess of data and key are not converted before and after the encryption/decryption process. Only the reduced version shown at end of post performs conversion.

Omitting the endian conversion on x86 obviously invalidates ciphertext results when comparing with test vectors but it does not, would not affect security of the cipher itself.

Direct and Indirect Mode

The only difference between the two is that with indirect mode we encrypt 16 null bytes using the master key and the resulting ciphertext is used as key for encryption and decryption.

Because it’s such a simple step, I’ve not included it here.

Gamma

Gamma is an involutive (inverse of itself) non-linear mapping that operates on the state.

It can be specified alternatively as a 16-byte S-box applied to each of the boxes of the state and is essentially a function for creating diffusion. I have not investigated if using an sbox would result in smaller code.

void Gamma(uint32_t *a)
{
  uint32_t t;
  
  a[1] ^= ~((a[3]) | (a[2]));
  a[0] ^=   a[2] & a[1];  
  
  t     = a[3]; 
  a[3]  = a[0]; 
  a[0]  = t;
  a[2] ^= a[0] ^ a[1] ^ a[3];
  
  a[1] ^= ~((a[3]) | (a[2]));
  a[0] ^=   a[2] & a[1];  
}

Theta

Theta is a linear mapping that takes the Working Key k and operates on the state.

void Theta(const uint32_t *k, uint32_t *a)
{
  uint32_t t, i;

  t = a[0] ^ a[2]; 
  t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
  
  a[1] ^= t;
  a[3] ^= t;
  
  for (i=0; i<4; i++) {
    a[i] ^= k[i];
  }

  t = a[1] ^ a[3]; 
  t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
  
  a[0] ^= t;
  a[2] ^= t;  

}

Pi1 and Pi2

The shift operations Pi1 and Pi2 perform cyclic shifts of three of the four words over different offsets.

void Pi1(uint32_t *a){
  a[1] = ROTL32(a[1], 1);
  a[2] = ROTL32(a[2], 5);
  a[3] = ROTL32(a[3], 2);
}

void Pi2(uint32_t *a){
  a[1] = ROTR32(a[1], 1);
  a[2] = ROTR32(a[2], 5);
  a[3] = ROTR32(a[3], 2);
}

Round Function

void Round(
    uint32_t *Key, 
    uint32_t *State, 
    uint8_t Constant1, 
    uint8_t Constant2)
{
  State[0] ^= Constant1;
  Theta(Key, State);
  State[0] ^= Constant2;
  Pi1(State);
  Gamma(State);
  Pi2(State);
}

Encryption and Decryption

void Noekeon(void *buf, const void *k, int enc)
{
  int8_t i;
  uint8_t rc=0x80;

  uint32_t nv[4], State[4], Key[4];
  const uint8_t rc_tab[]= 
{ 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 
  0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4 };
  
  memcpy(State, buf, 16);
  memcpy(Key, k, 16);
  
  if (enc==NOEKEON_ENCRYPT)
  {
    for (i=0; i<Nr; i++) {
      Round(Key, State, rc, 0);
      rc = rc_tab[i];
    }
    State[0] ^= rc;
    Theta(Key, State);
  } else {
    memset(nv, 0, 16);
    Theta(nv, Key);

    for (i=Nr-1; i>=0; --i) {
      rc = rc_tab[i];
      Round(Key, State, 0, rc);
    }
    Theta(Key, State);
    State[0] ^= 0x80;
  }  
  memcpy(buf, State, 16);
}

Reduced version

To implement the above code in assembly was slightly disappointing because of the need to apply the Theta function to the key for decryption. That meant keeping Theta as a separate function.

After some tweaking of the code, the following is what eventually gets converted into assembly.

void Theta(
    uint32_t *a, 
    const uint32_t *k)
{
  uint32_t t, i;

  t = a[0] ^ a[2]; 
  
  t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
  
  a[1] ^= t;
  a[3] ^= t;
  
  for (i=0; i<4; i++) {
    a[i] ^= k[i];
  }

  t = a[1] ^ a[3]; 
  t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
  
  a[0] ^= t;
  a[2] ^= t;  

}

void Round(
    uint32_t *s, 
    uint32_t *Key,
    int enc, 
    int rnd, 
    int end)
{
  uint32_t t;
  uint32_t rc1, rc2;
  const uint8_t rc_tab[]=   
{ 0x80,
  0x1B, 0x36, 0x6C, 0xD8, 
  0xAB, 0x4D, 0x9A, 0x2F, 
  0x5E, 0xBC, 0x63, 0xC6, 
  0x97, 0x35, 0x6A, 0xD4 };
  
  rc1 = rc_tab[rnd];
  rc2 = 0;
  if (enc==NOEKEON_DECRYPT) {
    XCHG(rc1, rc2, t);
  }
  
  s[0] ^= rc1;
  Theta(s, Key);
  s[0] ^= rc2;
  
  if (end) return;
  
  //Pi1
  s[1] = ROTL32(s[1], 1);
  s[2] = ROTL32(s[2], 5);
  s[3] = ROTL32(s[3], 2);
  
  // Gamma
  s[1] ^= ~((s[3]) | (s[2]));
  s[0] ^=   s[2] & s[1];  
  
  XCHG(s[0], s[3], t);
  
  s[2] ^= s[0] ^ s[1] ^ s[3];
  
  s[1] ^= ~((s[3]) | (s[2]));
  s[0] ^=   s[2] & s[1];  
  
  // Pi2
  s[1] = ROTR32(s[1], 1);
  s[2] = ROTR32(s[2], 5);
  s[3] = ROTR32(s[3], 2);
}

void swapcpy(
    void *dst, 
    const void *src)
{
  int i;
  for (i=0; i<4; i++) {
    ((uint32_t*)dst)[i] = SWAP32(((uint32_t*)src)[i]);
  }
}

void Noekeon(
    const void *k, 
    void *buf, 
    int enc)
{
  int i;

  uint32_t NullVector[4], State[4], Key[4];
  
  swapcpy(Key, k);
  swapcpy(State, buf);

  if (enc==NOEKEON_ENCRYPT)
  {
    for (i=0; i<=Nr; i++) {
      Round(State, Key, enc, i, i==Nr);
    }
  } else {
    memset(NullVector, 0, 16);
    Theta(Key, NullVector);

    for (i=Nr; i>=0; --i) {
      Round(State, Key, enc, i, i==0);
    }
  }
  swapcpy(buf, State);
}

The assembly code could be optimized further.

struc pushad_t
  _edi resd 1
  _esi resd 1
  _ebp resd 1
  _esp resd 1
  _ebx resd 1
  _edx resd 1
  _ecx resd 1
  _eax resd 1
  .size:
endstruc
    
%define Nr 16
    
%define a0 eax 
%define a1 ebx 
%define a2 ecx 
%define a3 edx

%define t0 ebp
%define t1 esi

Thetax:
_Thetax:
    pushad
    push   esi
    lodsd
    xchg   a3, eax
    lodsd
    xchg   a1, eax
    lodsd
    xchg   a2, eax
    lodsd
    xchg   a3, eax
    ; t = a[0] ^ a[2];
    mov    t0, a0
    xor    t0, a2
    ; t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
    mov    t1, t0
    rol    t1, 8
    xor    t0, t1    
    ror    t1, 16
    xor    t0, t1
    ; a[1] ^= t;    
    xor    a1, t0
    ; a[3] ^= t;
    xor    a3, t0
    ; a[0] ^= k[0];
    xor    a0, [edi]
    ; a[1] ^= k[1];
    xor    a1, [edi+ 4]
    ; a[2] ^= k[2];
    xor    a2, [edi+ 8]
    ; a[3] ^= k[3];
    xor    a3, [edi+12]
    ; t = a[1] ^ a[3];
    mov    t0, a1
    xor    t0, a3
    ; t ^= ROTR32(t, 8) ^ ROTL32(t, 8);
    mov    t1, t0
    rol    t1, 8
    xor    t0, t1    
    ror    t1, 16
    xor    t0, t1
    ; a[0] ^= t;    
    xor    a0, t0
    ; a[2] ^= t;
    xor    a2, t0
    pop    edi
    stosd
    xchg   eax, a1
    stosd
    xchg   eax, a2
    stosd
    xchg   eax, a3
    stosd
    popad
    ret
   
; esi = State
; edi = Key
; ecx = enc
; eax = rnd   
; CF  = end
Round:
    pushad
    pushfd
    call   nx_rc
    db     0x80
    db     0x1B, 0x36, 0x6C, 0xD8 
    db     0xAB, 0x4D, 0x9A, 0x2F 
    db     0x5E, 0xBC, 0x63, 0xC6 
    db     0x97, 0x35, 0x6A, 0xD4
nx_rc:
    pop    ebx
    xlatb
    jecxz  nxr_enc
    xchg   al, ah
nxr_enc:
    setne  cl
    ; State[0] ^= Constant1;
    xor    byte[esi], al
    ; Theta(State, Key);
    call   Thetax
    ; State[0] ^= Constant2;
    xor    byte[esi], ah
    jecxz  nxr_end
    mov    edi, esi    
    lodsd
    xchg   a3, eax
    lodsd
    xchg   a1, eax
    lodsd
    xchg   a2, eax
    lodsd
    xchg   a3, eax
Pi1:
    rol    a1, 1
    rol    a2, 5
    rol    a3, 2
Gamma:
    ; a[1] ^= ~((a[3]) | (a[2]));
    mov    ebp, a3
    or     ebp, a2
    not    ebp
    xor    a1, ebp
    ; a[0] ^=   a[2] & a[1];
    mov    ebp, a2
    and    ebp, a1
    xor    a0, ebp
    ; XCHG(a[0], a[3], t);
    xchg   a0, a3
    ; a[2] ^= a[0] ^ a[1] ^ a[3];
    xor    a2, a0
    xor    a2, a1
    xor    a2, a3
    ; a[1] ^= ~((a[3]) | (a[2]));
    mov    ebp, a3
    or     ebp, a2
    not    ebp
    xor    a1, ebp  
    ; a[0] ^=   a[2] & a[1];
    mov    ebp, a2
    and    ebp, a1
    xor    a0, ebp    
Pi2:
    ror    a1, 1
    ror    a2, 5
    ror    a3, 2
    stosd
    xchg   eax, a1
    stosd
    xchg   eax, a2
    stosd
    xchg   eax, a3
    stosd
nxr_end: 
    popfd   
    popad
    ret
    
swapcpy:
    pushad
    push   4
    pop    ecx
swp_l0:
    lodsd
    bswap  eax
    stosd
    loop   swp_l0
    mov    [esp+_edi], edi    
    popad
    ret
    
Noekeonx:
_Noekeonx:
    pushad
    lea    esi, [esp+32+4]
    pushad     ; allocate 32-bytes for state + key
    mov    edi, esp
    lodsd
    ; copy key to local buffer
    xchg   eax, esi
    call   swapcpy
    xchg   eax, esi
    lodsd
    ; copy data to local buffer
    xchg   eax, esi
    call   swapcpy
    xchg   eax, esi
    mov    ebp, eax
    lodsd
    cdq                      ; edx = 0
    xchg   ecx, eax          ; ecx = enc
    xchg   eax, edx          ; eax = 0
    mov    edi, esp          ; edi = Key
    lea    esi, [edi+16]     ; esi = State
    jecxz  nx_enc
    
    ; allocate a 16 null byte key
    pushad    
    push   eax
    push   eax
    push   eax
    push   eax
    mov    esi, edi          ; esi = Key
    mov    edi, esp          ; edi = NullVector 
    call   Thetax            ; Theta(Key, NullVector);
    add    esp, 16           ; release NullVector
    popad
    mov    al, Nr            ; i   = Nr    
nx_dec:
    test   eax, eax    
    call   Round
    dec    eax
    jns    nx_dec
    jmp    nx_xit    
nx_enc:
    cmp    al, Nr
    call   Round       ; Round(State, Key, enc, i, i==0);
    lea    eax, [eax+1]
    jnz    nx_enc
nx_xit:
    mov    edi, ebp
    call   swapcpy    
    popad
    popad
    ret

Summary

The C code compiled with MSVC using /O2 /Os flags resulted in 431 bytes but could be significantly smaller without endian conversion. The assembly code is currently 292 bytes although may shrink in future.

See here for sources and any future updates.

Posted in assembly, cryptography, encryption, programming, security | Tagged , , , | Leave a comment

Chaskey Message Authentication Code

Introduction

Chaskey is a lightweight MAC algorithm optimised for 32-bit micro-controllers designed by Nicky Mouha, Bart Mennink, Anthony Van Herrewege, Dai Watanabe, Bart Preneel and Ingrid Verbauwhede.

It is based on a 128-bit block cipher, the Chaskey cipher, which uses ARX operations and an Even-Mansour structure.

What follows is an implementation in C and x86 assembly optimized for size.

State structure

typedef union state_t {
  uint8_t  b[16];
  uint32_t w[4];
} state;

Key scheduling

void chas_setkey(void *out, void *in) 
{
  int      i;
  uint32_t *k=(uint32_t*)out;
  
  memcpy (out, in, 16);
  
  for (i=0; i<2; i++)
  {
    k[4] = (k[0] << 1);     
    k[4] ^= 0x87 * (k[3] >> 31);    
    k[5] = (k[1] << 1) | (k[0] >> 31); 
    k[6] = (k[2] << 1) | (k[1] >> 31); 
    k[7] = (k[3] << 1) | (k[2] >> 31);
    
    k += 4;    
  }
}

Assembly could possibly do with better optimization.

%define k0 ebp
%define k1 ebx
%define k2 ecx
%define k3 edx
    
chas_setkeyx:
_chas_setkeyx:
    pushad
    mov    edi, [esp+32+4]   ; edi = out
    mov    esi, [esp+32+8]   ; esi = in
    push   edi
    movsd
    movsd
    movsd
    movsd
    pop    esi
    clc
sk_l0:
    pushfd

    lodsd
    xchg   eax, k0
    lodsd
    xchg   eax, k1
    lodsd
    xchg   eax, k2
    lodsd
    xchg   eax, k3
 
    push   k3
    lea    eax, [k0+k0]
    add    k3, k3
    sbb    dl, dl
    and    dl, 0x87
    xor    al, dl   
    pop    k3
    stosd
    
    lea    eax, [k1+k1]
    shr    k0, 31
    or     eax, k0
    stosd
    
    lea    eax, [k2+k2]
    shr    k1, 31
    or     eax, k1
    stosd
    
    lea    eax, [k3+k3]
    shr    k2, 31
    or     eax, k2
    stosd

    popfd
    cmc
    jc     sk_l0
    
    popad
    ret

Key whitening

void chas_xor(state *out, const void *in, int len) {
  int i;

  for (i=0; i<len; i++) {
    out->b[i] ^= ((uint8_t*)in)[i];
  }
}
; ecx = length
; esi = input
; edi = v   
chas_xor:
    pushad
    jecxz  cx_l1
cx_l0:    
    mov    al, [esi]
    xor    [edi], al
    cmpsb
    loop   cx_l0
cx_l1:    
    popad
    ret

Permutation function

This is derived from SipHash permutation function which is derived from chacha and salsa stream ciphers.

void chas_permute(uint32_t v[])
{
  int i=12;
  
  do
  {
    v[0] += v[1]; 
    v[1]=ROTL32(v[1], 5); 
    v[1] ^= v[0]; 
    v[0]=ROTL32(v[0],16); 
    v[2] += v[3]; 
    v[3]=ROTL32(v[3], 8); 
    v[3] ^= v[2]; 
    v[0] += v[3]; 
    v[3]=ROTL32(v[3],13); 
    v[3] ^= v[0]; 
    v[2] += v[1]; 
    v[1]=ROTL32(v[1], 7); 
    v[1] ^= v[2]; 
    v[2]=ROTL32(v[2],16); 
  } while (--i);
}

Assembly code is straight forward but perhaps there’s room to optimize further.

%define v0 eax    
%define v1 edx    
%define v2 ebp    
%define v3 ebx
    
; ecx = 16    
; edi = v
chas_permute:
    pushad
    mov    cl, 12
    mov    esi, edi
    lodsd
    xchg   eax, v3
    lodsd
    xchg   eax, v1
    lodsd
    xchg   eax, v2
    lodsd
    xchg   eax, v3
cp_l0:
    add    v0, v1            ; v[0] += v[1];
    rol    v1, 5             ; v[1] = ROTL(v[1], 5);
    xor    v1, v0            ; v[1] ^= v[0];
    rol    v0, 16            ; v[0] = ROTL(v[0], 16);
    add    v2, v3            ; v[2] += v[3]; 
    rol    v3, 8             ; v[3] = ROTL(v[3], 8); 
    xor    v3, v2            ; v[3] ^= v[2]; 
    add    v0, v3            ; v[0] += v[3];
    rol    v3, 13            ; v[3] = ROTL(v[3], 13);
    xor    v3, v0            ; v[3] ^= v[0];
    add    v2, v1            ; v[2] += v[1];
    rol    v1, 7             ; v[1] = ROTL(v[1],  7);
    xor    v1, v2            ; v[1] ^= v[2];
    rol    v2, 16            ; v[2] = ROTL(v[2], 16);
    loop   cp_l0
    stosd
    xchg   eax, v1
    stosd
    xchg   eax, v2
    stosd
    xchg   eax, v3
    stosd
    popad   
    ret

MAC generation

void chas_mac (uint8_t *tag, 
    uint8_t *msg, uint32_t msglen, uint8_t *key) 
{
  state v;
  int   len;
  
  // copy 16 bytes of key
  memcpy(v.b, key, 16);

  // absorb message 
  for (;;)
  {
    len = (msglen < 16) ? msglen : 16;
    
    // xor state with msg data
    chas_xor(&v, msg, len);

    // final?
    if (msglen <= 16) {
      if (msglen < 16) {
        // final? add padding bit
        v.b[msglen] ^= 0x01;
      }
      key += (msglen == 16) ? 16 : 32;
      break;
    }    
    
    // apply permutation function
    chas_permute(v.w);
    
    // update position and length
    msg += 16;
    msglen -= 16;
  }

  // pre-whiten
  chas_xor(&v, key, 16);
  // permute
  chas_permute(v.w);
  // post-whiten
  chas_xor(&v, key, 16);
  // return tag
  memcpy(tag, v.b, 16);
}

Assembly code

; chaskey    
chas_macx:
_chas_macx:
    pushad
    lea    esi, [esp+32+4]
    pushad                   ; allocate 32 bytes
    mov    edi, esp          ; edi = v
    lodsd
    xchg   eax, ebp          ; ebp = tag ptr
    lodsd
    xchg   eax, ebx          ; ebx = msg ptr
    lodsd
    xchg   edx, eax          ; edx = msglen
    lodsd
    xchg   eax, esi          ; esi = key

    ; memcpy(v, &key[0], 16);
    push   16
    pop    ecx
    push   edi               ; save v
    rep    movsb
    pop    edi               ; restore v
    push   esi               ; save &key[16]
    mov    esi, ebx          ; esi = msg    
    ; absorb message
cm_l0:
    mov    cl, 16
    ; len = (msglen < 16) ? msglen : 16;
    cmp    edx, ecx
    cmovb  ecx, edx
    
    ; chas_xor(&v, msg, len);
    call   chas_xor
    mov    cl, 16
    
    ; if (msglen <= 16)
    cmp    edx, ecx
    jbe    cm_l2
    
    call   chas_permute

    ; msglen -= 16
    sub    edx, ecx
    ; msg += 16
    add    esi, ecx
    
    jmp    cm_l0
cm_l2:    
    pop    esi
    ; if (msglen < 16)
    je     cm_l4    
    ; v.b[msglen] ^= 0x01;
    xor    byte[edi+edx], 0x01
    ; load key + 32
    add    esi, ecx
cm_l4:
    ; chas_xor(v, key, 16);
    call   chas_xor
    ; chas_permute(v);
    call   chas_permute
    ; chas_xor(v, key, 16);
    call   chas_xor
    
    ; memcpy(tag, v, 16);
    mov    esi, edi
    mov    edi, ebp
    rep    movsb
        
    popad
    popad
    ret

Summary

The MSVC generated code resulted in 346 bytes using /Os /O2 flags and the x86 assembly is currently 234 bytes. The underlying block cipher doesn’t require key scheduling and this code could be reduced much further if you only needed that functionality. See sources here for future updates.

Posted in assembly, cryptography, encryption, programming, security | Tagged , , , , | 2 Comments