Asmcodes: SM3 Cryptographic Hash Algorithm (Chinese Standard)

Introduction

In December of 2007, the Chinese National Cryptographic Administration Bureau released the specification of a Trusted Cryptography Module, detailing a cryptoprocessor to be used within the Trusted Computing framework in China.

The module specifies a set of cryptographic algorithms that include the SM4 block cipher, the SM2 asymmetric algorithm and SM3 cryptographic hash algorithm.

I’ve discussed SM4 in an earlier post here.

SM3 uses a Merkle-Damgard construction similar to MD5 or SHA-2 that processes 512-bit input message blocks and returns a 256-bit hash value.

It was designed by Mathematician and cryptographer Xiaoyun Wang who is responsible for discovering attacks against many cryptographic hash functions, most notably MD5 and SHA-1.

At the CRYPTO 2004 conference, she and co-authors demonstrated collision attacks against MD5 and SHA-0. Attacks against SHA-1 were published later in 2005.

Unfortunately there’s no explanation for how the changes made to SHA-2 strengthen it against attacks so I’ll briefly cover some of the minor differences between SM3 and SHA-2

Initialization

In the original SHA-2 specification, the 8 initialization values are the first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19). See here for more detailed description of how to create these.

H_0^{(0)}\: =\: 6a09e667
H_1^{(0)}\: =\: bb67ae85
H_2^{(0)}\: =\: 3c6ef372
H_3^{(0)}\: =\: a54ff53a
H_4^{(0)}\: =\: 510e527f
H_5^{(0)}\: =\: 9b05688c
H_6^{(0)}\: =\: 1f83d9ab
H_7^{(0)}\: =\: 5be0cd19

In SM3, the following 8 values are used:

H_0^{(0)}\: =\: 7380166f
H_1^{(0)}\: =\: 4914b2b9
H_2^{(0)}\: =\: 172442d7
H_3^{(0)}\: =\: da8a0600
H_4^{(0)}\: =\: a96f30bc
H_5^{(0)}\: =\: 163138aa
H_6^{(0)}\: =\: e38dee4d
H_7^{(0)}\: =\: b0fb0e4e

Unfortunately, there’s no explanation for why or how these values were selected.

init

Updating and Finalization

The update and final processing are exactly the same as SHA-2 which is based on the original design for MD4 by Ron Rivest. I will not explain here since the main changes take place inside the compression function.

Message Expansion

The 512-bit message is expanded into 68 32-bit words. The following 2 linear functions are defined:

P_1 for expansion, P_0 for compression

P_0(X) = X\oplus (X\lll 9)\oplus (X\lll 17)
P_1(X) = X\oplus (X\lll 15)\oplus (X\lll 23)

p_aux

\text{for}\: i=16\: \text{to}\: 67\\  \hphantom{---}W_i\leftarrow P_1(W_{i-16}\oplus W_{i-9}\oplus(W_{i-3}\lll 15))\oplus (W_{i-13}\lll 7)\oplus   W_{i-6}\\  \text{endfor}

\text{for}\: i = 0\: \text{to}\: 63\\  \hphantom{---}W_i'=W_i\oplus W_{i+4}\\  \text{endfor}

exp_code

The additional 4 words in SM3 are required for a separate expansion that I’ve inlined with the compression function to try minimize code usage.

W_i'=W_i\oplus W_{i+4}\: 0\leq i < 64.

You can see in main compression loop while creating TT1 and TT2.

exp_code2

Compression

For SHA-2, 64 words are used during compression which are described here in detail.

Both use the same sequence of sixty-four constant 32-bit words, K0{256}, K1{256}, …,K63{256} These words represent the first thirty-two bits of the fractional parts of the cube roots of the first sixty-four prime numbers. In hex, these constant words are (from left to right)

For SM3, only 2 constants are used. One is selected depending on the position of loop counter, then a circular shift by loop counter is applied before adding into the state. This certainly helps make SM3 more compact than SHA-2 although I can’t say if it strengthens in some way.

T_i =  \begin{cases}  79cc4519 & 0\leq i \leq 15 \\[1ex]  7a879d8a & 16\leq i \leq 63  \end{cases}

t_code

There are 3 boolean operations used during the compression which are similar to bitwise majority MAJ and bitwise choice CH operations in SHA-2.

FF_i(X,Y,Z) =  \begin{cases}  X\oplus Y\oplus Z  & 0\leq i \leq 15 \\[1ex]  (X\wedge Y)\vee (X\wedge Z)\vee (Y\wedge Z) & 16\leq i \leq 63  \end{cases}

GG_i(X,Y,Z) =  \begin{cases}  X\oplus Y\oplus Z  & 0\leq i \leq 15 \\[1ex]  (X\wedge Y) \vee (\neg X \wedge Z) & 16\leq i \leq 63  \end{cases}

The GG operation is changed to use less instructions, the original is.

orig_bool

The new one along with others.

bool_code

ABCDEFGH\leftarrow V^{(i)}
\begin{aligned}  \text{for}\; & i= 0\; \text{to}\: 63\\  & SS_1\leftarrow ((A\lll 12)+E+(T_i\lll i))\lll 7\\  & SS_2\leftarrow SS_1\oplus(A\lll 12)\\  & TT_1\leftarrow FF_i(A,B,C)+D+SS_2+W_i'\\  & TT_2\leftarrow GG_i(E,F,G)+H+SS_1+W_i\\  & D\leftarrow C\\  & C\leftarrow B\lll 9\\  & B\leftarrow A\\  & A\leftarrow TT_1\\  & H\leftarrow G\\  & G\leftarrow F\lll 19\\  & F\leftarrow E\\  & E\leftarrow P_0(TT_2)\\  \text{endfor}&\;  \end{aligned}

Updating state

At the end of compression function in SHA-2, the input hash is updated by adding the output from compression.

H_0^{(i)}=a+H_0^{(i-1)}
H_1^{(i)}=b+H_1^{(i-1)}
H_2^{(i)}=c+H_2^{(i-1)}
H_3^{(i)}=d+H_3^{(i-1)}
H_4^{(i)}=e+H_4^{(i-1)}
H_5^{(i)}=f+H_5^{(i-1)}
H_6^{(i)}=g+H_6^{(i-1)}
H_7^{(i)}=h+H_7^{(i-1)}

In SM3, it is instead XOR’d which follows a Davies-Meyer idea for compression functions.

H_0^{(i)}=a\oplus H_0^{(i-1)}
H_1^{(i)}=b\oplus H_1^{(i-1)}
H_2^{(i)}=c\oplus H_2^{(i-1)}
H_3^{(i)}=d\oplus H_3^{(i-1)}
H_4^{(i)}=e\oplus H_4^{(i-1)}
H_5^{(i)}=f\oplus H_5^{(i-1)}
H_6^{(i)}=g\oplus H_6^{(i-1)}
H_7^{(i)}=h\oplus H_7^{(i-1)}

The output state from compression is XORed with the previous hash value to produce the next hash value. In the first round when there is no previous hash value it uses a constant pre-specified initial values.

H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}.

Here’s the full compression function.

void SM3_Transform (SM3_CTX *ctx) 
{
    uint32_t tt1, tt2, i, t, ss1, ss2, x, y;
    uint32_t w[68], s[8];

    #define a s[0]
    #define b s[1]
    #define c s[2]
    #define d s[3]
    #define e s[4]
    #define f s[5]
    #define g s[6]
    #define h s[7]
    
    // load state into local buffer
    memcpy((uint8_t*)&s[0], (uint8_t*)&ctx->s.w[0], 8*4);
    
    // load data in big endian format
    for (i=0; i<16; i++) {
      w[i] = SWAP32(ctx->buf.w[i]);
    }

    // expand message
    for (i=16; i<68; i++) {
      x = ROTL32(w[i- 3], 15);
      y = ROTL32(w[i-13],  7);
      
      x ^= w[i-16];
      x ^= w[i- 9];
      y ^= w[i- 6];
      
      w[i] = P1(x) ^ y; 
    }

    // compression function
    for (i=0; i<64; i++) {
      t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;
      
      ss2 = ROTL32(a, 12);      
      ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
      ss2 ^= ss1;
      
      tt1 = d + ss2 + (w[i] ^ w[i+4]);
      tt2 = h + ss1 + w[i];
      
      if (i < 16) {
        tt1 += F(a, b, c);
        tt2 += F(e, f, g);
      } else {
        tt1 += FF(a, b, c);
        tt2 += GG(e, f, g);       
      }
      d = c;
      c = ROTL32(b, 9);
      b = a;
      a = tt1;
      h = g;
      g = ROTL32(f, 19);
      f = e;
      e = P0(tt2); 
    }
    
    // Davies–Meyer idea for compression function
    for (i=0; i<8; i++) {
      ctx->s.w[i] ^= s[i];
    }    
    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
    #undef g
    #undef h
}

And assembly just to illustrate

%define _a dword[edi+0*4]
%define _b dword[edi+1*4]
%define _c dword[edi+2*4]
%define _d dword[edi+3*4]
%define _e dword[edi+4*4]
%define _f dword[edi+5*4]
%define _g dword[edi+6*4]
%define _h dword[edi+7*4]

_SM3_Transformx:
    pushad

    ; load state into esi
    lea     esi, [ebx+state]
    push    esi  ; save for later

    ; allocate 512 bytes
    push    64
    pop     ecx
    shl     ecx, 3
    sub     esp, ecx

    ; load state into local buffer
    mov     edi, esp
    push    8
    pop     ecx
    rep     movsd

    ; store message in big endian format
    mov     cl, 16
s3t_l0:
    lodsd
    bswap   eax
    stosd
    loop    s3t_l0

    ; expand message
    mov     cl, 68-16
s3t_l1:
    ; x = ROTL32(w[i-3], 15);
    mov     eax, [edi- 3*4]
    rol     eax, 15
    ; y = ROTL32(w[i-13], 7);
    mov     ebx, [edi-13*4]
    rol     ebx, 7
    ; x ^= w[i-16];
    xor     eax, [edi-16*4]
    ; x ^= w[i-9];
    xor     eax, [edi- 9*4]
    ; y ^= w[i-6];
    xor     ebx, [edi- 6*4]
    ; x ^= ROTL32(x, 15) ^ ROTL32(x, 23);
    mov     edx, eax
    rol     edx, 15
    xor     eax, edx
    rol     edx, 23-15
    xor     eax, edx
    ; x ^= y;
    xor     eax, ebx
    ; w[i] = x;
    stosd
    loop    s3t_l1

    ; permute message
    mov     edi, esp
s3t_l2:
    ; t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;
    cmp     ecx, 16
    sbb     eax, eax
    and     eax, 0x79cc4519 - 0x7a879d8a
    add     eax, 0x7a879d8a
    ; ss2 = ROTL32(a, 12);
    mov     ebx, _a
    rol     ebx, 12
    ; ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
    rol     eax, cl
    add     eax, ebx
    add     eax, _e
    rol     eax, 7
    ; ss2 ^= ss1;
    xor     ebx, eax
    ; tt1 = d + ss2 + (w[i] ^ w[i+4]);
    mov     ebp, eax         ; save ss1
    mov     esi, [edi+4*ecx+32]       ; esi = w[i]
    mov     edx, esi         ; save w[i]
    xor     esi, [edi+4*ecx+32+16]    ; esi ^= w[i+4]
    add     esi, _d
    lea     eax, [esi+ebx]   ; set tt1
    ; tt2 = h + ss1 + w[i];
    lea     ebx, [edx+ebp]
    add     ebx, _h
    ; if (i < 16) {
    cmp     ecx, 16
    jae     s3t_l3
    ; tt1 += F(a, b, c);
    mov     edx, _c
    xor     edx, _b
    xor     edx, _a
    add     eax, edx
    ; tt2 += F(e, f, g);
    mov     edx, _g
    xor     edx, _f
    xor     edx, _e
    add     ebx, edx
    jmp     s3t_l4
s3t_l3:
    ; tt1 += FF(a, b, c);
    mov     edx, _b
    or      edx, _a
    mov     ebp, _b
    and     edx, _c
    and     ebp, _a
    or      edx, ebp
    add     eax, edx
    ; tt2 += GG(e, f, g);
    mov     edx, _g
    xor     edx, _f
    and     edx, _e
    xor     edx, _g
    add     ebx, edx
s3t_l4:
    ; d = c;
    mov     edx, _c
    mov     _d, edx
    ; c = ROTL32(b, 9);
    mov     edx, _b
    rol     edx, 9
    mov     _c, edx
    ; b = a;
    mov     edx, _a
    mov     _b, edx
    ; a = tt1;
    mov     _a, eax
    ; h = g;
    mov     edx, _g
    mov     _h, edx
    ; g = ROTL32(f, 19);
    mov     edx, _f
    rol     edx, 19
    mov     _g, edx
    ; f = e;
    mov     edx, _e
    mov     _f, edx
    ; e = P0(tt2);
    ; e = x ^ ROTL32(x,  9) ^ ROTL32(x, 17)
    mov     edx, ebx
    rol     edx, 9
    xor     ebx, edx
    rol     edx, 17-9
    xor     ebx, edx
    mov     _e, ebx

    inc     ecx
    cmp     ecx, 64
    jnz     s3t_l2

    mov     esi, esp
    lea     esp, [esi+ecx*8]

    pop     edi
    mov     cl, 8
s3t_l5:
    lodsd
    xor     eax, [edi]
    stosd
    loop    s3t_l5
    popad
    ret

Summary

Size of C generated assembly using /O2 /Os switches is approx. 609 bytes. The x86 assembly is approx. 470 bytes. See sources here.

Thanks to 0x4d_ for \LaTeX equations.

Advertisements
This entry was posted in assembly, cryptography, programming, security and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s