## SM3 Cryptographic Hash Algorithm (Chinese Standard)

### Introduction

In December of 2007, the Chinese National Cryptographic Administration Bureau released the specification of a Trusted Cryptography Module, detailing a cryptoprocessor to be used within the Trusted Computing framework in China.

The module specifies a set of cryptographic algorithms that include the SM4 block cipher, the SM2 asymmetric algorithm and SM3 cryptographic hash algorithm.

I’ve discussed SM4 in an earlier post here.

SM3 uses a Merkle-Damgard construction similar to MD5 or SHA-2 that processes 512-bit input message blocks and returns a 256-bit hash value.

It was designed by Mathematician and cryptographer Xiaoyun Wang who is responsible for discovering attacks against many cryptographic hash functions, most notably MD5 and SHA-1.

At the CRYPTO 2004 conference, she and co-authors demonstrated collision attacks against MD5 and SHA-0. Attacks against SHA-1 were published later in 2005.

Unfortunately there’s no explanation for how the changes made to SHA-2 strengthen it against attacks so I’ll briefly cover some of the minor differences between SM3 and SHA-2

### Initialization

In the original SHA-2 specification, the 8 initialization values are the first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19). See here for more detailed description of how to create these.

$H_0^{(0)}\: =\: 6a09e667$
$H_1^{(0)}\: =\: bb67ae85$
$H_2^{(0)}\: =\: 3c6ef372$
$H_3^{(0)}\: =\: a54ff53a$
$H_4^{(0)}\: =\: 510e527f$
$H_5^{(0)}\: =\: 9b05688c$
$H_6^{(0)}\: =\: 1f83d9ab$
$H_7^{(0)}\: =\: 5be0cd19$

In SM3, the following 8 values are used:

$H_0^{(0)}\: =\: 7380166f$
$H_1^{(0)}\: =\: 4914b2b9$
$H_2^{(0)}\: =\: 172442d7$
$H_3^{(0)}\: =\: da8a0600$
$H_4^{(0)}\: =\: a96f30bc$
$H_5^{(0)}\: =\: 163138aa$
$H_6^{(0)}\: =\: e38dee4d$
$H_7^{(0)}\: =\: b0fb0e4e$

Unfortunately, there’s no explanation for why or how these values were selected.

### Updating and Finalization

The update and final processing are exactly the same as SHA-2 which is based on the original design for MD4 by Ron Rivest. I will not explain here since the main changes take place inside the compression function.

### Message Expansion

The 512-bit message is expanded into 68 32-bit words. The following 2 linear functions are defined:

$P_1$ for expansion, $P_0$ for compression

$P_0(X) = X\oplus (X\lll 9)\oplus (X\lll 17)$
$P_1(X) = X\oplus (X\lll 15)\oplus (X\lll 23)$

$\text{for}\: i=16\: \text{to}\: 67\\ \hphantom{---}W_i\leftarrow P_1(W_{i-16}\oplus W_{i-9}\oplus(W_{i-3}\lll 15))\oplus (W_{i-13}\lll 7)\oplus W_{i-6}\\ \text{endfor}$

$\text{for}\: i = 0\: \text{to}\: 63\\ \hphantom{---}W_i'=W_i\oplus W_{i+4}\\ \text{endfor}$

The additional 4 words in SM3 are required for a separate expansion that I’ve inlined with the compression function to try minimize code usage.

$W_i'=W_i\oplus W_{i+4}\: 0\leq i < 64.$

You can see in main compression loop while creating TT1 and TT2.

### Compression

For SHA-2, 64 words are used during compression which are described here in detail.

Both use the same sequence of sixty-four constant 32-bit words, K0{256}, K1{256}, …,K63{256} These words represent the first thirty-two bits of the fractional parts of the cube roots of the first sixty-four prime numbers. In hex, these constant words are (from left to right)

For SM3, only 2 constants are used. One is selected depending on the position of loop counter, then a circular shift by loop counter is applied before adding into the state. This certainly helps make SM3 more compact than SHA-2 although I can’t say if it strengthens in some way.

$T_i = \begin{cases} 79cc4519 & 0\leq i \leq 15 \\[1ex] 7a879d8a & 16\leq i \leq 63 \end{cases}$

There are 3 boolean operations used during the compression which are similar to bitwise majority MAJ and bitwise choice CH operations in SHA-2.

$FF_i(X,Y,Z) = \begin{cases} X\oplus Y\oplus Z & 0\leq i \leq 15 \\[1ex] (X\wedge Y)\vee (X\wedge Z)\vee (Y\wedge Z) & 16\leq i \leq 63 \end{cases}$

$GG_i(X,Y,Z) = \begin{cases} X\oplus Y\oplus Z & 0\leq i \leq 15 \\[1ex] (X\wedge Y) \vee (\neg X \wedge Z) & 16\leq i \leq 63 \end{cases}$

The GG operation is changed to use less instructions, the original is.

The new one along with others.

$ABCDEFGH\leftarrow V^{(i)}$
\begin{aligned} \text{for}\; & i= 0\; \text{to}\: 63\\ & SS_1\leftarrow ((A\lll 12)+E+(T_i\lll i))\lll 7\\ & SS_2\leftarrow SS_1\oplus(A\lll 12)\\ & TT_1\leftarrow FF_i(A,B,C)+D+SS_2+W_i'\\ & TT_2\leftarrow GG_i(E,F,G)+H+SS_1+W_i\\ & D\leftarrow C\\ & C\leftarrow B\lll 9\\ & B\leftarrow A\\ & A\leftarrow TT_1\\ & H\leftarrow G\\ & G\leftarrow F\lll 19\\ & F\leftarrow E\\ & E\leftarrow P_0(TT_2)\\ \text{endfor}&\; \end{aligned}

### Updating state

At the end of compression function in SHA-2, the input hash is updated by adding the output from compression.

$H_0^{(i)}=a+H_0^{(i-1)}$
$H_1^{(i)}=b+H_1^{(i-1)}$
$H_2^{(i)}=c+H_2^{(i-1)}$
$H_3^{(i)}=d+H_3^{(i-1)}$
$H_4^{(i)}=e+H_4^{(i-1)}$
$H_5^{(i)}=f+H_5^{(i-1)}$
$H_6^{(i)}=g+H_6^{(i-1)}$
$H_7^{(i)}=h+H_7^{(i-1)}$

In SM3, it is instead XOR’d which follows a Davies-Meyer idea for compression functions.

$H_0^{(i)}=a\oplus H_0^{(i-1)}$
$H_1^{(i)}=b\oplus H_1^{(i-1)}$
$H_2^{(i)}=c\oplus H_2^{(i-1)}$
$H_3^{(i)}=d\oplus H_3^{(i-1)}$
$H_4^{(i)}=e\oplus H_4^{(i-1)}$
$H_5^{(i)}=f\oplus H_5^{(i-1)}$
$H_6^{(i)}=g\oplus H_6^{(i-1)}$
$H_7^{(i)}=h\oplus H_7^{(i-1)}$

The output state from compression is XORed with the previous hash value to produce the next hash value. In the first round when there is no previous hash value it uses a constant pre-specified initial values.

$H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}.$

Here’s the full compression function.

void SM3_Transform (SM3_CTX *ctx)
{
uint32_t tt1, tt2, i, t, ss1, ss2, x, y;
uint32_t w[68], s[8];

#define a s[0]
#define b s[1]
#define c s[2]
#define d s[3]
#define e s[4]
#define f s[5]
#define g s[6]
#define h s[7]

// load state into local buffer
memcpy((uint8_t*)&s[0], (uint8_t*)&ctx->s.w[0], 8*4);

// load data in big endian format
for (i=0; i<16; i++) {
w[i] = SWAP32(ctx->buf.w[i]);
}

// expand message
for (i=16; i<68; i++) {
x = ROTL32(w[i- 3], 15);
y = ROTL32(w[i-13],  7);

x ^= w[i-16];
x ^= w[i- 9];
y ^= w[i- 6];

w[i] = P1(x) ^ y;
}

// compression function
for (i=0; i<64; i++) {
t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;

ss2 = ROTL32(a, 12);
ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
ss2 ^= ss1;

tt1 = d + ss2 + (w[i] ^ w[i+4]);
tt2 = h + ss1 + w[i];

if (i < 16) {
tt1 += F(a, b, c);
tt2 += F(e, f, g);
} else {
tt1 += FF(a, b, c);
tt2 += GG(e, f, g);
}
d = c;
c = ROTL32(b, 9);
b = a;
a = tt1;
h = g;
g = ROTL32(f, 19);
f = e;
e = P0(tt2);
}

// Davies–Meyer idea for compression function
for (i=0; i<8; i++) {
ctx->s.w[i] ^= s[i];
}
#undef a
#undef b
#undef c
#undef d
#undef e
#undef f
#undef g
#undef h
}


And assembly just to illustrate

%define _a dword[edi+0*4]
%define _b dword[edi+1*4]
%define _c dword[edi+2*4]
%define _d dword[edi+3*4]
%define _e dword[edi+4*4]
%define _f dword[edi+5*4]
%define _g dword[edi+6*4]
%define _h dword[edi+7*4]

_SM3_Transformx:

lea     esi, [ebx+state]
push    esi  ; save for later

; allocate 512 bytes
push    64
pop     ecx
shl     ecx, 3
sub     esp, ecx

; load state into local buffer
mov     edi, esp
push    8
pop     ecx
rep     movsd

; store message in big endian format
mov     cl, 16
s3t_l0:
lodsd
bswap   eax
stosd
loop    s3t_l0

; expand message
mov     cl, 68-16
s3t_l1:
; x = ROTL32(w[i-3], 15);
mov     eax, [edi- 3*4]
rol     eax, 15
; y = ROTL32(w[i-13], 7);
mov     ebx, [edi-13*4]
rol     ebx, 7
; x ^= w[i-16];
xor     eax, [edi-16*4]
; x ^= w[i-9];
xor     eax, [edi- 9*4]
; y ^= w[i-6];
xor     ebx, [edi- 6*4]
; x ^= ROTL32(x, 15) ^ ROTL32(x, 23);
mov     edx, eax
rol     edx, 15
xor     eax, edx
rol     edx, 23-15
xor     eax, edx
; x ^= y;
xor     eax, ebx
; w[i] = x;
stosd
loop    s3t_l1

; permute message
mov     edi, esp
s3t_l2:
; t  = (i < 16) ? 0x79cc4519 : 0x7a879d8a;
cmp     ecx, 16
sbb     eax, eax
and     eax, 0x79cc4519 - 0x7a879d8a
; ss2 = ROTL32(a, 12);
mov     ebx, _a
rol     ebx, 12
; ss1 = ROTL32(ss2 + e + ROTL32(t, i), 7);
rol     eax, cl
rol     eax, 7
; ss2 ^= ss1;
xor     ebx, eax
; tt1 = d + ss2 + (w[i] ^ w[i+4]);
mov     ebp, eax         ; save ss1
mov     esi, [edi+4*ecx+32]       ; esi = w[i]
mov     edx, esi         ; save w[i]
xor     esi, [edi+4*ecx+32+16]    ; esi ^= w[i+4]
lea     eax, [esi+ebx]   ; set tt1
; tt2 = h + ss1 + w[i];
lea     ebx, [edx+ebp]
; if (i < 16) {
cmp     ecx, 16
jae     s3t_l3
; tt1 += F(a, b, c);
mov     edx, _c
xor     edx, _b
xor     edx, _a
; tt2 += F(e, f, g);
mov     edx, _g
xor     edx, _f
xor     edx, _e
jmp     s3t_l4
s3t_l3:
; tt1 += FF(a, b, c);
mov     edx, _b
or      edx, _a
mov     ebp, _b
and     edx, _c
and     ebp, _a
or      edx, ebp
; tt2 += GG(e, f, g);
mov     edx, _g
xor     edx, _f
and     edx, _e
xor     edx, _g
s3t_l4:
; d = c;
mov     edx, _c
mov     _d, edx
; c = ROTL32(b, 9);
mov     edx, _b
rol     edx, 9
mov     _c, edx
; b = a;
mov     edx, _a
mov     _b, edx
; a = tt1;
mov     _a, eax
; h = g;
mov     edx, _g
mov     _h, edx
; g = ROTL32(f, 19);
mov     edx, _f
rol     edx, 19
mov     _g, edx
; f = e;
mov     edx, _e
mov     _f, edx
; e = P0(tt2);
; e = x ^ ROTL32(x,  9) ^ ROTL32(x, 17)
mov     edx, ebx
rol     edx, 9
xor     ebx, edx
rol     edx, 17-9
xor     ebx, edx
mov     _e, ebx

inc     ecx
cmp     ecx, 64
jnz     s3t_l2

mov     esi, esp
lea     esp, [esi+ecx*8]

pop     edi
mov     cl, 8
s3t_l5:
lodsd
xor     eax, [edi]
stosd
loop    s3t_l5

Thanks to 0x4d_ for $\LaTeX$ equations.