## SM4 Block Cipher (Chinese Standard for WAPI)

### Introduction

SM4 (formerly SMS4) is a 128-bit block cipher with a 128-bit user key and 32 rounds. It’s used in the WLAN Authentication and Privacy Infrastructure (WAPI), a Chinese WLAN national standard.

It was published or rather declassified in 2006 and standardized in 2012.

I’m not clear what the SM stands for. There are other specifications like SM2 which describes Elliptic Curve algorithms for digital signatures and key exchange. SM3 which is a cryptographic hash algorithm and of course, SM4 a symmetric block cipher which I’ll implement here in C and x86 assembly optimized for size.

The only other specification to mention is SM9 which documents a set of identity-based cryptographic schemes from pairings.

English translations of the specifications for SM2, SM3 have been provided by Sean Shen and Xiaodong Lee at China Internet Network Information Center (CNNIC) a non-profit organization based in China.

But the only english translation for SM4 was by Whitfield Diffie and Dr. George Ledin.

If you want high performance implementations of SM4 in C and x86/x86-x64 assembly, please look at GMSSL which appears to be a fork of OpenSSL but includes SM2, SM3 SM4 and SM9.

### Mixer-substitution T

T is a substitution that generates 32 bits of output from 32 bits of input. Source code includes the full 256-byte array which obviously increases the size of code considerably.

• Non-Linear function

$(b_0, b_1, b_2, b_3) = \tau (A) = (Sbox(a_0),Sbox(a_1),Sbox(a_2),Sbox(a_3))$

The Sbox is a 256-byte array and there’s no description of how these elements were chosen. If anyone knows, please leave a comment.

• Linear function (for key setup)

$L'(B) = B \oplus (B \lll 13)\oplus (B \lll 23)$

• Linear function (for encryption)

$C=L(B)=B\oplus(B\lll 2)\oplus (B\lll 10)\oplus (B\lll 18)\oplus(B\lll 24)$

static const uint8_t S[256] = {
0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 };

// Mixer-substitution T
//
// T is a substitution that generates 32 bits from 32 bits T : Z32
// This substitution is a reversible process.
// It consists of linear and non-linear substitution operations.

uint32_t T(uint32_t x, int t)
{
int   i;
w32_t u;

u.w = x;

// apply non-linear substitution
for (i=0; i<4; i++) {
u.b[i] = S[u.b[i]];
}
// apply linear substitution
if (t==0)
{
// for key setup
return u.w ^ ROTL32(u.w, 13) ^
ROTL32(u.w, 23);
} else {
// for encryption
return u.w ^ ROTL32(u.w,  2) ^
ROTL32(u.w, 10) ^
ROTL32(u.w, 18) ^
ROTL32(u.w, 24);
}
}


The assembly sticks close to the original in C.

t_l0:
pop    ebx
xor    eax, x1
xor    eax, x2
xor    eax, x3
; apply non-linear substitution
mov    cl, 4
t_l1:
xlatb
ror    eax, 8
loop   t_l1
mov    ebx, eax
mov    ecx, eax
mov    edx, eax
mov    ebp, eax
; apply linear substitution
popfd
jc     t_l2
; for key setup
rol    ebx, 13
rol    ecx, 23
xor    eax, ebx
xor    eax, ecx
jmp    t_l3
t_l2:
; for encryption
rol    ebx, 2
rol    ecx, 10
rol    edx, 18
rol    ebp, 24

xor    eax, ebx
xor    eax, ecx
xor    eax, edx
xor    eax, ebp
t_l3:
mov    [esp+_eax], eax
ret

; in:  eax
; out: eax
T_function:
pushfd
call   t_l0  ; pushes address of sbox on stack
; sbox for SM4 goes here


### The round function F

$F(X_0,X_1,X_2,X_3,rk)=X_0\oplus T(X_1\oplus X_2\oplus X_3\oplus rk)$

The value of 1 in last parameter to T function tells it to use the linear function for encryption. In x86 assembly, I use the Carry Flag (CF) setting or clearing with STC and CLC instructions.

// The round function F
//
// This algorithm uses a nonlinear substitution structure,
// encrypting 32 bits at a time. This is called a one-round exchange.

uint32_t F(uint32_t x0, uint32_t x1,
uint32_t x2, uint32_t x3, uint32_t rk)
{
return x0 ^ T(x1 ^ x2 ^ x3 ^ rk, 1);
}


### The constant parameter CK

$(ck_{i,0},ck_{i,1},ck_{i,2},ck_{i,3}) \in \Big(Z_2^8\Big)^4,\quad then\quad ck_{ij}=(4i+j)\times 7\: (mod \: 256)$

You can include the precomputed array, or generate at runtime using the algorithm.

// The constant parameter CK
//
// generate CK constant for index i
#ifdef TABLE
uint32_t CK[]=
{ 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9,
0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229,
0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299,
0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 };
#else
uint32_t CK(int i)
{
int      j;
uint32_t ck=0;

for (j=0; j<4; j++) {
ck <<= 8;
ck |= U8V((((i << 2) + j) * 7));
}
return ck;
}
#endif


The assembly computes at runtime, since it requires less ROM.

; expects ecx to hold index
; returns constant in eax
CK:
xor    eax, eax          ; ck = 0
cdq                      ; j  = 0
ck_l0:
shl    eax, 8            ; ck <<= 8
lea    ebx, [ecx*4+edx]  ; ebx = (i*4) + j
imul   ebx, ebx, 7       ; ebx *= 7
or     al, bl            ; ck |= ebx %= 256
inc    edx               ; j++
cmp    edx, 4            ; j<4
jnz    ck_l0
mov    [esp+_eax], eax   ; return ck
ret


### Key setup

void sm4_setkey(sm4_ctx *c, const void *key, int enc) {
uint32_t t, rk0, rk1, rk2, rk3;
int      i;

rk0 = SWAP32(((uint32_t*)key)[0]);
rk1 = SWAP32(((uint32_t*)key)[1]);
rk2 = SWAP32(((uint32_t*)key)[2]);
rk3 = SWAP32(((uint32_t*)key)[3]);

// xor FK values
rk0 ^= 0xa3b1bac6; rk1 ^= 0x56aa3350;
rk2 ^= 0x677d9197; rk3 ^= 0xb27022dc;

// generate 32 sub keys
for (i=0; i<32; i++) {
#ifdef TABLE
rk0 ^= T(rk1 ^ rk2 ^ rk3 ^ CK[i], 0);
#else
rk0 ^= T(rk1 ^ rk2 ^ rk3 ^ CK(i), 0);
#endif
c->rk[i] = rk0;
XCHG(rk0, rk1);
XCHG(rk1, rk2);
XCHG(rk2, rk3);
}
// reverse the order of keys if decrypting
if (enc == SM4_DECRYPT) {
for (i = 0; i < 16; i++) {
XCHG(c->rk[i], c->rk[31 - i]);
}
}
}


Assembly code..

sm4_setkeyx:
_sm4_setkeyx:
mov    edi, [esp+32+4]  ; edi = ctx
mov    esi, [esp+32+8]  ; esi = 128-bit key
lodsd
bswap  eax
xchg   eax, rk0
lodsd
bswap  eax
xchg   eax, rk1
lodsd
bswap  eax
xchg   eax, rk2
lodsd
bswap  eax
xchg   eax, rk3

; xor FK values
xor    rk0, 0xa3b1bac6
xor    rk1, 0x56aa3350
xor    rk2, 0x677d9197
xor    rk3, 0xb27022dc
xor    ecx, ecx
sk_l1:
call   CK
clc
call   T_function
xor    rk0, eax
mov    eax, rk0
stosd                ; rk[i] = rk0
xchg   rk0, rk1
xchg   rk1, rk2
xchg   rk2, rk3
inc    ecx           ; i++
cmp    ecx, 32
jnz    sk_l1
ret


### Encryption

// swapping bytes *shouldn't* affect the cipher

void sm4_encrypt(sm4_ctx *c, void *buf)
{
uint32_t x0, x1, x2, x3, i, t;
uint32_t *rk=c->rk;
uint32_t *x=(uint32_t*)buf;

x0 = SWAP32(x[0]); x1 = SWAP32(x[1]);
x2 = SWAP32(x[2]); x3 = SWAP32(x[3]);

for (i=0; i<32; i++) {
x0 = F(x0, x1, x2, x3, rk[i]);
XCHG(x0, x1);
XCHG(x1, x2);
XCHG(x2, x3);
}
x[0] = SWAP32(x3); x[1] = SWAP32(x2);
x[2] = SWAP32(x1); x[3] = SWAP32(x0);
}


There’s swapping of bytes between big-endian/little-endian numbering convention which shouldn’t affect the security of ciphertext.

sm4_encryptx:
_sm4_encryptx:
mov    edi, [esp+32+4] ; edi = ctx
mov    esi, [esp+32+8] ; esi = buf
push   esi ; save buffer for later

lodsd
bswap  eax
xchg   eax, x0
lodsd
bswap  eax
xchg   eax, x1
lodsd
bswap  eax
xchg   eax, x2
lodsd
bswap  eax
xchg   eax, x3

; do 32 rounds
push   32
pop    ecx
e_l0:
; apply F round
mov    eax, [edi] ; rk[i]
scasd
stc
call   T_function
xor    x0, eax
xchg   x0, x1
xchg   x1, x2
xchg   x2, x3
loop   e_l0

; save data
pop    edi
xchg   eax, x3
bswap  eax
stosd
xchg   eax, x2
bswap  eax
stosd
xchg   eax, x1
bswap  eax
stosd
xchg   eax, x0
bswap  eax
stosd
ret


### Summary

If there was a way to generate the sbox on the fly, this could be a good cipher for resource constrained devices. The size of code using /O2 /Os switches resulted in 690 bytes. The assembly for just encryption is approx. 500 bytes.

If you want to contribute or just access full source code, see here

Thanks to 0x4d_ for $\LaTeX$ formulas.