SM4 Block Cipher (Chinese Standard for WAPI)


SM4 (formerly SMS4) is a 128-bit block cipher with a 128-bit user key and 32 rounds. It’s used in the WLAN Authentication and Privacy Infrastructure (WAPI), a Chinese WLAN national standard.

It was published or rather declassified in 2006 and standardized in 2012.

I’m not clear what the SM stands for. There are other specifications like SM2 which describes Elliptic Curve algorithms for digital signatures and key exchange. SM3 which is a cryptographic hash algorithm and of course, SM4 a symmetric block cipher which I’ll implement here in C and x86 assembly optimized for size.

The only other specification to mention is SM9 which documents a set of identity-based cryptographic schemes from pairings.

English translations of the specifications for SM2, SM3 have been provided by Sean Shen and Xiaodong Lee at China Internet Network Information Center (CNNIC) a non-profit organization based in China.

But the only english translation for SM4 was by Whitfield Diffie and Dr. George Ledin.

About the C code

If you want high performance implementations of SM4 in C and x86/x86-x64 assembly, please look at GMSSL which appears to be a fork of OpenSSL but includes SM2, SM3 SM4 and SM9.

Mixer-substitution T

T is a substitution that generates 32 bits of output from 32 bits of input. Source code includes the full 256-byte array which obviously increases the size of code considerably.

  • Non-Linear function

(b_0, b_1, b_2, b_3) = \tau (A) = (Sbox(a_0),Sbox(a_1),Sbox(a_2),Sbox(a_3))

The Sbox is a 256-byte array and there’s no description of how these elements were chosen. If anyone knows, please leave a comment.


  • Linear function (for key setup)

L'(B) = B \oplus (B \lll 13)\oplus (B \lll 23)

  • Linear function (for encryption)

C=L(B)=B\oplus(B\lll 2)\oplus (B\lll 10)\oplus (B\lll 18)\oplus(B\lll 24)

static const uint8_t S[256] = {
  0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, 
  0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
  0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, 
  0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
  0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, 
  0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
  0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, 
  0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
  0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, 
  0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
  0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, 
  0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
  0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, 
  0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
  0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, 
  0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
  0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, 
  0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
  0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, 
  0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
  0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, 
  0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
  0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, 
  0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
  0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, 
  0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
  0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, 
  0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
  0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, 
  0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
  0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, 
  0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 };

// Mixer-substitution T
// T is a substitution that generates 32 bits from 32 bits T : Z32
// This substitution is a reversible process. 
// It consists of linear and non-linear substitution operations.  

uint32_t T(uint32_t x, int t)
    int   i;
    w32_t u;
    u.w = x;
    // apply non-linear substitution
    for (i=0; i<4; i++) {
      u.b[i] = S[u.b[i]];
    // apply linear substitution
    if (t==0)
      // for key setup
      return u.w ^ ROTL32(u.w, 13) ^ 
                   ROTL32(u.w, 23);
    } else {
      // for encryption
      return u.w ^ ROTL32(u.w,  2) ^ 
                   ROTL32(u.w, 10) ^ 
                   ROTL32(u.w, 18) ^ 
                   ROTL32(u.w, 24);

The assembly sticks close to the original in C.

    pop    ebx
    xor    eax, x1
    xor    eax, x2
    xor    eax, x3
    ; apply non-linear substitution
    mov    cl, 4
    ror    eax, 8
    loop   t_l1
    mov    ebx, eax
    mov    ecx, eax
    mov    edx, eax
    mov    ebp, eax
    ; apply linear substitution
    jc     t_l2
    ; for key setup
    rol    ebx, 13
    rol    ecx, 23
    xor    eax, ebx
    xor    eax, ecx
    jmp    t_l3
    ; for encryption
    rol    ebx, 2
    rol    ecx, 10
    rol    edx, 18
    rol    ebp, 24
    xor    eax, ebx
    xor    eax, ecx
    xor    eax, edx
    xor    eax, ebp
    mov    [esp+_eax], eax    
; in:  eax
; out: eax  
    call   t_l0  ; pushes address of sbox on stack
    ; sbox for SM4 goes here

The round function F

F(X_0,X_1,X_2,X_3,rk)=X_0\oplus T(X_1\oplus X_2\oplus X_3\oplus rk)

The value of 1 in last parameter to T function tells it to use the linear function for encryption. In x86 assembly, I use the Carry Flag (CF) setting or clearing with STC and CLC instructions.

// The round function F
// This algorithm uses a nonlinear substitution structure, 
// encrypting 32 bits at a time. This is called a one-round exchange.

uint32_t F(uint32_t x0, uint32_t x1, 
    uint32_t x2, uint32_t x3, uint32_t rk)
    return x0 ^ T(x1 ^ x2 ^ x3 ^ rk, 1);

The constant parameter CK

(ck_{i,0},ck_{i,1},ck_{i,2},ck_{i,3}) \in \Big(Z_2^8\Big)^4,\quad then\quad ck_{ij}=(4i+j)\times 7\: (mod \: 256)

You can include the precomputed array, or generate at runtime using the algorithm.

// The constant parameter CK
// generate CK constant for index i
#ifdef TABLE
uint32_t CK[]=
{ 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
  0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
  0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
  0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9,
  0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229,
  0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299,
  0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209,
  0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 };
uint32_t CK(int i)
    int      j;
    uint32_t ck=0;

    for (j=0; j<4; j++) {
      ck <<= 8;
      ck |= U8V((((i << 2) + j) * 7));
    return ck;

The assembly computes at runtime, since it requires less ROM.

; expects ecx to hold index
; returns constant in eax
    xor    eax, eax          ; ck = 0
    cdq                      ; j  = 0
    shl    eax, 8            ; ck <<= 8
    lea    ebx, [ecx*4+edx]  ; ebx = (i*4) + j
    imul   ebx, ebx, 7       ; ebx *= 7
    or     al, bl            ; ck |= ebx %= 256
    inc    edx               ; j++
    cmp    edx, 4            ; j<4
    jnz    ck_l0
    mov    [esp+_eax], eax   ; return ck

Key setup

void sm4_setkey(sm4_ctx *c, const void *key, int enc) {
    uint32_t t, rk0, rk1, rk2, rk3; 
    int      i;    
    // load the key
    rk0 = SWAP32(((uint32_t*)key)[0]);
    rk1 = SWAP32(((uint32_t*)key)[1]);
    rk2 = SWAP32(((uint32_t*)key)[2]);
    rk3 = SWAP32(((uint32_t*)key)[3]);
    // xor FK values
    rk0 ^= 0xa3b1bac6; rk1 ^= 0x56aa3350;
    rk2 ^= 0x677d9197; rk3 ^= 0xb27022dc;
    // generate 32 sub keys
    for (i=0; i<32; i++) {
      #ifdef TABLE
        rk0 ^= T(rk1 ^ rk2 ^ rk3 ^ CK[i], 0);
        rk0 ^= T(rk1 ^ rk2 ^ rk3 ^ CK(i), 0);
      c->rk[i] = rk0;
      XCHG(rk0, rk1);
      XCHG(rk1, rk2);
      XCHG(rk2, rk3);
    // reverse the order of keys if decrypting
    if (enc == SM4_DECRYPT) {
      for (i = 0; i < 16; i++) {
        XCHG(c->rk[i], c->rk[31 - i]);

Assembly code..

    mov    edi, [esp+32+4]  ; edi = ctx
    mov    esi, [esp+32+8]  ; esi = 128-bit key
    ; load the key
    bswap  eax
    xchg   eax, rk0
    bswap  eax
    xchg   eax, rk1
    bswap  eax
    xchg   eax, rk2
    bswap  eax
    xchg   eax, rk3
    ; xor FK values
    xor    rk0, 0xa3b1bac6    
    xor    rk1, 0x56aa3350    
    xor    rk2, 0x677d9197    
    xor    rk3, 0xb27022dc
    xor    ecx, ecx
    call   CK
    call   T_function 
    xor    rk0, eax
    mov    eax, rk0
    stosd                ; rk[i] = rk0
    xchg   rk0, rk1
    xchg   rk1, rk2
    xchg   rk2, rk3
    inc    ecx           ; i++
    cmp    ecx, 32
    jnz    sk_l1       


// swapping bytes *shouldn't* affect the cipher

void sm4_encrypt(sm4_ctx *c, void *buf)
    uint32_t x0, x1, x2, x3, i, t;
    uint32_t *rk=c->rk;
    uint32_t *x=(uint32_t*)buf;
    x0 = SWAP32(x[0]); x1 = SWAP32(x[1]);
    x2 = SWAP32(x[2]); x3 = SWAP32(x[3]);
    for (i=0; i<32; i++) {
      x0 = F(x0, x1, x2, x3, rk[i]);
      XCHG(x0, x1);
      XCHG(x1, x2);
      XCHG(x2, x3);
    x[0] = SWAP32(x3); x[1] = SWAP32(x2);
    x[2] = SWAP32(x1); x[3] = SWAP32(x0);

There’s swapping of bytes between big-endian/little-endian numbering convention which shouldn’t affect the security of ciphertext.

    mov    edi, [esp+32+4] ; edi = ctx
    mov    esi, [esp+32+8] ; esi = buf
    push   esi ; save buffer for later
    ; load data
    bswap  eax
    xchg   eax, x0
    bswap  eax    
    xchg   eax, x1
    bswap  eax    
    xchg   eax, x2
    bswap  eax    
    xchg   eax, x3
    ; do 32 rounds
    push   32
    pop    ecx
    ; apply F round
    mov    eax, [edi] ; rk[i]
    call   T_function     
    xor    x0, eax
    xchg   x0, x1
    xchg   x1, x2
    xchg   x2, x3
    loop   e_l0
    ; save data
    pop    edi
    xchg   eax, x3
    bswap  eax    
    xchg   eax, x2
    bswap  eax    
    xchg   eax, x1
    bswap  eax    
    xchg   eax, x0
    bswap  eax    


If there was a way to generate the sbox on the fly, this could be a good cipher for resource constrained devices. The size of code using /O2 /Os switches resulted in 690 bytes. The assembly for just encryption is approx. 500 bytes.

If you want to contribute or just access full source code, see here

Thanks to 0x4d_ for \LaTeX formulas.

This entry was posted in assembly, cryptography, encryption, programming and tagged , , , , , , . Bookmark the permalink.

3 Responses to SM4 Block Cipher (Chinese Standard for WAPI)

  1. Pingback: Asmcodes: SM3 Cryptographic Hash Algorithm (Chinese Standard) | x86 crypto

  2. Usepr Xf says:

    SM is the abbreviation of the chinese word 商密 which stands for commercial cipher.

    Liked by 2 people

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s