SHA-3 / Keccak cryptographic hash

Introduction

Keccak is a cryptographic hash function designed by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche.

It was selected by NIST to become the SHA-3 standard and is a complement to SHA-2 rather than a replacement.

Currently, SHA-2 is widely used and not known publicly to have any weaknesses. Since SHA-2 is very similar to MD4, MD5 and SHA-1 in design, weaknesses may be identified in the future and SHA-3 is intended to be a drop in replacement.

When considering to choose a MAC function, SHA-3 has the advantage of not requiring HMAC like others mentioned above. This is because SHA-3 by itself is not vulnerable to Length Extension attack. To use SHA-3 as a MAC, you simply prepend the key to each message; just another reason to consider SHA-3 over SHA-2

The following x86 assembly implementation was a joint effort between Peter Ferrie and myself. It doesn’t contain the extensible output functions and is deliberately optimized for size rather than speed.

Initialization

The nice thing about Keccak/SHA-3 (from my own perspective) is that it doesn’t use any kind of initialization values. Prior hash function designs which were mostly derived from MD4, written by Ron Rivest in the early 1990s all use some kind of initial values for the state. The permutation functions also use constants which can be computed on the fly but tend to require a lot more space.

The only steps performed here are to zero intialize the state buffer and set output/input lengths.

void SHA3_Init (SHA3_CTX *ctx, int mdlen)
{
  uint32_t i;
  
  ctx->outlen = mdlen;
  ctx->buflen = 200 - (2 * mdlen);
  ctx->index  = 0;
  
  for (i=0; i<SHA3_STATE_LEN; i++) {
    ctx->state.v64[i] = 0;
  }
}

The assembly code is translation of above and tries to minimize amount of code used.

_SHA3_Initx:
    pushad
    mov    edi, [esp+32+4]    ; context
    mov    eax, [esp+32+8]    ; outlen
    stosd                     ; ctx->outlen=outlen
    add    eax, eax           ; *= 2
    push   (SHA3_STATE_LEN*8)/2
    pop    ecx
    add    ecx, ecx           ; ecx=200
    neg    eax                ; negate
    add    eax, ecx           ; add 200
    stosd                     ; buflen = 200 - (2 * outlen)
    xor    eax, eax
    stosd                     ; index=0
    rep    stosb              ; zero the state buffer
    popad
    ret

Updating

Sponge construction is used for “absorbing” data into the “sponge” / state. Blocks of data are xor’d with the state buffer before being transformed using the permutation function.

The authors have this to say.

the sponge construction is a mode of operation, based on a fixed-length permutation (or transformation) and on a padding rule, which builds a function mapping variable-length input to variable-length output. Such a function is called a sponge function.

Sponge functions are a very clever design and can be used for many other applications apart from hash functions.

The stream cipher Spritz for example also uses sponge functions and can be used for PRNG and AEAD.

I’m not endorsing Spritz here, but for more information, read Spritz: A New RC4-Like Stream Cipher

void SHA3_Update (SHA3_CTX* ctx, void *in, uint32_t inlen)
{
  uint32_t i;
  
  // update buffer and state
  for (i=0; i<inlen; i++) {    
    if (ctx->index == ctx->buflen) {
      SHA3_Transform (ctx);
      ctx->index = 0;
    }
    // absorb byte into state
    ctx->state.v8[ctx->index++] ^= ((uint8_t*)in)[i];
  }
}

The assembly code again tries to minimize code used.

_SHA3_Updatex:
    pushad
    lea    esi, [esp+32+4]
    lodsd
    push   eax               ; save ctx
    lodsd
    xchg   ebx, eax          ; ebx = input
    lodsd
    xchg   ecx, eax          ; ecx = len
    pop    esi               ; esi = ctx
    jecxz  upd_l2
    
    lodsd                    ; skip ctx->outlen
    lodsd                    ;
    xchg   eax, ebp          ; ebp = ctx->buflen
    push   esi               ; save ptr to ctx->index
    lodsd                    ; eax = ctx->index
upd_l0:
    cmp    eax, ebp          ; buffer full?
    jne    upd_l1
    call   SHA3_Transform    ; compress, expects ctx->state in esi
    xor    eax, eax          ; index = 0
upd_l1:
    mov    dl, [ebx]         ; absorb byte
    inc    ebx    
    xor    byte[esi+eax], dl
    inc    eax               ; increase index
    loop   upd_l0
    pop    edi
    stosd
upd_l2:
    popad
    ret

Finalize

void SHA3_Final (void* out, SHA3_CTX* ctx)
{
  uint32_t i;
  // absorb 3 bits, Keccak uses 1
  ctx->state.v8[ctx->index] ^= 6;
  // absorb end bit
  ctx->state.v8[ctx->buflen-1] ^= 0x80;
  // update context
  SHA3_Transform (ctx);
  // copy digest to buffer
  for (i=0; i<ctx->outlen; i++) {
    ((uint8_t*)out)[i] = ctx->state.v8[i];
  }
}

The assembly

_SHA3_Finalx:
    pushad
    mov    esi, [esp+32+8]      ; esi=ctx
    lodsd
    xchg   ecx, eax             ; ecx=ctx->outlen
    lodsd
    xchg   edx, eax             ; edx=ctx->buflen
    lodsd                       ; eax=ctx->index
    xor    byte[esi+eax], 6     ; ctx->state.v8[ctx->index] ^= 6;
    xor    byte[esi+edx-1], 80h ; ctx->state.v8[ctx->buflen-1] ^= 0x80;
    call   SHA3_Transform       ; SHA3_Transform (ctx->state);
    mov    edi, [esp+32+4]      ; edi=out
    rep    movsb                ; memcpy (out, ctx->state.v8, ctx->outlen);
    popad
    ret

Since the SHA-3 state uses 25 64-bit words, implementing a version in assembly using only 8 32-bit general purpose registers available is doable but not really necessary unless you were targeting systems released before 1998.

The round contants can be precomputed in advance, but to save space, the LFSR function is used instead.

// Primitive polynomial over GF(2): x^8+x^6+x^5+x^4+1
uint64_t rc (uint8_t *LFSR)
{
  uint64_t c;
  uint32_t i, t;

  c = 0;
  t = *LFSR;
  
  for (i=1; i<128; i += i) 
  {
    if (t & 1) {
      c ^= (uint64_t)1ULL << (i - 1);
    }
    t = (t & 0x80) ? (t << 1) ^ 0x71 : t << 1;
  }
  *LFSR = (uint8_t)t;
  return c;
}

Compression

This function was originally implemented by Markku-Juhani O. Saarinen. The only thing I’ve done is remove the round constants and replace with LFSR function above.

const uint8_t keccakf_piln[24] = 
{ 10, 7,  11, 17, 18, 3, 5,  16, 8,  21, 24, 4, 
  15, 23, 19, 13, 12, 2, 20, 14, 22, 9,  6,  1  };

#define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))

void SHA3_Transform (SHA3_CTX *ctx)
{
  uint32_t i, j, rnd, r;
  uint64_t t, bc[5];
  uint64_t *st=(uint64_t*)ctx->state.v64;
  uint8_t lfsr=1;
  
  for (rnd=0; rnd<SHA3_ROUNDS; rnd++) 
  {
    // Theta
    for (i=0; i<5; i++) {     
      bc[i] = st[i] 
            ^ st[i + 5] 
            ^ st[i + 10] 
            ^ st[i + 15] 
            ^ st[i + 20];
    }
    for (i=0; i<5; i++) {
      t = bc[(i + 4) % 5] ^ ROTL64(bc[(i + 1) % 5], 1);
      for (j=0; j<25; j+=5) {
        st[j + i] ^= t;
      }
    }

    // Rho Pi
    t = st[1];
    for (i=0, r=0; i<24; i++) {
      r += i + 1;
      j = keccakf_piln[i];
      bc[0] = st[j];
      st[j] = ROTL64(t, r & 63);
      t = bc[0];
    }

    // Chi
    for (j=0; j<25; j+=5) {
      for (i=0; i<5; i++) {
        bc[i] = st[j + i];
      }
      for (i=0; i<5; i++) {
        st[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
      }
    }
    // Iota
    st[0] ^= rc(&lfsr);
  }
}

The assembly code uses MMX instructions since using 32-bit code alone would require too much space.

SHA3_Transform:
    pushad
    
    ; set up workspace
    sub    esp, 8*5
    mov    _bc, esp
    xor    r, r
    
    push   1
    pop    lfsr

    call   s3_l2
sha3_mod5:
    db 0, 1, 2, 3, 4, 0, 1, 2, 3, 4
sha3_piln:
    db 10, 7,  11, 17, 18, 3, 5,  16, 8,  21, 24, 4 
    db 15, 23, 19, 13, 12, 2, 20, 14, 22, 9,  6,  1  
s3_l2:
    push   r
    push   lfsr
    ; Theta
    ; for (i = 0; i < 5; i++)     
    ;   bc[i] = st[i + 0 ] ^ 
    ;           st[i + 5 ] ^ 
    ;           st[i + 10] ^ 
    ;           st[i + 15] ^ 
    ;           st[i + 20]; 
    xor    eax, eax
s3_l3:
    movq    t, [_st+8*eax+20*8]
    pxor    t, [_st+8*eax+15*8]
    pxor    t, [_st+8*eax+10*8]
    pxor    t, [_st+8*eax+ 5*8]
    pxor    t, [_st+8*eax     ]
    movq    [_bc+8*eax        ], t
    inc     eax
    cmp     al, 5
    jnz     s3_l3
      
    ; for (i = 0; i < 5; i++) {
    ;   t = bc[(i + 4) % 5] ^ ROTL64(bc[(i+1)%5], 1);
    ;   for (j = 0; j < 25; j += 5)
    ;     st[j + i] ^= t;
    ; }
    ; ************************************
    ; for (i = 0; i < 5; i++)
    xor    i, i
s3_l4:
    ; t = ROTL64(bc[(i + 1) % 5], 1)
    mov    eax, [esp+2*4]
    push   eax
    movzx  eax, byte [eax + i + 1]
    mov    edx, [_bc+8*eax]
    mov    ebp, [_bc+8*eax+4]
    add    edx, edx
    adc    ebp, ebp
    adc    dl, ih
    ; bc[(i + 4) % 5]
    pop    eax      ; keccakf_mod5
    movzx  eax, byte [eax + i + 4]
    xor    edx, [_bc+8*eax]
    xor    ebp, [_bc+8*eax+4]
    ; for (j = 0; j < 25; j += 5)
    xor    j, j
s3_l5:
    ; st[j + i] ^= t;
    lea    eax, [j+i]
    xor    [_st+8*eax], edx
    xor    [_st+8*eax+4], ebp
    add    j, 5
    cmp    j, 25
    jnz    s3_l5
    
    inc    i
    cmp    i, 5
    jnz    s3_l4
            
    ; // Rho Pi
    ; t = st[1];
    ; for (i = 0; i < 24; i++) {
    ;   j = keccakf_piln[i];
    ;   bc[0] = st[j];
    ;   st[j] = ROTL64(t, keccakf_rotc[i]);
    ;   t = bc[0];
    ; }
    ; *************************************
    ; t = st[1]
    mov    edx, [_st+8]
    mov    ebp, [_st+8+4]
    xor    i, i
    ; for (i = 0; i < 24; i++)
    push   i
s3_l6:
    push   ecx
    ; j = keccakf_piln[i];
    mov    eax, [esp+4*4]
    movzx  j, byte [eax + i + (sha3_piln - sha3_mod5)]
    mov    eax, [esp+4]
    lea    ecx, [eax+i+1]
    mov    [esp+4], ecx
    and    ecx, 63
    ;mov    cl, byte [eax + i + (sha3_rotc - sha3_piln)]
    ; st[j] = ROTL64(t, keccakf_rotc[i]);
s3_l06x:
    add    edx, edx
    adc    ebp, ebp
    adc    dl, ih
    loop   s3_l06x
    ; bc[0] = st[j];
    xchg   [_st+8*j], edx
    mov    [_bc], edx
    xchg   [_st+8*j+4], ebp
    mov    [_bc+4], ebp
    pop    ecx
    inc    i
    cmp    i, 24
    jnz    s3_l6
    pop    eax
      
    ; // Chi
    ; for (j = 0; j < 25; j += 5) {
    ;   for (i = 0; i < 5; i++)
    ;     bc[i] = st[j + i];
    ;   for (i = 0; i < 5; i++)
    ;     st[j + i] ^= (~bc[(i+1)%5]) & bc[(i+2)%5];
    ; }
    ; *********************************
    ; for (j=0; j<25; j+=5)
    xor    j, j
s3_l7:
    ; for (i=0; i<5; i++)
    xor    i, i
s3_l8:
    ; bc[i] = st[j + i];
    lea    eax, [j+i]
    movq   t, [_st+8*eax]
    movq   [_bc+8*i], t
    inc    i
    cmp    i, 5
    jnz    s3_l8
        
    ; for (i=0; i<5; i++)
    xor    i, i
s3_l9:
    ; st[j + i] ^= (~bc[(i+1)%5]) & bc[(i+2)%5];
    mov    eax, [esp+2*4] ; keccakf_mod5
    push   eax 
    movzx  eax, byte [eax + i + 1]
    movq   t, [_bc+8*eax]
    pop    eax            ; keccakf_mod5 
    movzx  eax, byte [eax + i + 2]
    pandn  t, [_bc+8*eax]
    lea    eax, [j+i]
    pxor   t, [_st+8*eax]
    movq   [_st+8*eax], t
    inc    i
    cmp    i, 5
    jnz    s3_l9
    
    add    j, 5
    cmp    j, 25
    jnz    s3_l7
           
    pop    lfsr

    ; // Iota
    ; st[0] ^= keccakf_rndc[round];
    movq    mm4, [_st]
;;    call    rc
rc:
    pxor   mm0, mm0        ; c=0
    pxor   mm1, mm1        ; 
    push   1
    pop    eax             ; i=1
    movd   mm1, eax
rc_l00:
    test   dl, 1           ; (t & 1)
    jz     rc_l01
    ; ecx = (i - 1)
    lea    ecx, [eax-1]
    movd   mm2, ecx
    movq   mm3, mm1
    ; 1ULL << (i - 1)
    psllq  mm3, mm2
    pxor   mm0, mm3        ; c ^= 1ULL << (i - 1)
rc_l01:
    add    dl, dl          ; t += t
    jnc    rc_l02
    xor    dl, 71h
rc_l02:
    add    al, al          ; i += i
    jns    rc_l00
;;    ret
  
    pxor    mm4, t
    movq    [_st], mm4
    
    pop     r
    inc     r
    cmp     r, SHA3_ROUNDS
    jnz     s3_l2
    
    add    esp, 8*5 + 4
    popad
    ret

Results

architecture size
x86 461 (using MMX)
x64 n/a

Conclusion

In my humble opinion, it’s a much better design for implementation than any other cryptographic hash function out there. It can be used for MAC without requiring HMAC algorithm and is therefore much more compact.

I’ve already considered SHA extensions provided by Intel CPU.

Code may be updated in future but not shown here, please refer to sources here and here.

Advertisements
This entry was posted in assembly, cryptography, programming, security and tagged , , , , , , , , , . Bookmark the permalink.

One Response to SHA-3 / Keccak cryptographic hash

  1. Pingback: Shellcode: A Windows PIC using RSA-2048 key exchange, AES-256, SHA-3 | modexp

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s