Asmcodes: ChaCha20 Stream cipher


ChaCha or Snuffle 2008 is a stream cipher developed by mathematician and cryptologist Daniel J. Bernstein.

Published in 2008, it’s designed for high performance in software implementations. It is compact, uses few resources and inexpensive operations (ADD/ROTATE/XOR) that makes it suitable for a wide range of architectures. It prevents leakage of information through side channel analysis, has a simple and fast key setup and provides good overall performance.

The author has this to say from his paper ChaCha, a variant of Salsa20

ChaCha8 is a 256-bit stream cipher based on the 8-round cipher Salsa20/8. The changes from Salsa20/8 to ChaCha8 are designed to improve diffusion per round, conjecturally increasing resistance to cryptanalysis, while preserving—and often improving—time per round. ChaCha12 and ChaCha20 are analogous modifications of the 12-round and 20-round ciphers Salsa20/12 and Salsa20/20. This paper presents the ChaCha family and explains the differences between Salsa20 and ChaCha.

The variant of ChaCha pressented here uses 20 rounds.

Google have published details in RFC7539 how they plan to use ChaCha20 for symmetric encryption along with Poly1305 for authentication.

The implementation in C was only tested on little-endian architecture and main difference between this code and specifications set out by Google in RFC7539 is that Google use a 96-bit nonce/iv and 32-bit counter. Here we use 64-bit counter and 64-bit nonce/iv.

The assembly is a joint effort between Peter Ferrie and myself. Currently, it’s approx. 243 bytes but may shrink in future.

Key Expansion

Google have decided to use 96-bit nonces and 256-bit keys. Note from RFC7539 section 2.3

The ChaCha algorithm described here uses a 256-bit key. The original algorithm also specified 128-bit keys and 8- and 12-round variants, but these are out of scope for this document.

Note also that the original ChaCha had a 64-bit nonce and 64-bit block count. We have modified this here to be more consistent with recommendations in Section 3.2 of [RFC5116]. This limits the use of a single (key,nonce) combination to 2^32 blocks, or 256 GB, but that is enough for most uses. In cases where a single key is used by multiple senders, it is important to make sure that they don’t use the same nonces. This can be assured by partitioning the nonce space so that the first 32 bits are unique per sender, while the other 64 bits come from a counter.

As said, we’re still using a 64-bit nonce/iv and 64-bit counter.

The following is translation of above with optimizations for size.

; void cc_setkey(chacha_ctx *ctx, void *key, void *iv)
    mov    edi, [esp+32+4] ; ctx
    ; copy "expand 32-byte k" into state
    call   load_iv
    dd     061707865h, 03320646Eh
    dd     079622D32h, 06B206574h
    pop    esi
    push   4
    pop    ecx
    rep    movsd
    ; copy 256-bit key
    mov    esi, [esp+32+8] ; key
    mov    cl, 8
    rep    movsd
    ; set 64-bit block counter to zero
    xchg   ecx, eax
    ; store 64-bit nonce
    mov    esi, [esp+32+12] ; iv

The Quarter-Round

The original reference implementation and values which is designed to be parallelizable uses macros to speed up computation. The following is ChaCha8 variant.

void salsa20_wordtobyte(u8 output[64],
  const u32 input[16])
  u32 x[16];
  int i;

  // copy state to local
  for (i = 0;i < 16;++i) 
    x[i] = input[i];
  // do 8 rounds
  for (i = 8;i > 0;i -= 2) {
    QUARTERROUND( 0, 4, 8,12)
    QUARTERROUND( 1, 5, 9,13)
    QUARTERROUND( 2, 6,10,14)
    QUARTERROUND( 3, 7,11,15)
    QUARTERROUND( 0, 5,10,15)
    QUARTERROUND( 1, 6,11,12)
    QUARTERROUND( 2, 7, 8,13)
    QUARTERROUND( 3, 4, 9,14)
  for (i = 0;i < 16;++i) 
    x[i] = PLUS(x[i],input[i]);

  for (i = 0;i < 16;++i) 
    U32TO8_LITTLE(output + 4 * i,x[i]);

Since the main objective is to reduce space used, QUARTERROUND is converted to function instead and each of the index values are stored as 8 16-bit integers instead of 32 bytes.

// 16-bit integers of each index
  uint16_t idx16[8]=
  { 0xC840, 0xD951, 0xEA62, 0xFB73, 
    0xFA50, 0xCB61, 0xD872, 0xE943 };

For each round, 4 index values are extracted. This is possible because the x array is 16 words. So for example, 0xC840 represents 0,4,8,12 when shifted 4 bits 4 times.

The rotate values can also be stored as 0x07080C10 and when shifted 8-bits 4 times represents 16,12,8,7

%define a eax
%define b edx
%define c esi
%define d edi
%define x edi

%define t0 ebp

; void chacha_permute(chacha_blk *blk, uint16_t index)
    push   a
    xchg   ah, al
    aam    16

    movzx  ebp, ah
    movzx  c, al

    pop    a
    aam    16

    movzx  b, ah
    movzx  a, al

    lea    a, [x+a*4]
    lea    b, [x+b*4]
    lea    c, [x+c*4]
    lea    d, [x+ebp*4]
    ; load ecx with rotate values
    ; 16, 12, 8, 7
    mov    ecx, 07080C10h
    mov    t0, [b]
    xor    ebx, ebx
    ; x[a] = PLUS(x[a],x[b]);
    add    t0, [a]
    mov    [a], t0
    ; x[d] = ROTATE(XOR(x[d],x[a]),cl);
    ; also x[b] = ROTATE(XOR(x[b],x[c]),cl);
    xor    t0, [d]
    rol    t0, cl
    mov    [d], t0
    xchg   cl, ch
    xchg   c, a
    xchg   d, b
    dec    ebx
    jp     q_l2
    ; --------------------------------------------
    shr    ecx, 16
    jnz    q_l1


Generating stream of bytes

The core function uses 20 rounds by default.

; void cc20_streamx (chacha_ctx *ctx, void *in, uint32_t len)
; do not call directly
; expects state in ebx, length in eax, input in edx

    ; copy state to edi
    push   64
    pop    ecx
    mov    ebx, esi
    rep    movsb

    pop    edi
    push   edi
    push   20/2  ; 20 rounds
    pop    ebp
    ; load indexes
    call   load_idx
    dw     0c840H, 0d951H
    dw     0ea62H, 0fb73H
    dw     0fa50H, 0cb61H
    dw     0d872H, 0e943H
    pop    esi  ; pointer to indexes
    mov    cl, 8
    call   chacha_permute
    loop   e_l2
    dec    ebp
    jnz    e_l1

    ; add state to x
    mov    cl, 16
    mov    eax, [ebx+ecx*4-4]
    add    [edi+ecx*4-4], eax
    loop   add_state

    ; update block counter
    adc    dword[ebx+12*4], ecx
    adc    dword[ebx+13*4], ecx

    ; restore registers

Encryption/Decryption Function

    lea     esi, [esp+32+4]
    xchg    ecx, eax          ; ecx = length
    xchg    ebx, eax          ; ebx = buf
    xchg    esi, eax          ; esi = ctx
    mov     edi, esp          ; edi = stream[64]
    xor     eax, eax
    jecxz   cc_l2             ; exit if len==0
    call    cc20_streamx
    mov     dl, byte[edi+eax]
    xor     byte[ebx+eax], dl
    inc     eax
    cmp     al, 64
    loopnz  cc_l1
    add     ebx, eax
    jmp     cc_l0


architecture size in bytes
x86 230
x64 270

Please refer to here and here for future reference.

This entry was posted in assembly, cryptography, encryption and tagged , , , , , . Bookmark the permalink.

3 Responses to Asmcodes: ChaCha20 Stream cipher

  1. Pingback: Asmcodes: Twofish-256 | Odzhan

  2. Pingback: Asmcodes: Salsa20 Stream Cipher | x86 crypto

  3. Pingback: Asmcodes: Twofish-256 | x86 crypto

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s