Asmcodes: ChaCha20 Stream cipher


ChaCha, also known as Snuffle 2008, is a stream cipher developed by the mathematician and cryptologist Daniel J. Bernstein.

Published in 2008, it's designed for high performance in software implementations. It is compact, uses few resources, and relies on inexpensive operations (ADD/ROTATE/XOR), which makes it suitable for a wide range of architectures. It prevents leakage of information through side-channel analysis, has a simple and fast key setup, and provides good overall performance.
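As a rough illustration (the helper names `rotl32` and `arx_step` are mine, not from any implementation discussed here), the three primitive operations can be sketched in C as:

```c
#include <stdint.h>

// 32-bit left rotate; compilers typically lower this pattern
// to a single ROL instruction on x86
static uint32_t rotl32(uint32_t v, unsigned n) {
  return (v << n) | (v >> (32 - n));
}

// one ARX step as it appears in the quarter-round:
// add two words, xor the sum into a third, then rotate
static uint32_t arx_step(uint32_t a, uint32_t b, uint32_t d, unsigned r) {
  a += b;              // ADD
  d ^= a;              // XOR
  return rotl32(d, r); // ROTATE
}
```

Every step of the cipher is built from these three operations, all of which run in constant time on common CPUs.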

The author has this to say in his paper, ChaCha, a variant of Salsa20:

ChaCha8 is a 256-bit stream cipher based on the 8-round cipher Salsa20/8. The changes from Salsa20/8 to ChaCha8 are designed to improve diffusion per round, conjecturally increasing resistance to cryptanalysis, while preserving—and often improving—time per round. ChaCha12 and ChaCha20 are analogous modifications of the 12-round and 20-round ciphers Salsa20/12 and Salsa20/20. This paper presents the ChaCha family and explains the differences between Salsa20 and ChaCha.

The variant of ChaCha presented here uses 20 rounds.

Google has published details in RFC 7539 of how it plans to use ChaCha20 for symmetric encryption, along with Poly1305 for authentication.

The implementation in C was only tested on a little-endian architecture. The main difference between this code and the specification set out in RFC 7539 is that Google uses a 96-bit nonce/IV and a 32-bit counter; here we use a 64-bit counter and a 64-bit nonce/IV.
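To make that layout difference concrete, here is a hedged sketch (helper names are mine; word positions follow the code later in the article):

```c
#include <stdint.h>

// State is 16 32-bit words. Under the layout used here:
//   words  0-3  : "expand 32-byte k" constants
//   words  4-11 : 256-bit key
//   words 12-13 : 64-bit block counter
//   words 14-15 : 64-bit nonce/iv
// (RFC 7539 instead uses word 12 as a 32-bit counter and
//  words 13-15 as a 96-bit nonce.)

// 64-bit counter increment across words 12 and 13
static void bump_counter(uint32_t state[16]) {
  if (!++state[12]) {
    state[13]++;  // carry into the high word at the 2^32 boundary
  }
}

// helper for checking the carry: high counter word after one increment
static uint32_t ctr_hi_after(uint32_t lo, uint32_t hi) {
  uint32_t s[16] = {0};
  s[12] = lo;
  s[13] = hi;
  bump_counter(s);
  return s[13];
}
```

The 64-bit counter is why this variant has no practical per-nonce data limit, whereas RFC 7539's 32-bit counter caps a single (key, nonce) pair at 2^32 blocks (256 GB).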

The assembly is a joint effort between Peter Ferrie and myself. Currently, it's approximately 243 bytes but may shrink in the future.

Key Expansion

Google has decided to use 96-bit nonces and 256-bit keys. A note from RFC 7539, section 2.3:

The ChaCha algorithm described here uses a 256-bit key. The original algorithm also specified 128-bit keys and 8- and 12-round variants, but these are out of scope for this document.

Note also that the original ChaCha had a 64-bit nonce and 64-bit block count. We have modified this here to be more consistent with recommendations in Section 3.2 of [RFC5116]. This limits the use of a single (key,nonce) combination to 2^32 blocks, or 256 GB, but that is enough for most uses. In cases where a single key is used by multiple senders, it is important to make sure that they don’t use the same nonces. This can be assured by partitioning the nonce space so that the first 32 bits are unique per sender, while the other 64 bits come from a counter.

As noted, we're still using a 64-bit nonce/IV and a 64-bit counter.

// setup the key
void cc_setkey(chacha_ctx *ctx, void *key, void *nonce)
{
  chacha_blk *ivec=(chacha_blk*)nonce;
  int        i;

  // "expand 32-byte k"
  ctx->state.v32[0] = 0x61707865;
  ctx->state.v32[1] = 0x3320646E;
  ctx->state.v32[2] = 0x79622D32;
  ctx->state.v32[3] = 0x6B206574;

  // copy 256-bit key
  for (i=0; i<32; i++) {
    ctx->state.v8[16+i] = ((uint8_t*)key)[i];
  }
  // set 64-bit block counter and 64-bit nonce
  ctx->state.v32[12] = 0;
  ctx->state.v32[13] = 0;
  ctx->state.v32[14] = ivec->v32[0];
  ctx->state.v32[15] = ivec->v32[1];
}

The following is a translation of the above, optimized for size.

; void cc_setkey(chacha_ctx *ctx, void *key, void *iv)
    mov    edi, [esp+32+4] ; ctx
    ; copy "expand 32-byte k" into state
    call   load_iv
    dd     061707865h, 03320646Eh
    dd     079622D32h, 06B206574h
load_iv:
    pop    esi
    push   4
    pop    ecx
    rep    movsd
    ; copy 256-bit key
    mov    esi, [esp+32+8] ; key
    mov    cl, 8
    rep    movsd
    ; set 64-bit block counter to zero
    xchg   ecx, eax
    stosd
    stosd
    ; store 64-bit nonce
    mov    esi, [esp+32+12] ; iv
    movsd
    movsd

The Quarter-Round

The original reference implementation, which is designed to be parallelizable, uses macros to speed up computation. The following is the ChaCha8 variant.

void salsa20_wordtobyte(u8 output[64],
  const u32 input[16])
{
  u32 x[16];
  int i;

  // copy state to local
  for (i = 0;i < 16;++i) 
    x[i] = input[i];
  // do 8 rounds
  for (i = 8;i > 0;i -= 2) {
    QUARTERROUND( 0, 4, 8,12)
    QUARTERROUND( 1, 5, 9,13)
    QUARTERROUND( 2, 6,10,14)
    QUARTERROUND( 3, 7,11,15)
    QUARTERROUND( 0, 5,10,15)
    QUARTERROUND( 1, 6,11,12)
    QUARTERROUND( 2, 7, 8,13)
    QUARTERROUND( 3, 4, 9,14)
  }
  // add input to x
  for (i = 0;i < 16;++i) 
    x[i] = PLUS(x[i],input[i]);
  // store little-endian
  for (i = 0;i < 16;++i) 
    U32TO8_LITTLE(output + 4 * i,x[i]);
}

Since the main objective is to reduce space, QUARTERROUND is converted to a function instead, and the index values are stored as 8 16-bit integers instead of 32 bytes.

// the 16 quarter-round indices packed as 8 16-bit integers
uint16_t idx16[8]=
{ 0xC840, 0xD951, 0xEA62, 0xFB73, 
  0xFA50, 0xCB61, 0xD872, 0xE943 };

For each round, 4 index values are extracted. This is possible because the x array has only 16 elements, so each index fits in a 4-bit nibble. For example, 0xC840 represents 0, 4, 8, 12 when shifted right 4 bits at a time.

// transforms block using ARX instructions
void QUARTERROUND(chacha_blk *blk, uint16_t index) 
{
  uint32_t a, b, c, d;
  uint32_t *x=(uint32_t*)&blk->v8;

  a = (index         & 0xF);
  b = ((index >>  4) & 0xF);
  c = ((index >>  8) & 0xF);
  d = ((index >> 12) & 0xF);

  x[a] = x[a] + x[b]; 
  x[d] = ROTL32(x[d] ^ x[a],16);
  x[c] = x[c] + x[d]; 
  x[b] = ROTL32(x[b] ^ x[c],12);
  x[a] = x[a] + x[b]; 
  x[d] = ROTL32(x[d] ^ x[a], 8);
  x[c] = x[c] + x[d]; 
  x[b] = ROTL32(x[b] ^ x[c], 7);
}

The rotate values can also be stored as 0x07080C10, which when shifted right 8 bits at a time yields 16, 12, 8, 7.
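A quick sketch of both unpackings (helper names are mine; the constants are the ones used above):

```c
#include <stdint.h>

// decode one packed index word into the four quarter-round
// indices a, b, c, d (low nibble first)
static void unpack_idx(uint16_t w, unsigned out[4]) {
  for (int i = 0; i < 4; i++) {
    out[i] = w & 0xF;
    w >>= 4;
  }
}

// decode the packed rotation constants (low byte first)
static void unpack_rotl(uint32_t w, unsigned out[4]) {
  for (int i = 0; i < 4; i++) {
    out[i] = w & 0xFF;
    w >>= 8;
  }
}

// convenience wrappers so each value can be checked individually
static unsigned idx_at(uint16_t w, int i)  { unsigned v[4]; unpack_idx(w, v);  return v[i]; }
static unsigned rotl_at(uint32_t w, int i) { unsigned v[4]; unpack_rotl(w, v); return v[i]; }
```

Decoding 0xC840 this way gives the column indices 0, 4, 8, 12, and 0x07080C10 gives the rotation amounts 16, 12, 8, 7.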

%define a eax
%define b edx
%define c esi
%define d edi
%define x edi

%define t0 ebp

; void QUARTERROUND(chacha_blk *blk, uint16_t index)
    push   a
    xchg   ah, al
    aam    16
    movzx  ebp, ah
    movzx  c, al

    pop    a
    aam    16
    movzx  b, ah
    movzx  a, al

    lea    a, [x+a*4]
    lea    b, [x+b*4]
    lea    c, [x+c*4]
    lea    d, [x+ebp*4]
    ; load ecx with rotate values
    ; 16, 12, 8, 7
    mov    ecx, 07080C10h
q_l1:
    mov    t0, [b]
    xor    ebx, ebx
q_l2:
    ; x[a] = PLUS(x[a],x[b]); 
    add    t0, [a]
    mov    [a], t0
    ; x[d] = ROTATE(XOR(x[d],x[a]),cl);
    ; also x[b] = ROTATE(XOR(x[b],x[c]),cl);
    xor    t0, [d]
    rol    t0, cl
    mov    [d], t0
    xchg   cl, ch
    xchg   c, a
    xchg   d, b
    dec    ebx
    jp     q_l2
    ; --------------------------------------------
    shr    ecx, 16
    jnz    q_l1

Core Function

The core function uses 20 rounds by default.

// encrypt/decrypt 64 byte block
// do not call directly, use cc_encrypt instead
void cc_core (chacha_ctx *ctx, uint8_t *in, uint32_t len)
{
  chacha_blk x;
  uint32_t   i, j;

  // copy state to local space
  for (i=0; i<16; i++) { 
    x.v32[i] = ctx->state.v32[i];
  }
  // apply 20 rounds
  for (i=0; i<20; i+=2) {
    for (j=0; j<8; j++) {
      QUARTERROUND(&x, idx16[j]);
    }
  }
  // add state to x
  for (i=0; i<16; i++) {
    x.v32[i] += ctx->state.v32[i];
  }
  // xor input with x
  for (i=0; i<len; i++) {
    ((uint8_t*)in)[i] ^= x.v8[i];
  }
  // update block counter
  // stopping at 2^70 bytes per nonce is user's responsibility
  if (!++ctx->state.v32[12]) {
    ctx->state.v32[13]++;
  }
}

The following is the assembly version.

; void cc_corex (chacha_ctx *ctx, void *in, uint32_t len)
; do not call directly
; expects state in ebx, length in eax, input in edx

    ; allocate 64-bytes local space, x
    push   64
    pop    ecx
    sub    esp, ecx
    ; copy state to x
    mov    edi, esp
    mov    esi, ebx
    rep    movsb

    ; move x into edi
    mov    edi, esp
    push   eax
    push   20/2  ; 20 rounds
    pop    ebp
    ; load indexes
    call   load_idx
    dw     0c840H, 0d951H
    dw     0ea62H, 0fb73H
    dw     0fa50H, 0cb61H
    dw     0d872H, 0e943H
load_idx:
    pop    esi  ; pointer to indexes
    mov    cl, 8
    loop   e_l2
    dec    ebp
    jnz    e_l1
    ; add state to x
    mov    cl, 16
add_state:
    mov    eax, [ebx+ecx*4-4]
    add    [edi+ecx*4-4], eax
    loop   add_state

    ; xor input with x
    pop    ecx               ; ecx=len
xor_input:
    mov    al, byte[edi+ecx-1]
    xor    byte[edx+ecx-1], al
    loop   xor_input
    ; update block counter
    inc    dword[ebx+12*4]
    adc    dword[ebx+13*4], ecx
    ; free stack
    add    esp, 64
    ret

Encryption/Decryption Function

// encrypt stream of bytes
void cc_encrypt (chacha_ctx *ctx, void *in, uint32_t len) 
{
  uint32_t r;
  uint8_t  *p=(uint8_t*)in;

  while (len) {
    r=(len>64) ? 64 : len;
    cc_core(ctx, p, r);
    len -= r;
    p += r;
  }
}

The assembly translation, optimized for size:

; void cc_encrypt (chacha_ctx *ctx, void *in, uint32_t len)
    mov    ecx, [esp+32+12]  ; len
    jecxz  cc_l1             ; exit if len==0
    mov    ebx, [esp+32+4]   ; ctx
    mov    edx, [esp+32+8]   ; input
cc_l0:
    xor    eax, eax          ; r=0
    mov    al, 64
    cmp    ecx, eax          ; eax=(len>64) ? 64 : len
    cmovb  eax, ecx
    call   cc_corex
    add    edx, eax          ; p += r;
    sub    ecx, eax          ; len -= r;
    jnz    cc_l0
cc_l1:
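Putting the C routines together, here is a self-contained round-trip sketch. The chacha_ctx/chacha_blk definitions are assumptions inferred from how the article's code indexes state.v8/state.v32, and the key, nonce, and message are arbitrary test data:

```c
#include <stdint.h>
#include <string.h>

// assumed layout: 64-byte block viewable as bytes or 32-bit words
typedef union {
  uint8_t  v8[64];
  uint32_t v32[16];
} chacha_blk;

typedef struct {
  chacha_blk state;
} chacha_ctx;

#define ROTL32(v,n) (((v) << (n)) | ((v) >> (32 - (n))))

static const uint16_t idx16[8] =
  { 0xC840, 0xD951, 0xEA62, 0xFB73,
    0xFA50, 0xCB61, 0xD872, 0xE943 };

static void quarterround(chacha_blk *blk, uint16_t idx) {
  uint32_t *x = blk->v32;
  uint32_t a =  idx        & 0xF, b = (idx >>  4) & 0xF;
  uint32_t c = (idx >>  8) & 0xF, d = (idx >> 12) & 0xF;
  x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 16);
  x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 12);
  x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a],  8);
  x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c],  7);
}

static void cc_setkey(chacha_ctx *ctx, const void *key, const void *nonce) {
  ctx->state.v32[0] = 0x61707865; ctx->state.v32[1] = 0x3320646E;
  ctx->state.v32[2] = 0x79622D32; ctx->state.v32[3] = 0x6B206574;
  memcpy(&ctx->state.v8[16], key, 32);   // 256-bit key
  ctx->state.v32[12] = 0;                // 64-bit block counter
  ctx->state.v32[13] = 0;
  memcpy(&ctx->state.v32[14], nonce, 8); // 64-bit nonce
}

static void cc_core(chacha_ctx *ctx, uint8_t *in, uint32_t len) {
  chacha_blk x = ctx->state;             // copy state to local
  uint32_t i, j;
  for (i = 0; i < 20; i += 2)            // 20 rounds
    for (j = 0; j < 8; j++)
      quarterround(&x, idx16[j]);
  for (i = 0; i < 16; i++)               // add state to x
    x.v32[i] += ctx->state.v32[i];
  for (i = 0; i < len; i++)              // little-endian assumption
    in[i] ^= x.v8[i];
  if (!++ctx->state.v32[12])             // 64-bit counter update
    ctx->state.v32[13]++;
}

static void cc_encrypt(chacha_ctx *ctx, void *in, uint32_t len) {
  uint8_t *p = (uint8_t*)in;
  while (len) {
    uint32_t r = (len > 64) ? 64 : len;
    cc_core(ctx, p, r);
    len -= r; p += r;
  }
}

// round trip: encrypting twice with the same (key,nonce) restores
// the plaintext; returns 1 on success
static int cc_selftest(void) {
  uint8_t key[32], nonce[8], msg[100], buf[100];
  chacha_ctx ctx;
  int i;
  for (i = 0; i < 32; i++) key[i] = (uint8_t)i;
  memset(nonce, 0xAA, sizeof(nonce));
  for (i = 0; i < 100; i++) msg[i] = (uint8_t)(i * 7);
  memcpy(buf, msg, sizeof(buf));

  cc_setkey(&ctx, key, nonce);
  cc_encrypt(&ctx, buf, sizeof(buf));             // encrypt in place
  if (memcmp(buf, msg, sizeof(buf)) == 0) return 0;

  cc_setkey(&ctx, key, nonce);                    // reset counter/state
  cc_encrypt(&ctx, buf, sizeof(buf));             // decrypt
  return memcmp(buf, msg, sizeof(buf)) == 0;
}
```

Since the cipher is its own inverse, decryption is simply a second call to cc_encrypt after resetting the context; the 100-byte message deliberately crosses a 64-byte block boundary to exercise the counter update.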


architecture    size in bytes
x86             243
x64             n/a

There don't appear to be any significant weaknesses against this cipher, provided a sufficient number of rounds is used.

The size of the code is likely to change but may not be updated here; please refer to the linked sources for future reference.

This entry was posted in assembly, cryptography, encryption.
