Poly1305 Message Authentication Code

Introduction

Poly1305 is a cryptographic Message Authentication Code (MAC) designed and published in 2004 by Daniel J. Bernstein. It can be used to verify the data integrity and authenticity of a message.

Adam Langley has published details in RFC 7539 which uses ChaCha20 and Poly1305 for Authenticated Encryption Associated Data (AEAD) as an alternative to AES-GCM (Galois Counter Mode)

The reasons to use ChaCha20-Poly1305 instead of AES-GCM are not for lack of confidence in AES but due to its performance on ARM architectures that do not have hardware acceleration.

Dan’s ciphers

Many of Dan’s designs have become popular in the last number of years and there are some good observations on the reasons why in an essay titled On the Impending Crypto Monoculture by Peter Gutmann.

It’s not a case of Dan being a crypto king (although he does awesome work). IMHO the choice of using his algorithms comes down to a number of reasons.

  • Provably secure
  • Patent free
  • Low overhead in hardware and software
  • Simple and secure implementations already provided by author
  • Not tainted by government agencies

Dan’s designs are intended to be resistant to timing attacks, side channel analysis which AES was never intended to be.

CCM, EAX and GCM can be challenging to implement correctly. Poly1305 is simple with just addition, modular multiplication and therefore less prone to error.

Vlad Krasnov from Cloudflare offers a sound review of both poly1305 and chacha20 in It takes two to ChaCha (Poly)

About C code

As most of you might know, the codes presented here in all entries are optimized for size so if you want an implementation optimized for speed, this will certainly disappoint you.

As per usual with these entries, I’ll show code in C first and then x86 assembly. If you can spot ways to optimize the code for size, feel free to fork the project and submit a pull request with your name included in source code as author.

The C code here is derived from a reference implementation provided by Dan on his site except I’m following the design presented by Langley in RFC 7539 for ChaCha20. The RFC is easy to follow and if you want to better understand implementation, I highly recommend it.

Addition

The addition is performed on 8-bit bytes stored as 32-bit words. At the end of every number added to the accumulator is a byte: 0, 1 or 252. So rather than expecting the callee to pad a buffer with zeros, poly1305_add() takes care of this.

/**********************************************
 *
 * poly1305 addition
 *
 **********************************************/
void poly1305_add(
    uint32_t out[], 
    const uint8_t in[], 
    int inlen, 
    uint8_t last)
{
  uint32_t c, i, x = 0;
  uint8_t *p = (uint8_t*)in;

  // add in to out
  for (i=0; i<17; i++) {
    c = *p;
    c = (i == inlen) ? last : c;
    c = (i  > inlen) ? 0    : c;
    p = (i  < inlen) ? p+1  : p;
    
    x += out[i] + c; 
    out[i] = x & 255; 
    x >>= 8; 
  }
}

Even though ADC (Add with Carry) is available and in fact works fine I’ve stuck with the original reference provided by MR. Dan. Function expects accumulator in EDI, the input bytes in ESI, inlen in ECX and last byte in AL. ADC only increments p if i is less than inlen.

; ***************************************
;
; poly1305 addition
;
; al  = last 
; ecx = inlen
; edi = out
; esi = p
;
; ***************************************
_poly1305_addx:
poly1305_addx:
    pushad
    movzx  edx, al            ; edx = last
    xor    ebx, ebx           ; x = 0
    xor    ecx, ecx           ; ecx = 0    
    xor    ebp, ebp           ; i = 0
add_l0:
    movzx  eax, byte[esi]     ; c = *p 
    cmp    ebp, [esp+_ecx]    ; EFLAGS = (i - inlen)
    cmovz  eax, edx           ; c = (i == inlen) ? last : c;
    cmova  eax, ecx           ; c = (i  > inlen) ? 0    : c;
    adc    esi, ecx           ; p = (i  < inlen) ? p+1  : p;
    add    ebx, [edi]         ; x += out[i]
    add    ebx, eax           ; x += c
    mov    al, bl             ; al = x
    stosd                     ; out[i] = x & 255
    shr    ebx, 8             ; x >>= 8
    inc    ebp                ; i++
    cmp    ebp, 17 
    jb     add_l0
    popad
    ret

Modular Multiplication

The modulus is 2^130-5 or 3fffffffffffffffffffffffffffffffb and Dan explains in his paper the reason for choosing this number specifically.

I chose 2^130-5 because its sparse form makes divisions particularly easy in both software and hardware. My encoding of messages as polynomials takes advantage of the gap between 2^128 and 2^130-5.

/**********************************************
 *
 * poly1305 modular multiplication
 *
 * "P" is 2^130-5 or 3fffffffffffffffffffffffffffffffb
 *
 **********************************************/
void poly1305_mulmod(
    uint32_t acc[17],
    const uint32_t r[17])
{
  uint32_t hr[17];
  uint32_t i, j, u;

  memcpy ((uint8_t*)hr, (uint8_t*)acc, 17*4);
  
  // multiply
  for (i=0; i<17; i++) {
    u = 0;
    for (j=0; j<=i; j++) {
      u += hr[j] * r[i - j];
    }
    for (; j<17; j++) {
      u += hr[j] * r[i + (17 - j)] * 320;
    }
    acc[i] = u;
  }
  
  for (u=0, j=0; j<2; j++)
  {
    for (i=0; i<16; i++) 
    { 
      u += acc[i];
      acc[i] = u & 255; 
      u >>= 8; 
    }
    if (!j) 
    {
      u += acc[16];
      acc[16] = u & 3;
      u = (u >> 2) * 5;
    }
  }
  acc[16] += u;
}

The assembly sticks close to the original C code provided. I attempted to use a different implementation of modular multiplication but the reduction in code was negligible.

; ***************************************
;
; poly1305 modular multiplication
;
; "P" is 2^130-5 or 3fffffffffffffffffffffffffffffffb
;
; esi = acc 
; ebx = r
;
; ***************************************
poly1305_mulmodx:
_poly1305_mulmodx:
    pushad
    ; memcpy ((uint8_t*)hr, (uint8_t*)acc, 17*4); 
    push   17*4
    pop    ecx
    sub    esp, ecx
    mov    esi, edi            ; esi = acc
    mov    edi, esp            ; edi = hr
    push   esi                 ; save acc
    rep    movsb
    pop    edi                 ; edi = acc
    
    ; multiply
    mov    esi, ebx            ; esi = r
    xor    ebx, ebx            ; i = 0
mm_l0:
    xor    ecx, ecx            ; j = 0
    xor    edx, edx            ; u = 0
mm_l1:
    mov    eax, [esp+ecx*4]    ; eax = hr[j]
    mov    ebp, ebx            ; ebp = i
    sub    ebp, ecx            ; ebp = i - j
    imul   eax, [esi+ebp*4]    ; eax *= r[i - j]
    add    edx, eax            ; u += eax
    inc    ecx                 ; j++
    cmp    ecx, ebx            ; j <= i
    jbe    mm_l1
mm_l2:
    push   17
    pop    ebp
    
    cmp    ecx, ebp            ; i < 17
    jae    mm_l3

    sub    ebp, ecx            ; ebp = 17 - j          
    add    ebp, ebx            ; ebp += i
    
    push   64
    pop    eax
    inc    ah                  ; eax = 320
    imul   eax, [esi+ebp*4]    ; eax *= r[i+17-j]
    imul   eax, [esp+ecx*4]    ; eax *= hr[j]
    add    edx, eax            ; u += eax 
    inc    ecx                 ; j++
    jmp    mm_l2
mm_l3:
    mov    [edi+ebx*4], edx    ; acc[i] = u
    inc    ebx
    cmp    ebx, 17
    jb     mm_l0
    
    ; reduce
    xor    ebx, ebx
    xor    edx, edx            ; u = 0
f_lx:    
    mov    cl, 16
    mov    esi, edi            ; esi = acc
    push   edi    
f_l0:
    lodsd                      ; eax = acc[i]
    add    edx, eax            ; u += acc[i]
    movzx  eax, dl             ; eax = u & 255
    stosd                      ; acc[i] = eax
    shr    edx, 8              ; u >>= 8
    loop   f_l0
    dec    ebx                 ;  
    jnp    f_l1    
    ; ---------------
    lodsd                      ; eax = acc[16]
    add    edx, eax            ; u += eax
    movzx  eax, dl             ; 
    and    al, 3               ; al &= 3
    stosd                      ; acc[16] = eax
    shr    edx, 2              ; u >>= 2
    lea    edx, [edx+edx*4]    ; u = (u >> 2) * 5
    pop    edi
    jmp    f_lx
f_l1:    
    add    [edi], edx          ; acc[16] += u;
    add    esp, 17*4 + 4
    popad
    ret

MAC function

This is the main function called to generate 128-bit MAC of message. First r is “clamped”

I constrained r to simplify and accelerate implementations of Poly1305-AES in various contexts. A wider range of r|e.g., all 128-bit integers would allow a quantitatively better security bound, but the current bound DL=2106 will be perfectly satisfactory for the foreseeable future, whereas slower authenticator computations would not be perfectly satisfactory.

/**********************************************
 *
 * poly1305 mac function
 *
 **********************************************/
void poly1305_mac (
    uint8_t *out, 
    const uint8_t *in, 
    uint32_t inlen, 
    const uint8_t *k)
{
  uint32_t i, len, neg;
  uint32_t r[17], acc[17];
  uint8_t  minusp[16]={5};
  
  // copy r
  for (i=0; i<16; i++) {
    r[i] = k[i];
  }
  // clamp r
  r[ 3] &= 15;
  r[ 7] &= 15;
  r[11] &= 15;
  r[15] &= 15;
  
  r[ 4] &= 252;
  r[ 8] &= 252;
  r[12] &= 252;
  r[16]  = 0;

  // zero initialize accumulator
  memset ((uint8_t*)acc, 0, 17*4);

  for (;;) 
  {
    // if zero length, break
    if (inlen==0) break;
    
    // process 16 bytes or remaining
    len = inlen < 16 ? inlen : 16;
    
    // add bytes to acc
    poly1305_add (acc, in, len, 1);

    // multiply accumulator by r mod p
    poly1305_mulmod (acc, r);

    // update length and buffer position
    in    += len;
    inlen -= len;
  }

  memcpy (r, acc, sizeof(r));

  poly1305_add (acc, minusp, 16, 252);
  neg = -(acc[16] >> 7);
  
  for (i=0; i<17; i++) {
    acc[i] ^= neg & (r[i] ^ acc[i]);
  }
  
  // add s
  poly1305_add (acc, &k[16], 16, 0);
  
  // return tag
  for (i=0; i<16; i++) {
    out[i] = acc[i];
  }
}

Summary

Assembly generated by MSVC 2010 with /O2 /Os flags results in 507 bytes. The assembly implementation is currently 339 bytes but may shrink in the future.

See here for future updates or if you want to contribute.

Advertisements
This entry was posted in assembly, cryptography, encryption, programming, security and tagged , , , . Bookmark the permalink.

One Response to Poly1305 Message Authentication Code

  1. Pingback: Asmcodes: CubeMAC128 Message Authentication Code | x86 crypto

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s