Asmcodes: PRESENT Block Cipher

Introduction

PRESENT is a 64-bit block cipher published in 2007 which provides support for key lengths of 80 and 128-bits. It was designed specifically for hardware implementation and uses a Substitution Permutation Network (SPN) structure which is similar to AES but at the same time completely different.

The difference is that PRESENT is bit oriented whereas AES is byte oriented. For that reason, PRESENT is not suitable for software implementation although I found the bit-permutation layer interesting and was just curious to see results using x86 assembly.

It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.

PRESENT is intended to be used in situations where low-power consumption and high chip efficiency is desired. It has been standardized in ISO-29192 along with CLEFIA which was designed by Sony Corporation.

Today, we’ll focus on the 128-bit key version since it has more potential for future applications than something supporting only 80-bit keys.

I’ve no doubt you can implement PRESENT in 32-bit x86 assembly but because it’s a hardware design, I decided to stick with x86-64 assembly and only implement encryption.

In future blog entries, depending on the design of an algorithm, it will be either 32 or 64-bit assembly and maybe both if suitable but I’ll avoid using MMX which was used in some past entries.

You can find a number of implementations in C and python here.

Key whitening

A linear operation seen in pretty much every block cipher designed since being used by Ron Rivest in 1984 to strengthen the DES block cipher.

It uses a simple eXclusive OR operation with the key and data.

\textbf{addRoundKey. } \text{Given round key } K_i=\kappa_{63}^i\dots \kappa_0^i \text{ for } 1\leq i\leq 32 \text{ and current }\\  \textsc{state } b_{63}\dots b_0, \text{ addRoundKey consists of the operation for } 0\leq j\leq 63,
b_j\rightarrow b_j\oplus\kappa_j^i.

Byte substitution

The non-linear layer is based on a single 4-bit S-box which was designed with hardware optimizations in mind. It replaces each nibble of 64-bits with 4-bit value from predefined table.

\textbf{sBoxlayer. }\text{The S-box used in }\textsc{present }\text{is a 4-bit to 4-bit S-box } S:\mathbb{F}^4_2 \rightarrow\mathbb{F}_2^4.\\  \text{The action of this box in hexadecimal notation is given by the following table.}

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & A & B & C & D & E & F \\\hline    P(i)  & C & 5 & 6 & B & 9 & 0 & A & D & 3 & E & F & 8 & 4 & 7 & 1 & 2  \\\hline  \end{array}

Here’s code generated by MSVC

; 62   :     hi.b[0] = sub_byte(hi.b[0]);               

  movzx edx, cl
  mov QWORD PTR hi$[rbp-48], rcx
  mov eax, edx
  and edx, 15
  shr rax, 4
  mov cl, BYTE PTR sbox$3697[rbp+rax-48]
  shl cl, 4
  or  cl, BYTE PTR sbox$3697[rbp+rdx-48]

The assembly code loads a memory pointer for sbox into the RBX register and expects the AL register to hold the value we’re replacing with XLATB.

For those of you who don’t know, the one-byte instruction XLATB locates a byte entry from a table in memory which is pointed to by the base register (BX).

Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register.

lsb:
    pop     rbx              ; rbx = sbox
    mov     cl, al           ; cl = x
    shr     al, 4            ; al = sbox[x >> 4] << 4
    xlatb                    ; 
    shl     al, 4    
    
    xchg    al, cl
    
    and     al, 15           ; al |= sbox[x & 0x0F]  
    xlatb
    or      al, cl    
    pop     rcx
    pop     rbx    
    ret
    
sub_byte:
    push    rbx 
    push    rcx 
    call    lsb
    ; sbox
    db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

Bit Permutation

A linear operation that permutates 64-bits of the ciphertext state. Each bit is repositioned according to a predefined table.

\textbf{pLayer. } \text{The bit permutation used in }\textsc{present} \text{ is given by the following table.}\\  \text{Bit i of }\textsc{state } \text{is moved to bit position } P(i).

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\    P(i)  & 0 & 16 & 32 & 48 & 1 & 17 & 33 & 49 & 2 & 18 & 34 & 50 & 3 & 19 & 35 & 51  \\\hline  \hline i & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31\\    P(i)  & 4 & 20 & 36 & 52 & 5 & 21 & 37 & 53 & 6 & 22 & 38 & 54 & 7 & 23 & 39 & 55  \\\hline  \hline i & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47\\    P(i)  & 8 & 24 & 40 & 56 & 9 & 25 & 41 & 57 & 10 & 26 & 42 & 58 & 11 & 27 & 43 & 59  \\\hline  \hline i & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63\\    P(i)  & 12 & 28 & 44 & 60 & 13 & 29 & 45 & 61 & 14 & 30 & 46 & 62 & 15 & 31 & 47 & 63  \\\hline  \end{array}

As you might already tell from the table, each bit position increases sequentially by 16 modulo 63.

To obtain the correct bit position during permutation without the use of a table, we can multiply the loop counter by 16 and modulo the result by 63.

The approach I use is to initialize a 32-bit variable R to 0x30201000. Imagine this to be a 4 byte array initialized to 0,16,32,48 which are used for shifting the first 4 bits of ciphertext state.

By incrementing each byte of the R variable every round and rotating by 8 bits, this will generate every shift value required for the permutation operation.

Here’s assembly output from MSVC for the C code above.

; 83   :     t.q = 0;    

  xor eax, eax

; 84   :     r   = 0x30201000;    

  mov r9d, 807407616        ; 30201000H

; 85   :     for (j=0; j<64; j++) {

  xor ebx, ebx
$LL3@present128@2:

; 86   :       // test bit at j'th position and shift left
; 87   :       t.q |= ((p.q >> j) & 1) << (r & 255);      

  mov rdx, QWORD PTR p$[rbp-16]
  mov ecx, ebx
  inc ebx
  shr rdx, cl
  mov ecx, r9d
  and edx, 1
  shl rdx, cl
  or  rax, rdx

; 88   :       r = ROTR32(r+1, 8);

  inc r9d
  ror r9d, 8
  cmp ebx, 64         ; 00000040H
  jl  SHORT $LL3@present128@2

One thing C can’t directly do is take advantage of the x86 status flags so the assembly code is much different.

If bit 0 of P variable is 1, a right shift by 1 will set the carry flag. Then we just avoid setting bits of t variable with Jump If Not Carry (JNC)

Bit Test and Set (BTS) will set bit in RBP based on value in CL.

Here’s handwritten assembly taking advantage of status flags and bit manipulation instructions.

;
    mov     ecx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp         ; t.q = 0;
    xor     ebx, ebx         ; j   = 0;
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rcx         ; set bit in rbp at position in cl
pse_l3:    
    inc     cl               ; r = ROTR32(r+1, 8);
    ror     ecx, 8
    
    add     bl, 4            ; j++, j < 64
    jne     pse_l2

The only other thing to mention is how the j counter represented by the BL register is updated. Notice how it’s being increased by 4 instead of 1.

64 x 4 = 256 so once we’ve permutated 64-bits of ciphertext, BL register will be zero again, hence the use of Jump If Not Equal (JNE) which is testing the Zero Flag (ZF)

While describing the loop above, I rewrote the code to use the following instead which takes advantage of ECX being zero after the loop you can see in full routine.

;
    mov     ebx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp
    mov     cl, 64
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:    
    inc     bl               ; r = ROTR32(r+1, 8);
    ror     ebx, 8
    loop    pse_l2

Key scheduling

Function creates 32 64-bit subkeys using linear/non-linear operations from a 128-bit key.

; rcx = present_ctx
; rdx = key    
present128_setkeyx:
    push    rsi
    push    rdi
    push    rbp
    push    rbx
    
    push    rcx              
    pop     rdi              ; rdi = ctx
    
    mov     rax, [rdx+8]     ; hi = key[1]
    mov     rdx, [rdx  ]     ; lo = key[0]
    
    xor     ecx, ecx         ; rnd = 0
psk:
    stosq                    ; ctx->key[rnd] = hi.q;  
    
    lea     ebx, [rcx*2+2]   ; lo.q  ^= (rnd + rnd) + 2;
    xor     rdx, rbx
    
    push    rax              ; t.q    = hi.q;
    pop     rbx
    push    rdx
    pop     rbp
    
    shl     rax, 61          ; hi.q   = (hi.q & 7) << 61;
    shr     rbp, 3           ; hi.q  |= (lo.q >> 3);
    or      rax, rbp
    
    shl     rdx, 61          ; lo.q    = (lo.q & 7) << 61;
    shr     rbx, 3           ; lo.q   |= ( t.q >> 3);
    or      rdx, rbx
    
    rol     rax, 8           ; hi.q    = ROTL64(hi.q, 8);
    call    sub_byte
    ror     rax, 8
    inc     cl
    cmp     cl, PRESENT_RNDS
    jne     psk
    jmp     pse_l4

Encryption

Assembly

; rcx = present_ctx
; rdx = buf    
present128_encryptx:
    push    rsi
    push    rdi
    push    rbp
    push    rbx
    
    push    rcx              ; rdi = present_ctx
    pop     rdi
    
    mov     rax, [rdx]       ; p.q = buf[0]
    
    push    PRESENT_RNDS-1
    pop     rcx
pse_l0:    
    xor     rax, [rdi]       ; p.q ^= ctx->key[i]
    scasq                    ; advance key position
    push    rcx              ; save i
    mov     cl, 8            ; j = 0
pse_l1:
    call    sub_byte         ; p.b[j] = sub_byte(p.b[j]); 
    ror     rax, 8           ; p.q    = ROTR64(s.q, 8);
    loop    pse_l1           ; j++ < 8
    ;
    mov     ebx, 0x30201000  ; r   = 0x30201000;
    xor     ebp, ebp
pse_l2:
    shr     rax, 1           ; CF = (p.q >> j) & 1
    jnc     pse_l3           ; no bit carried
    bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:    
    inc     bl               ; r = ROTR32(r+1, 8);
    ror     ebx, 8

    add     cl, 4            ; j++
    jne     pse_l2

    xchg    rax, rbp         ; p.q = t.q 
    pop     rcx
    loop    pse_l0 
    
    ; post whitening
    xor     rax, [rdi]  ; p.q ^= ctx->key[PRESENT_RNDS-1];
    mov     [rdx], rax  ; buf[0] = p.q
pse_l4:    
    pop     rbx    
    pop     rbp    
    pop     rdi    
    pop     rsi    
    ret

Summary

PRESENT is clearly intended for hardware and not terribly efficient in software but has some interesting features/ideas that could be used elsewhere.

The size of x86-64 assembly is approx. 188 bytes.

Thanks to 0x4d_ for \LaTeX formulas.

Advertisements
This entry was posted in assembly, cryptography, programming, security and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s