PRESENT Block Cipher

Introduction

PRESENT is a 64-bit block cipher published in 2007 that supports key lengths of 80 and 128 bits. It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe. It has been standardized in ISO/IEC 29192 along with CLEFIA, which was designed by Sony Corporation. PRESENT was designed to be implemented in hardware; however, I found the bit-permutation layer interesting and was curious to see results using 64-bit assembly. You can find a number of implementations in C and Python here.

Key mixing layer

Key mixing is a linear operation seen in pretty much every block cipher designed since Ron Rivest used it in 1984 to strengthen the DES block cipher. It's a simple eXclusive OR (XOR) of the round key with the data.

\textbf{addRoundKey. } \text{Given round key } K_i=\kappa_{63}^i\dots \kappa_0^i \text{ for } 1\leq i\leq 32 \text{ and current }\\  \textsc{state } b_{63}\dots b_0, \text{ addRoundKey consists of the operation for } 0\leq j\leq 63,
b_j\rightarrow b_j\oplus\kappa_j^i.
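In software this whole layer boils down to a single 64-bit XOR. Here's a minimal sketch; the function name add_round_key is mine, and the compact code further down performs the same operation inline as p = t ^ k0.

// addRoundKey: XOR the 64-bit round key into the state,
// flipping b_j wherever key bit kappa_j is set
typedef unsigned long long W;

W add_round_key(W state, W round_key) {
    return state ^ round_key;
}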

Nonlinear layer

The non-linear layer is based on a single 4-bit S-box, which was designed with hardware optimizations in mind. It substitutes each nibble of the 64-bit state with a 4-bit value from a predefined table.

\textbf{sBoxlayer. }\text{The S-box used in }\textsc{present }\text{is a 4-bit to 4-bit S-box } S:\mathbb{F}^4_2 \rightarrow\mathbb{F}_2^4.\\  \text{The action of this box in hexadecimal notation is given by the following table.}

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline x & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & A & B & C & D & E & F \\\hline    S[x]  & C & 5 & 6 & B & 9 & 0 & A & D & 3 & E & F & 8 & 4 & 7 & 1 & 2  \\\hline  \end{array}

B S(B x) {
    B sbox[16] =
      {0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
       0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };
    // substitute the high and low nibbles separately, then recombine
    return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

Here’s code generated by MSVC.

  movzx edx, cl
  mov QWORD PTR hi$[rbp-48], rcx
  mov eax, edx
  and edx, 15
  shr rax, 4
  mov cl, BYTE PTR sbox$3697[rbp+rax-48]
  shl cl, 4
  or  cl, BYTE PTR sbox$3697[rbp+rdx-48]

The assembly code loads a memory pointer to sbox into the RBX register and expects the AL register to hold the value we're replacing with XLATB. For those of you who don't know, the one-byte instruction XLATB locates a byte entry in a table in memory pointed to by the base register (BX, or RBX in 64-bit mode). Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register. Rather than loading the table address explicitly, sub_byte places the S-box bytes immediately after the CALL instruction, so the return address pushed onto the stack is the address of the table, which lsb then pops into RBX.

lsb:
    pop     rbx              ; rbx = address of sbox (return address)
    mov     cl, al           ; cl = x
    shr     al, 4            ; al = x >> 4
    xlatb                    ; al = sbox[x >> 4]
    shl     al, 4            ; al <<= 4

    xchg    al, cl           ; al = x, cl = high nibble result

    and     al, 15           ; al = x & 0x0F
    xlatb                    ; al = sbox[x & 0x0F]
    or      al, cl           ; combine both nibbles
    pop     rcx
    pop     rbx
    ret
    
sub_byte:
    push    rbx 
    push    rcx 
    call    lsb
    ; sbox
    db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

Linear layer

A linear operation that permutes the 64 bits of the cipher state. Each bit is repositioned according to a predefined table.

\textbf{pLayer. } \text{The bit permutation used in }\textsc{present} \text{ is given by the following table.}\\  \text{Bit i of }\textsc{state } \text{is moved to bit position } P(i).

\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\    P(i)  & 0 & 16 & 32 & 48 & 1 & 17 & 33 & 49 & 2 & 18 & 34 & 50 & 3 & 19 & 35 & 51  \\\hline  \hline i & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31\\    P(i)  & 4 & 20 & 36 & 52 & 5 & 21 & 37 & 53 & 6 & 22 & 38 & 54 & 7 & 23 & 39 & 55  \\\hline  \hline i & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47\\    P(i)  & 8 & 24 & 40 & 56 & 9 & 25 & 41 & 57 & 10 & 26 & 42 & 58 & 11 & 27 & 43 & 59  \\\hline  \hline i & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63\\    P(i)  & 12 & 28 & 44 & 60 & 13 & 29 & 45 & 61 & 14 & 30 & 46 & 62 & 15 & 31 & 47 & 63  \\\hline  \end{array}

As you might already tell from the table, each bit position increases sequentially by 16 modulo 63. To obtain the correct bit position during permutation without a table, we can multiply the loop counter by 16 and reduce the result modulo 63. The approach I use is to initialize a 64-bit variable r to 0x0030002000100000. Think of it as four 16-bit values initialized to 0, 16, 32 and 48, which are the shift amounts for the first four bits of the state. By incrementing the low word of r every iteration and rotating r by 16 bits, this generates every shift value required for the permutation.
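For reference, here's a table-free sketch that derives each target position directly from the 16·i mod 63 rule rather than the r variable; the function name pLayer and the loop structure are mine, not taken from the compact code below.

// Move bit i of p to position (16*i) % 63 for i < 63; bit 63 stays in place.
typedef unsigned long long W;

W pLayer(W p) {
    W t = 0;
    for (int i = 0; i < 64; i++) {
        int pos = (i == 63) ? 63 : (16 * i) % 63;
        t |= ((p >> i) & 1) << pos;
    }
    return t;
}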

      // apply linear layer
      t=0;r=0x0030002000100000;
      F(j,64)
        t|=((p>>j)&1)<<(r&255),
        r=R(r+1,16);

Here’s assembly output from GCC for the C code above.

	mov	r12, QWORD PTR 8[rsp]    ; p
	xor	r8d, r8d                 ; t = 0
	mov	rax, rbx                 ; r = 0x0030002000100000;
	xor	ebp, ebp                 ; j = 0;
.L4:
	mov	cl, bpl                  ; cl = j
	mov	rdi, r12                 ; rdi = p
	inc	rbp                      ; j++
	shr	rdi, cl                  ; rdi >>= j
	mov	cl, al                   ; cl = (r & 255)
	inc	rax                      ; r++
	and	edi, 1                   ; rdi &= 1 
	ror	rax, 16                  ; r = R(r, 16)
	sal	rdi, cl                  ; rdi <<= cl
	or	r8, rdi                  ; t |= rdi
	cmp	rbp, 64                  ; j<64
	jne	.L4

One thing C can't directly do is take advantage of the x86 status flags, so the handwritten assembly looks quite different. If bit 0 of the p variable is 1, a right shift by 1 sets the carry flag; when it isn't set, Jump If Not Carry (JNC) skips setting that bit of the t variable. Otherwise Bit Test and Set (BTS) sets a bit in RBP at the position held in CL. Here's handwritten assembly taking advantage of the status flags and bit-manipulation instructions.

    mov     ecx, 0x30201000  ; r = 0x30201000;
    xor     ebp, ebp         ; t = 0;
    xor     ebx, ebx         ; j = 0;
L2:
    shr     rax, 1           ; CF = (p >> j) & 1
    jnc     L3               ; bit was zero, skip BTS
    bts     rbp, rcx         ; set bit in rbp at position in cl
L3:
    inc     cl               ; r = ROTR32(r+1, 8);
    ror     ecx, 8

    add     bl, 4            ; j++, j < 64
    jne     L2

The only other thing to mention is how the j counter, held in the BL register, is updated. Notice that it's increased by 4 instead of 1: 64 × 4 = 256, so once we've permuted all 64 bits of the state, BL wraps back to zero, hence the use of Jump If Not Equal (JNE), which tests the Zero Flag (ZF). While writing this up, I rewrote the loop to use the following instead, which takes advantage of ECX already being zero after the S-box loop, as you can see in the full routine.

    ; r = 0x30201000
    mov     edx, 0x30201000
    ; t = 0
    xor     esi, esi
L2:
    ; t |= ((p >> j) & 1) << (r & 255);
    shr     rax, 1
    jnc     L3
    bts     rsi, rdx
L3:    
    ; r = ROTR32(r+1, 8);
    inc     dl
    ror     edx, 8
    ; j++, j < 64
    add     cl, 4
    jne     L2

Key scheduling

Thirty-two 64-bit subkeys are generated from the 128-bit master key using linear and non-linear operations. The round counter is also mixed into the key register to protect against slide attacks.

      // create subkey
      p =(k0<<61)|(k1>>3);
      k1=(k1<<61)|(k0>>3);
      p=R(p,56);
      // apply nonlinear layer
      ((B*)&p)[0]=S(((B*)&p)[0]);
      k0=R(p,8)^((i+1)>>2);
      k1^=(((i+1)& 3)<<62);
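The pair of shift/OR lines above simply rotates the 128-bit key left by 61 bits using two 64-bit words, where k0 holds the high half and k1 the low half. Here's a small sketch of that equivalence; the names key128 and rotl128_61 are my own, not part of the implementation.

// Rotate a 128-bit value (hi:lo) left by 61 bits using only 64-bit words.
// r.hi matches p  = (k0 << 61) | (k1 >> 3)
// r.lo matches k1 = (k1 << 61) | (k0 >> 3)
typedef unsigned long long W;
typedef struct { W hi, lo; } key128;

key128 rotl128_61(key128 k) {
    key128 r;
    r.hi = (k.hi << 61) | (k.lo >> 3);
    r.lo = (k.lo << 61) | (k.hi >> 3);
    return r;
}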

Compact code

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)
typedef unsigned long long W;
typedef unsigned char B;

B S(B x) {
    B sbox[16] =
      {0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
       0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };
    return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
    W i,j,r,p,t,k0,k1,*k=(W*)mk,*x=(W*)data;
    
    // load 64-bit plaintext, 128-bit master key
    k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);
  
    F(i,32-1) {
      // mix key
      p=t^k0;
      // apply nonlinear layer
      F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
      // apply linear layer
      t=0;r=0x0030002000100000;
      F(j,64)
        t|=((p>>j)&1)<<(r&255),
        r=R(r+1,16);
      // create subkey
      p =(k0<<61)|(k1>>3);
      k1=(k1<<61)|(k0>>3);
      p=R(p,56);
      // apply nonlinear layer
      ((B*)&p)[0]=S(((B*)&p)[0]);
      k0=R(p,8)^((i+1)>>2);
      k1^=(((i+1)& 3)<<62);
    }
    // mix key
    x[0] = rev(t^k0);
}
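Here's a minimal sketch of driving the compact code above: encrypt one all-zero 64-bit block with an all-zero 128-bit key and print the result. The main() harness is mine; compare its output against published test vectors rather than taking it on trust.

#include <stdio.h>

void present(void *mk, void *data);   // from the compact code above

int main(void) {
    unsigned char key[16] = {0};      // 128-bit master key
    unsigned char blk[8]  = {0};      // 64-bit block, encrypted in place
    present(key, blk);
    for (int i = 0; i < 8; i++) printf("%02x", blk[i]);
    putchar('\n');
    return 0;
}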

AMD64 assembly

; -----------------------------------------------
; PRESENT-128 block cipher in AMD64 assembly (encryption only)
;
; size: 193 bytes
;
; global calls use Linux/AMD64 ABI
;
; -----------------------------------------------
    bits 64

    %ifndef BIN
      global present
    %endif
    
    %define k0   rbp
    %define k1   rdi
    
present:
    push    rbp
    push    rbx
    push    rdi   ; save mk
    push    rsi   ; save data
    ; load 128-bit key
    ; k0=rev(k[0]); k1=rev(k[1]);
    mov     k0, [rdi]
    bswap   k0
    mov     k1, [rdi+8]
    bswap   k1
    ; load 64-bit plaintext
    ; t=rev(x[0]);
    lodsq
    bswap   rax
    ; i = 0
    xor     ebx, ebx
L0:
    ; mix key
    ; p = t ^ k0;
    xor     rax, k0
    ; F(j,8) ((B*)&p)[j] = S(((B*)&p)[j]);
    push    8
    pop     rcx
L1:
    call    S
    ror     rax, 8
    loop    L1
    ; r = 0x30201000
    mov     edx, 0x30201000
    ; t = 0
    xor     esi, esi
L2:
    ; t |= ((p >> j) & 1) << (r & 255);
    shr     rax, 1
    jnc     L3
    bts     rsi, rdx
L3:    
    ; r = ROTR32(r+1, 8);
    inc     dl
    ror     edx, 8
    ; j++, j < 64
    add     cl, 4
    jne     L2
    
    ; save t
    push    rsi
    ; p = (k0 << 61) | (k1 >> 3);
    push    k0
    pop     rax
    push    k1
    pop     rcx
    shl     rax, 61
    shr     rcx, 3
    or      rax, rcx
    ; k1 = (k1 << 61) | (k0 >> 3);
    shl     k1, 61
    shr     k0, 3
    or      k1, k0
    ; p = R(p, 56);
    ror     rax, 56
    ; apply nonlinear layer
    ; ((B*)&p)[0] = S(((B*)&p)[0]);
    call    S
    ; i++
    inc     bl
    ; k0 = R(p, 8) ^ ((i + 1) >> 2);
    mov     ecx, ebx
    shr     ecx, 2
    push    rax
    pop     k0
    ror     k0, 8
    xor     k0, rcx
    ; k1 ^= (((i + 1) & 3) << 62);
    mov     ecx, ebx
    and     ecx, 3
    shl     rcx, 62
    xor     k1, rcx
    ; restore t in rax
    pop     rax
    ; i < 32-1
    cmp     bl, 32-1
    jne     L0
    ; x[0] = rev(t ^ k0);
    xor     rax, k0
    bswap   rax
    pop     rdi
    stosq
    pop     rdi
    pop     rbx    
    pop     rbp     
    ret
    
    ; nonlinear layer
S0:
    pop     rbx              ; rbx = sbox
    mov     cl, al           ; cl = x
    shr     al, 4            ; al = sbox[x >> 4] << 4
    xlatb                    ; 
    shl     al, 4    
    xchg    al, cl
    and     al, 15           ; al |= sbox[x & 0x0F]  
    xlatb
    or      al, cl    
    pop     rcx
    pop     rbx    
    ret
S:
    push    rbx 
    push    rcx 
    call    S0
    ; sbox
    db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

ARM64 / AArch64 assembly

// PRESENT in ARM64 assembly
// 224 bytes

    .arch armv8-a
    .text
    .global present

    #define k  x0
    #define x  x1
    #define r  w2
    #define p  x3
    #define t  x4
    #define k0 x5
    #define k1 x6
    #define i  x7
    #define j  x8
    #define s  x9
    
present:
    str     lr, [sp, -16]!
    
    // k0=k[0];k1=k[1];t=x[0];
    ldp     k0, k1, [k]
    ldr     t, [x]

    // only dinosaurs use big endian convention
    rev     k0, k0
    rev     k1, k1
    rev     t, t
 
    mov     i, 0
    adr     s, sbox
L0:
    // p=t^k0;
    eor     p, t, k0
    
    // F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
    mov     j, 8
L1:
    bl      S
    ror     p, p, 8
    subs    j, j, 1
    bne     L1

    // t=0; r=0x30201000 (byte-packed form of 0x0030002000100000);
    mov     t, 0
    ldr     r, =0x30201000
    // F(j,64)
    mov     j, 0
L2:
    // t|=((p>>j)&1)<<(r&255),
    lsr     x10, p, j         // x10 = (p >> j) & 1
    and     x10, x10, 1       // 
    lsl     x10, x10, x2      // x10 << r
    orr     t, t, x10         // t |= x10
    
    // r=R(r+1,16);
    add     r, r, 1           // r = R(r+1, 8)
    ror     r, r, 8
    
    add     j, j, 1           // j++
    cmp     j, 64             // j < 64
    bne     L2

    // p =(k0<<61)|(k1>>3);
    lsr     p, k1, 3
    orr     p, p, k0, lsl 61
    
    // k1=(k1<<61)|(k0>>3);
    lsr     k0, k0, 3
    orr     k1, k0, k1, lsl 61
    
    // p=R(p,56);
    ror     p, p, 56
    bl      S

    // i++
    add     i, i, 1

    // k0=R(p,8)^((i+1)>>2);
    lsr     x10, i, 2
    eor     k0, x10, p, ror 8

    // k1^= (((i+1)&3)<<62);
    and     x10, i, 3
    eor     k1, k1, x10, lsl 62
     
    // i < 31
    cmp     i, 31
    bne     L0
    
    // x[0] = t ^= k0
    eor     p, t, k0 
    rev     p, p   
    str     p, [x]
    
    ldr     lr, [sp], 16
    ret 

S:
    ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
    ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4
    
    ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
    ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]
    
    bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
    bfi     p, x11, 4, 4
    
    ret
sbox:
    .byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    .byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

Summary

PRESENT is clearly intended for hardware and isn't terribly efficient in software, but it has some interesting features and ideas that could be used elsewhere. Thanks to 0x4d_ for submitting the \LaTeX formulas.

Sources here.
