PRESENT Block Cipher

Introduction

PRESENT is a 64-bit block cipher published in 2007 which provides support for key lengths of 80 and 128-bits. It was designed specifically for hardware implementation and uses a Substitution Permutation Network (SPN) structure which is similar to AES but at the same time completely different.

The difference is that PRESENT is bit oriented whereas AES is byte oriented. For that reason, PRESENT is not suitable for software implementation although I found the bit-permutation layer interesting and was just curious to see results using x86 assembly.

It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.

PRESENT is intended to be used in situations where low-power consumption and high chip efficiency is desired. It has been standardized in ISO-29192 along with CLEFIA which was designed by Sony Corporation.

Today, we’ll focus on the 128-bit key version since it has more potential for future applications than something supporting only 80-bit keys.

I’ve no doubt you can implement PRESENT in 32-bit x86 assembly but because it’s a hardware design, I decided to stick with x86-64 assembly and only implement encryption.

In future blog entries, depending on the design of an algorithm, it will be either 32 or 64-bit assembly and maybe both if suitable but I’ll avoid using MMX which was used in some past entries.

You can find a number of implementations in C and python here.

Key whitening

A linear operation seen in pretty much every block cipher designed since being used by Ron Rivest in 1984 to strengthen the DES block cipher.

It uses a simple eXclusive OR operation with the key and data.

$\textbf{addRoundKey. } \text{Given round key } K_i=\kappa_{63}^i\dots \kappa_0^i \text{ for } 1\leq i\leq 32 \text{ and current }\\ \textsc{state } b_{63}\dots b_0, \text{ addRoundKey consists of the operation for } 0\leq j\leq 63,$
$b_j\rightarrow b_j\oplus\kappa_j^i.$

Byte substitution

The non-linear layer is based on a single 4-bit S-box which was designed with hardware optimizations in mind. It replaces each nibble of 64-bits with 4-bit value from predefined table.

$\textbf{sBoxlayer. }\text{The S-box used in }\textsc{present }\text{is a 4-bit to 4-bit S-box } S:\mathbb{F}^4_2 \rightarrow\mathbb{F}_2^4.\\ \text{The action of this box in hexadecimal notation is given by the following table.}$

$\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & A & B & C & D & E & F \\\hline P(i) & C & 5 & 6 & B & 9 & 0 & A & D & 3 & E & F & 8 & 4 & 7 & 1 & 2 \\\hline \end{array}$

Here’s code generated by MSVC

; 62   :     hi.b[0] = sub_byte(hi.b[0]);

movzx edx, cl
mov QWORD PTR hi$[rbp-48], rcx mov eax, edx and edx, 15 shr rax, 4 mov cl, BYTE PTR sbox$3697[rbp+rax-48]
shl cl, 4
or  cl, BYTE PTR sbox$3697[rbp+rdx-48]  The assembly code loads a memory pointer for sbox into the RBX register and expects the AL register to hold the value we’re replacing with XLATB. For those of you who don’t know, the one-byte instruction XLATB locates a byte entry from a table in memory which is pointed to by the base register (BX). Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register. lsb: pop rbx ; rbx = sbox mov cl, al ; cl = x shr al, 4 ; al = sbox[x >> 4] << 4 xlatb ; shl al, 4 xchg al, cl and al, 15 ; al |= sbox[x & 0x0F] xlatb or al, cl pop rcx pop rbx ret sub_byte: push rbx push rcx call lsb ; sbox db 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd db 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2  Bit Permutation A linear operation that permutates 64-bits of the ciphertext state. Each bit is repositioned according to a predefined table. $\textbf{pLayer. } \text{The bit permutation used in }\textsc{present} \text{ is given by the following table.}\\ \text{Bit i of }\textsc{state } \text{is moved to bit position } P(i).$ $\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\ P(i) & 0 & 16 & 32 & 48 & 1 & 17 & 33 & 49 & 2 & 18 & 34 & 50 & 3 & 19 & 35 & 51 \\\hline \hline i & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31\\ P(i) & 4 & 20 & 36 & 52 & 5 & 21 & 37 & 53 & 6 & 22 & 38 & 54 & 7 & 23 & 39 & 55 \\\hline \hline i & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47\\ P(i) & 8 & 24 & 40 & 56 & 9 & 25 & 41 & 57 & 10 & 26 & 42 & 58 & 11 & 27 & 43 & 59 \\\hline \hline i & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63\\ P(i) & 12 & 28 & 44 & 60 & 13 & 29 & 45 & 61 & 14 & 30 & 46 & 62 & 15 & 31 & 47 & 63 \\\hline \end{array}$ As you might already tell from the table, each bit position increases sequentially by 16 modulo 63. To obtain the correct bit position during permutation without the use of a table, we can multiply the loop counter by 16 and modulo the result by 63. The approach I use is to initialize a 32-bit variable R to 0x30201000. Imagine this to be a 4 byte array initialized to 0,16,32,48 which are used for shifting the first 4 bits of ciphertext state. By incrementing each byte of the R variable every round and rotating by 8 bits, this will generate every shift value required for the permutation operation. Here’s assembly output from MSVC for the C code above. ; 83 : t.q = 0; xor eax, eax ; 84 : r = 0x30201000; mov r9d, 807407616 ; 30201000H ; 85 : for (j=0; j<64; j++) { xor ebx, ebx$LL3@present128@2:

; 86   :       // test bit at j'th position and shift left
; 87   :       t.q |= ((p.q >> j) & 1) << (r & 255);

mov rdx, QWORD PTR p$[rbp-16] mov ecx, ebx inc ebx shr rdx, cl mov ecx, r9d and edx, 1 shl rdx, cl or rax, rdx ; 88 : r = ROTR32(r+1, 8); inc r9d ror r9d, 8 cmp ebx, 64 ; 00000040H jl SHORT$LL3@present128@2


One thing C can’t directly do is take advantage of the x86 status flags so the assembly code is much different.

If bit 0 of P variable is 1, a right shift by 1 will set the carry flag. Then we just avoid setting bits of t variable with Jump If Not Carry (JNC)

Bit Test and Set (BTS) will set bit in RBP based on value in CL.

Here’s handwritten assembly taking advantage of status flags and bit manipulation instructions.

;
mov     ecx, 0x30201000  ; r   = 0x30201000;
xor     ebp, ebp         ; t.q = 0;
xor     ebx, ebx         ; j   = 0;
pse_l2:
shr     rax, 1           ; CF = (p.q >> j) & 1
jnc     pse_l3           ; no bit carried
bts     rbp, rcx         ; set bit in rbp at position in cl
pse_l3:
inc     cl               ; r = ROTR32(r+1, 8);
ror     ecx, 8

add     bl, 4            ; j++, j < 64
jne     pse_l2


The only other thing to mention is how the j counter represented by the BL register is updated. Notice how it’s being increased by 4 instead of 1.

64 x 4 = 256 so once we’ve permutated 64-bits of ciphertext, BL register will be zero again, hence the use of Jump If Not Equal (JNE) which is testing the Zero Flag (ZF)

While describing the loop above, I rewrote the code to use the following instead which takes advantage of ECX being zero after the loop you can see in full routine.

;
mov     ebx, 0x30201000  ; r   = 0x30201000;
xor     ebp, ebp
mov     cl, 64
pse_l2:
shr     rax, 1           ; CF = (p.q >> j) & 1
jnc     pse_l3           ; no bit carried
bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:
inc     bl               ; r = ROTR32(r+1, 8);
ror     ebx, 8
loop    pse_l2


Key scheduling

Function creates 32 64-bit subkeys using linear/non-linear operations from a 128-bit key.

; rcx = present_ctx
; rdx = key
present128_setkeyx:
push    rsi
push    rdi
push    rbp
push    rbx

push    rcx
pop     rdi              ; rdi = ctx

mov     rax, [rdx+8]     ; hi = key[1]
mov     rdx, [rdx  ]     ; lo = key[0]

xor     ecx, ecx         ; rnd = 0
psk:
stosq                    ; ctx->key[rnd] = hi.q;

lea     ebx, [rcx*2+2]   ; lo.q  ^= (rnd + rnd) + 2;
xor     rdx, rbx

push    rax              ; t.q    = hi.q;
pop     rbx
push    rdx
pop     rbp

shl     rax, 61          ; hi.q   = (hi.q & 7) << 61;
shr     rbp, 3           ; hi.q  |= (lo.q >> 3);
or      rax, rbp

shl     rdx, 61          ; lo.q    = (lo.q & 7) << 61;
shr     rbx, 3           ; lo.q   |= ( t.q >> 3);
or      rdx, rbx

rol     rax, 8           ; hi.q    = ROTL64(hi.q, 8);
call    sub_byte
ror     rax, 8
inc     cl
cmp     cl, PRESENT_RNDS
jne     psk
jmp     pse_l4


Assembly

; rcx = present_ctx
; rdx = buf
present128_encryptx:
push    rsi
push    rdi
push    rbp
push    rbx

push    rcx              ; rdi = present_ctx
pop     rdi

mov     rax, [rdx]       ; p.q = buf[0]

push    PRESENT_RNDS-1
pop     rcx
pse_l0:
xor     rax, [rdi]       ; p.q ^= ctx->key[i]
push    rcx              ; save i
mov     cl, 8            ; j = 0
pse_l1:
call    sub_byte         ; p.b[j] = sub_byte(p.b[j]);
ror     rax, 8           ; p.q    = ROTR64(s.q, 8);
loop    pse_l1           ; j++ < 8
;
mov     ebx, 0x30201000  ; r   = 0x30201000;
xor     ebp, ebp
pse_l2:
shr     rax, 1           ; CF = (p.q >> j) & 1
jnc     pse_l3           ; no bit carried
bts     rbp, rbx         ; set bit in rbp at position in bl
pse_l3:
inc     bl               ; r = ROTR32(r+1, 8);
ror     ebx, 8

jne     pse_l2

xchg    rax, rbp         ; p.q = t.q
pop     rcx
loop    pse_l0

; post whitening
xor     rax, [rdi]  ; p.q ^= ctx->key[PRESENT_RNDS-1];
mov     [rdx], rax  ; buf[0] = p.q
pse_l4:
pop     rbx
pop     rbp
pop     rdi
pop     rsi
ret


Summary

PRESENT is clearly intended for hardware and not terribly efficient in software but has some interesting features/ideas that could be used elsewhere.

The size of x86-64 assembly is approx. 188 bytes.

Thanks to 0x4d_ for $\LaTeX$ formulas.