## PRESENT Block Cipher

### Introduction

PRESENT is a 64-bit block cipher published in 2007 that supports key lengths of 80 and 128-bits. It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe. It has been standardized in ISO-29192 along with CLEFIA which was designed by Sony Corporation. PRESENT was designed to be implemented on hardware. However, I found the bit-permutation layer interesting and was curious to see results using 64-bit assembly. You can find a number of implementations in C and python here.

### Key mixing layer

A linear operation seen in pretty much every block cipher designed since being used by Ron Rivest in 1984 to strengthen the DES block cipher. It uses a simple eXclusive OR operation with the key and data.

$\textbf{addRoundKey. } \text{Given round key } K_i=\kappa_{63}^i\dots \kappa_0^i \text{ for } 1\leq i\leq 32 \text{ and current }\\ \textsc{state } b_{63}\dots b_0, \text{ addRoundKey consists of the operation for } 0\leq j\leq 63,$
$b_j\rightarrow b_j\oplus\kappa_j^i.$

### Nonlinear layer

The non-linear layer is based on a single 4-bit S-box which was designed with hardware optimizations in mind. It replaces each nibble of 64-bits with 4-bit value from predefined table.

$\textbf{sBoxlayer. }\text{The S-box used in }\textsc{present }\text{is a 4-bit to 4-bit S-box } S:\mathbb{F}^4_2 \rightarrow\mathbb{F}_2^4.\\ \text{The action of this box in hexadecimal notation is given by the following table.}$

$\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & A & B & C & D & E & F \\\hline P(i) & C & 5 & 6 & B & 9 & 0 & A & D & 3 & E & F & 8 & 4 & 7 & 1 & 2 \\\hline \end{array}$

B S(B x) {
B sbox[16] =
{0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };
return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}


Here’s code generated by MSVC.

  movzx edx, cl
mov QWORD PTR hi$[rbp-48], rcx mov eax, edx and edx, 15 shr rax, 4 mov cl, BYTE PTR sbox$3697[rbp+rax-48]
shl cl, 4
or  cl, BYTE PTR sbox\$3697[rbp+rdx-48]


The assembly code loads a memory pointer for sbox into the RBX register and expects the AL register to hold the value we’re replacing with XLATB. For those of you who don’t know, the one-byte instruction XLATB locates a byte entry from a table in memory which is pointed to by the base register (BX). Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register.

lsb:
pop     rbx              ; rbx = sbox
mov     cl, al           ; cl = x
shr     al, 4            ; al = sbox[x >> 4] << 4
xlatb                    ;
shl     al, 4

xchg    al, cl

and     al, 15           ; al |= sbox[x & 0x0F]
xlatb
or      al, cl
pop     rcx
pop     rbx
ret

sub_byte:
push    rbx
push    rcx
call    lsb
; sbox
db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2


### Linear layer

A linear operation that permutates 64-bits of the ciphertext state. Each bit is repositioned according to a predefined table.

$\textbf{pLayer. } \text{The bit permutation used in }\textsc{present} \text{ is given by the following table.}\\ \text{Bit i of }\textsc{state } \text{is moved to bit position } P(i).$

$\begin{array}{|c||c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15\\ P(i) & 0 & 16 & 32 & 48 & 1 & 17 & 33 & 49 & 2 & 18 & 34 & 50 & 3 & 19 & 35 & 51 \\\hline \hline i & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31\\ P(i) & 4 & 20 & 36 & 52 & 5 & 21 & 37 & 53 & 6 & 22 & 38 & 54 & 7 & 23 & 39 & 55 \\\hline \hline i & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47\\ P(i) & 8 & 24 & 40 & 56 & 9 & 25 & 41 & 57 & 10 & 26 & 42 & 58 & 11 & 27 & 43 & 59 \\\hline \hline i & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 & 56 & 57 & 58 & 59 & 60 & 61 & 62 & 63\\ P(i) & 12 & 28 & 44 & 60 & 13 & 29 & 45 & 61 & 14 & 30 & 46 & 62 & 15 & 31 & 47 & 63 \\\hline \end{array}$

As you might already tell from the table, each bit position increases sequentially by 16 modulo 63. To obtain the correct bit position during permutation without the use of a table, we can multiply the loop counter by 16 and modulo the result by 63. The approach I use is to initialize a 64-bit variable r to 0x0030002000100000. Imagine this to be an 8 byte array initialized to 0,16,32,48 that are used for shifting the first 4 bits of ciphertext state. By incrementing each byte of r every round and rotating by 16 bits, this will generate every shift value required for the permutation operation.

      // apply linear layer
t=0;r=0x0030002000100000;
F(j,64)
t|=((p>>j)&1)<<(r&255),
r=R(r+1,16);


Here’s assembly output from GCC for the C code above.

	mov	r12, QWORD PTR 8[rsp]    ; p
xor	r8d, r8d                 ; t = 0
mov	rax, rbx                 ; r = 0x0030002000100000;
xor	ebp, ebp                 ; j = 0;
.L4:
mov	cl, bpl                  ; cl = j
mov	rdi, r12                 ; rdi = p
inc	rbp                      ; j++
shr	rdi, cl                  ; rdi >>= j
mov	cl, al                   ; cl = (r & 255)
inc	rax                      ; r++
and	edi, 1                   ; rdi &= 1
ror	rax, 16                  ; r = R(r, 16)
sal	rdi, cl                  ; rdi <<= cl
or	r8, rdi                  ; t |= rdi
cmp	rbp, 64                  ; j<64
jne	.L4


One thing C can’t directly do is take advantage of the x86 status flags so the assembly code is much different. If bit 0 of P variable is 1, a right shift by 1 will set the carry flag. Then we just avoid setting bits of t variable with Jump If Not Carry (JNC) Bit Test and Set (BTS) will set bit in RBP based on value in CL. Here’s handwritten assembly taking advantage of status flags and bit manipulation instructions.

;
mov     ecx, 0x30201000  ; r = 0x30201000;
xor     ebp, ebp         ; t = 0;
xor     ebx, ebx         ; j = 0;
L3:
shr     rax, 1           ; CF = (p >> j) & 1
jnc     L3           ; no bit carried
bts     rbp, rcx         ; set bit in rbp at position in cl
L3:
inc     cl               ; r = ROTR32(r+1, 8);
ror     ecx, 8

add     bl, 4            ; j++, j < 64
jne     L2


The only other thing to mention is how the j counter represented by the BL register is updated. Notice how it’s being increased by 4 instead of 1. 64 x 4 = 256 so once we’ve permutated 64-bits of ciphertext, BL register will be zero again, hence the use of Jump If Not Equal (JNE) which is testing the Zero Flag (ZF). While describing the loop above, I rewrote the code to use the following instead which takes advantage of ECX being zero after the loop you can see in full routine.

    ; r = 0x30201000
mov     edx, 0x30201000
; t = 0
xor     esi, esi
L2:
; t |= ((p >> j) & 1) << (r & 255);
shr     rax, 1
jnc     L3
bts     rsi, rdx
L3:
; r = ROTR32(r+1, 8);
inc     dl
ror     edx, 8
; j++, j < 64
jne     L2


### Key scheduling

32 64-bit subkeys are generated using linear/non-linear operations on a 128-bit master key. It also includes the round counter to protect against slide attacks.

      // create subkey
p =(k0<<61)|(k1>>3);
k1=(k1<<61)|(k0>>3);
p=R(p,56);
// apply nonlinear layer
((B*)&p)[0]=S(((B*)&p)[0]);
k0=R(p,8)^((i+1)>>2);
k1^=(((i+1)& 3)<<62);


### Compact code

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)
typedef unsigned long long W;
typedef unsigned char B;

B S(B x) {
B sbox[16] =
{0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };
return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
W i,j,r,p,t,t2,k0,k1,*k=(W*)mk,*x=(W*)data;

// load 64-bit plaintext, 128-bit master key
k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);

F(i,32-1) {
// mix key
p=t^k0;
// apply nonlinear layer
F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
// apply linear layer
t=0;r=0x0030002000100000;
F(j,64)
t|=((p>>j)&1)<<(r&255),
r=R(r+1,16);
// create subkey
p =(k0<<61)|(k1>>3);
k1=(k1<<61)|(k0>>3);
p=R(p,56);
// apply nonlinear layer
((B*)&p)[0]=S(((B*)&p)[0]);
k0=R(p,8)^((i+1)>>2);
k1^=(((i+1)& 3)<<62);
}
// mix key
x[0] = rev(t^k0);
}



### AMD64 assembly

; -----------------------------------------------
; PRESENT-128 block cipher in AMD64 assembly (encryption only)
;
; size: 193 bytes
;
; global calls use Linux/AMD64 ABI
;
; -----------------------------------------------
bits 64

%ifndef BIN
global present
%endif

%define k0   rbp
%define k1   rdi

present:
push    rbp
push    rbx
push    rdi   ; save mk
push    rsi   ; save data
; k0=rev(k[0]); k1=rev(k[1]);
mov     k0, [rdi]
bswap   k0
mov     k1, [rdi+8]
bswap   k1
; t=rev(x[0]);
lodsq
bswap   rax
; i = 0
xor     ebx, ebx
L0:
; mix key
; p = t ^ k0;
xor     rax, k0
; F(j,8) ((B*)&p)[j] = S(((B*)&p)[j]);
push    8
pop     rcx
L1:
call    S
ror     rax, 8
loop    L1
; r = 0x30201000
mov     edx, 0x30201000
; t = 0
xor     esi, esi
L2:
; t |= ((p >> j) & 1) << (r & 255);
shr     rax, 1
jnc     L3
bts     rsi, rdx
L3:
; r = ROTR32(r+1, 8);
inc     dl
ror     edx, 8
; j++, j < 64
jne     L2

; save t
push    rsi
; p = (k0 << 61) | (k1 >> 3);
push    k0
pop     rax
push    k1
pop     rcx
shl     rax, 61
shr     rcx, 3
or      rax, rcx
; k1 = (k1 << 61) | (k0 >> 3);
shl     k1, 61
shr     k0, 3
or      k1, k0
; p = R(p, 56);
ror     rax, 56
; apply nonlinear layer
; ((B*)&p)[0] = S(((B*)&p)[0]);
call    S
; i++
inc     bl
; k0 = R(p, 8) ^ ((i + 1) >> 2);
mov     ecx, ebx
shr     ecx, 2
push    rax
pop     k0
ror     k0, 8
xor     k0, rcx
; k1 ^= (((i + 1) & 3) << 62);
mov     ecx, ebx
and     ecx, 3
shl     rcx, 62
xor     k1, rcx
; restore t in rax
pop     rax
; i < 32-1
cmp     bl, 32-1
jne     L0
; x[0] = rev(t ^ k0);
xor     rax, k0
bswap   rax
pop     rdi
stosq
pop     rdi
pop     rbx
pop     rbp
ret

; nonlinear layer
S0:
pop     rbx              ; rbx = sbox
mov     cl, al           ; cl = x
shr     al, 4            ; al = sbox[x >> 4] << 4
xlatb                    ;
shl     al, 4
xchg    al, cl
and     al, 15           ; al |= sbox[x & 0x0F]
xlatb
or      al, cl
pop     rcx
pop     rbx
ret
S:
push    rbx
push    rcx
call    S0
; sbox
db      0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
db      0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2


### ARM64 / AArch64 assembly

// PRESENT in ARM64 assembly
// 224 bytes

.arch armv8-a
.text
.global present

#define k  x0
#define x  x1
#define r  w2
#define p  x3
#define t  x4
#define k0 x5
#define k1 x6
#define i  x7
#define j  x8
#define s  x9

present:
str     lr, [sp, -16]!

// k0=k[0];k1=k[1];t=x[0];
ldp     k0, k1, [k]
ldr     t, [x]

// only dinosaurs use big endian convention
rev     k0, k0
rev     k1, k1
rev     t, t

mov     i, 0
L0:
// p=t^k0;
eor     p, t, k0

// F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
mov     j, 8
L1:
bl      S
ror     p, p, 8
subs    j, j, 1
bne     L1

// t=0;r=0x0030002000100000;
mov     t, 0
ldr     r, =0x30201000
// F(j,64)
mov     j, 0
L2:
// t|=((p>>j)&1)<<(r&255),
lsr     x10, p, j         // x10 = (p >> j) & 1
and     x10, x10, 1       //
lsl     x10, x10, x2      // x10 << r
orr     t, t, x10         // t |= x10

// r=R(r+1,16);
add     r, r, 1           // r = R(r+1, 8)
ror     r, r, 8

add     j, j, 1           // j++
cmp     j, 64             // j < 64
bne     L2

// p =(k0<<61)|(k1>>3);
lsr     p, k1, 3
orr     p, p, k0, lsl 61

// k1=(k1<<61)|(k0>>3);
lsr     k0, k0, 3
orr     k1, k0, k1, lsl 61

// p=R(p,56);
ror     p, p, 56
bl      S

// i++

// k0=R(p,8)^((i+1)>>2);
lsr     x10, i, 2
eor     k0, x10, p, ror 8

// k1^= (((i+1)&3)<<62);
and     x10, i, 3
eor     k1, k1, x10, lsl 62

// i < 31
cmp     i, 31
bne     L0

// x[0] = t ^= k0
eor     p, t, k0
rev     p, p
str     p, [x]

ldr     lr, [sp], 16
ret

S:
ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4

ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]

bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
bfi     p, x11, 4, 4

ret
sbox:
.byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
.byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2


### Summary

PRESENT is clearly intended for hardware and not terribly efficient in software but has some interesting features/ideas that could be used elsewhere. Thanks to 0x4d_ for submitting $\LaTeX$ formulas.

Sources here.

This entry was posted in assembly, cryptography, programming, security and tagged , , , , . Bookmark the permalink.