**Introduction**

PRESENT is a 64-bit block cipher published in 2007 which provides support for key lengths of 80 and 128-bits. It was designed specifically for hardware implementation and uses a *Substitution Permutation Network* (SPN) structure which is similar to AES but at the same time completely different.

The difference is that PRESENT is bit oriented whereas AES is byte oriented. For that reason, PRESENT is not suitable for software implementation although I found the bit-permutation layer interesting and was just curious to see results using x86 assembly.

It was designed by Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.

PRESENT is intended to be used in situations where low-power consumption and high chip efficiency is desired. It has been standardized in ISO-29192 along with CLEFIA which was designed by Sony Corporation.

Today, we’ll focus on the 128-bit key version since it has more potential for future applications than something supporting only 80-bit keys.

I’ve no doubt you can implement PRESENT in 32-bit x86 assembly but because it’s a hardware design, I decided to stick with x86-64 assembly and only implement encryption.

In future blog entries, depending on the design of an algorithm, it will be either 32 or 64-bit assembly and maybe both if suitable but I’ll avoid using MMX which was used in some past entries.

You can find a number of implementations in C and python here.

**Key whitening**

A linear operation seen in pretty much every block cipher designed since being used by Ron Rivest in 1984 to strengthen the DES block cipher.

It uses a simple eXclusive OR operation with the key and data.

**Byte substitution**

The non-linear layer is based on a single 4-bit S-box which was designed with hardware optimizations in mind. It replaces each nibble of 64-bits with 4-bit value from predefined table.

Here’s code generated by MSVC

; 62 : hi.b[0] = sub_byte(hi.b[0]); movzx edx, cl mov QWORD PTR hi$[rbp-48], rcx mov eax, edx and edx, 15 shr rax, 4 mov cl, BYTE PTR sbox$3697[rbp+rax-48] shl cl, 4 or cl, BYTE PTR sbox$3697[rbp+rdx-48]

The assembly code loads a memory pointer for sbox into the RBX register and expects the AL register to hold the value we’re replacing with XLATB.

For those of you who don’t know, the one-byte instruction XLATB locates a byte entry from a table in memory which is pointed to by the base register (BX).

Using the contents of the AL register as a table index, it then copies the contents of the table entry back into the AL register.

lsb: pop rbx ; rbx = sbox mov cl, al ; cl = x shr al, 4 ; al = sbox[x >> 4] << 4 xlatb ; shl al, 4 xchg al, cl and al, 15 ; al |= sbox[x & 0x0F] xlatb or al, cl pop rcx pop rbx ret sub_byte: push rbx push rcx call lsb ; sbox db 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd db 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

**Bit Permutation**

A linear operation that permutates 64-bits of the ciphertext state. Each bit is repositioned according to a predefined table.

As you might already tell from the table, each bit position increases sequentially by 16 modulo 63.

To obtain the correct bit position during permutation without the use of a table, we can multiply the loop counter by 16 and modulo the result by 63.

The approach I use is to initialize a 32-bit variable R to 0x30201000. Imagine this to be a 4 byte array initialized to 0,16,32,48 which are used for shifting the first 4 bits of ciphertext state.

By incrementing each byte of the R variable every round and rotating by 8 bits, this will generate every shift value required for the permutation operation.

Here’s assembly output from MSVC for the C code above.

; 83 : t.q = 0; xor eax, eax ; 84 : r = 0x30201000; mov r9d, 807407616 ; 30201000H ; 85 : for (j=0; j<64; j++) { xor ebx, ebx $LL3@present128@2: ; 86 : // test bit at j'th position and shift left ; 87 : t.q |= ((p.q >> j) & 1) << (r & 255); mov rdx, QWORD PTR p$[rbp-16] mov ecx, ebx inc ebx shr rdx, cl mov ecx, r9d and edx, 1 shl rdx, cl or rax, rdx ; 88 : r = ROTR32(r+1, 8); inc r9d ror r9d, 8 cmp ebx, 64 ; 00000040H jl SHORT $LL3@present128@2

One thing C can’t directly do is take advantage of the x86 status flags so the assembly code is much different.

If bit 0 of P variable is 1, a right shift by 1 will set the *carry flag*. Then we just avoid setting bits of t variable with *Jump If Not Carry* (JNC)

*Bit Test and Set* (BTS) will set bit in RBP based on value in CL.

Here’s handwritten assembly taking advantage of status flags and bit manipulation instructions.

; mov ecx, 0x30201000 ; r = 0x30201000; xor ebp, ebp ; t.q = 0; xor ebx, ebx ; j = 0; pse_l2: shr rax, 1 ; CF = (p.q >> j) & 1 jnc pse_l3 ; no bit carried bts rbp, rcx ; set bit in rbp at position in cl pse_l3: inc cl ; r = ROTR32(r+1, 8); ror ecx, 8 add bl, 4 ; j++, j < 64 jne pse_l2

The only other thing to mention is how the j counter represented by the BL register is updated. Notice how it’s being increased by 4 instead of 1.

64 x 4 = 256 so once we’ve permutated 64-bits of ciphertext, BL register will be zero again, hence the use of *Jump If Not Equal* (JNE) which is testing the *Zero Flag* (ZF)

While describing the loop above, I rewrote the code to use the following instead which takes advantage of ECX being zero after the loop you can see in full routine.

; mov ebx, 0x30201000 ; r = 0x30201000; xor ebp, ebp mov cl, 64 pse_l2: shr rax, 1 ; CF = (p.q >> j) & 1 jnc pse_l3 ; no bit carried bts rbp, rbx ; set bit in rbp at position in bl pse_l3: inc bl ; r = ROTR32(r+1, 8); ror ebx, 8 loop pse_l2

**Key scheduling**

Function creates 32 64-bit subkeys using linear/non-linear operations from a 128-bit key.

; rcx = present_ctx ; rdx = key present128_setkeyx: push rsi push rdi push rbp push rbx push rcx pop rdi ; rdi = ctx mov rax, [rdx+8] ; hi = key[1] mov rdx, [rdx ] ; lo = key[0] xor ecx, ecx ; rnd = 0 psk: stosq ; ctx->key[rnd] = hi.q; lea ebx, [rcx*2+2] ; lo.q ^= (rnd + rnd) + 2; xor rdx, rbx push rax ; t.q = hi.q; pop rbx push rdx pop rbp shl rax, 61 ; hi.q = (hi.q & 7) << 61; shr rbp, 3 ; hi.q |= (lo.q >> 3); or rax, rbp shl rdx, 61 ; lo.q = (lo.q & 7) << 61; shr rbx, 3 ; lo.q |= ( t.q >> 3); or rdx, rbx rol rax, 8 ; hi.q = ROTL64(hi.q, 8); call sub_byte ror rax, 8 inc cl cmp cl, PRESENT_RNDS jne psk jmp pse_l4

**Encryption**

### Assembly

; rcx = present_ctx ; rdx = buf present128_encryptx: push rsi push rdi push rbp push rbx push rcx ; rdi = present_ctx pop rdi mov rax, [rdx] ; p.q = buf[0] push PRESENT_RNDS-1 pop rcx pse_l0: xor rax, [rdi] ; p.q ^= ctx->key[i] scasq ; advance key position push rcx ; save i mov cl, 8 ; j = 0 pse_l1: call sub_byte ; p.b[j] = sub_byte(p.b[j]); ror rax, 8 ; p.q = ROTR64(s.q, 8); loop pse_l1 ; j++ < 8 ; mov ebx, 0x30201000 ; r = 0x30201000; xor ebp, ebp pse_l2: shr rax, 1 ; CF = (p.q >> j) & 1 jnc pse_l3 ; no bit carried bts rbp, rbx ; set bit in rbp at position in bl pse_l3: inc bl ; r = ROTR32(r+1, 8); ror ebx, 8 add cl, 4 ; j++ jne pse_l2 xchg rax, rbp ; p.q = t.q pop rcx loop pse_l0 ; post whitening xor rax, [rdi] ; p.q ^= ctx->key[PRESENT_RNDS-1]; mov [rdx], rax ; buf[0] = p.q pse_l4: pop rbx pop rbp pop rdi pop rsi ret

**Summary**

PRESENT is clearly intended for hardware and not terribly efficient in software but has some interesting features/ideas that could be used elsewhere.

The size of x86-64 assembly is approx. 188 bytes.

Thanks to 0x4d_ for formulas.