Twofish-256 Block cipher

Introduction

Twofish is a symmetric block cipher published in 1998. It was designed and analyzed by Bruce Schneier, John Kelsey, Doug Whiting, David Wagner, Chris Hall, and Niels Ferguson.

It was one of the 5 AES finalists but lost out to Rijndael.

Like the other AES candidates, Twofish has a 128-bit block size, a key size ranging from 128 to 256 bits, and is optimized for 32-bit CPUs. It’s unpatented and free for anyone to use, which makes it a popular alternative to AES.

Hard to believe 18 years have passed since this algorithm was published, and it’s still resistant to most cryptanalytic attacks. So which is the better algorithm: Rijndael or Twofish?

Wagner had this to say in response to that question on sci.crypt in 2004:

My advice would be to use AES, not Twofish, unless there is some special requirement that makes AES unsuitable. There’s nothing particularly wrong with Twofish — I’m pleased with the design and how it has held up — but I think AES is even better, and AES is receiving more scrutiny than any of the other finalists. This gives a powerful reason to prefer AES over Twofish (or any of the other finalists, including Serpent, for that matter).

The C code shown here was derived from an implementation by Marc Schoolderman. It has been modified significantly and is published under a BSD license, with permission from Marc for the portions of his code that were retained.

The x86 assembly, which is optimized for size, was a joint effort between asm king Peter Ferrie and myself. It’s currently 615 bytes but may shrink in the future.

To understand the building blocks of Twofish and why each function of the algorithm was chosen, it makes more sense to read the original paper published in 1998 than to have me try to explain it here, because, if I’m honest with you (and I’m only speaking for myself, not Marc or Peter), I don’t understand a lot of it.

So this won’t be a tutorial on Twofish; we’ll just look at an x86 implementation of the algorithm, optimized for size.

Whitening

The Twofish paper has this to say about it:

In our attacks on reduced-round Twofish variants, we discovered that whitening substantially increased the difficulty of attacking the cipher, by hiding from an attacker the specific inputs to the first and last rounds’ F functions.

This is a simple technique, originally proposed by Ron Rivest in 1984, which offers a cheap way to increase a cipher’s resistance to exhaustive brute-force attacks.

It’s essentially the same as AddRoundKey used in AES or blkxor shown here in Serpent.

A common brute-force attack against Microsoft LanMan hashes (derived from DES) only requires computing the first 15 rounds to initially check a key. Had DES-X, which employs whitening, been used instead, the full 16 rounds would be required, so it makes sense to use this simple bit of code to improve the security of the algorithm.

// xor 128 bits of data with 4 subkey words
void whiten (tf_blk *in, uint32_t *keys)
{
  int i;
  
  for (i=0; i<4; i++) {
    in->v32[i] ^= keys[i];
  }
}

The assembly code shown here is placed at the end of the encryption/decryption function. I’ll explain further down why some of these registers/instructions are used specifically.

; edi = keys
; esi = in
; void whiten (uint32_t *in, uint32_t *keys)
whiten:
    pushad
whiten_tail:
    mov    cl, 4
w_l1:
    mov    eax, [edi]
    xor    [esi], eax
    cmpsd                    ; advance esi and edi by 4 (compare result unused)
    loop   w_l1
    popad
    ret

MDS Matrices

Serge Vaudenay first proposed MDS matrices as a cipher design element in his 1995 paper ‘On the need for multipermutations’, after cryptanalysing MD4 and SAFER and showing the weakness of algorithms that lack multipermutations.

The MDS function is very similar to the MixColumns step used by AES (Rijndael).
Twofish uses a single 4-by-4 MDS matrix over GF(2^8). I don’t know what the heck that means, but here’s the code to do it.

// Twofish uses the values 1, 91 and 239 (0x01, 0x5B, 0xEF) in a non-circulant matrix.
uint8_t matrix[4][4] = 
{ { 0x01, 0xEF, 0x5B, 0x5B },
  { 0x5B, 0xEF, 0xEF, 0x01 },
  { 0xEF, 0x5B, 0x01, 0xEF },
  { 0xEF, 0x01, 0xEF, 0x5B } };

uint32_t mds(uint32_t w)
{
  vector   x;
  int i;
  uint32_t j, x0, y;
  vector acc;
  
  x.v32 = w;
  acc.v32 = 0;
  
  for (i=0; i<4; i++) 
  {
    for (j=0; j<4; j++) 
    {
      x0 = matrix[i][j];
      y  = x.v8[j];
      // multiply x0 by y in GF(2^8), reducing modulo the
      // MDS field polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169)
      while (y)
      {
        if (x0 > (x0 ^ 0x169))   // reduce when bit 8 is set
          x0 ^= 0x169;
        if (y & 1)               // accumulate this multiple into the result
          acc.v8[i] ^= x0;
        x0 <<= 1;
        y >>= 1;
      }
    }
  }
  return acc.v32;
}
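
If GF(2^8) sounds like gibberish (it did to me), multiplication in that field is just a carry-less multiply of two bytes followed by reduction modulo the MDS field polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169). Here’s a minimal sketch of that idea on its own; gf_mul is a hypothetical helper for illustration only, and the mds() loop above folds the same steps into its inner loop.

#include <stdint.h>

// Multiply two bytes in GF(2^8) using the Twofish MDS field
// polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169).
uint8_t gf_mul(uint8_t a, uint8_t b)
{
  uint16_t x = a, acc = 0;
  int i;
  
  for (i=0; i<8; i++) {
    if (b & 1)
      acc ^= x;        // add (xor) the current multiple of a
    x <<= 1;           // multiply by x
    if (x & 0x100)
      x ^= 0x169;      // reduce modulo the field polynomial
    b >>= 1;
  }
  return (uint8_t)acc;
}

Each output byte of mds() is then just gf_mul(matrix[i][j], x.v8[j]) xored over j.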

The asm version was originally written by Marc and optimized by Peter. The matrix is encoded as a 32-bit value in order to reduce the space used, so it looks quite different from the C version.

mds:
    pushad
_mdsx_tail:
    xchg   eax, ebx
    mov    ecx, 0357cd3ceh
    xor    edx, edx
mds_l0:
    dec    ecx
mds_l1:
    xor    dl, bl
    mov    al, bl
    shr    al, 1
    jnb    mds_l2
    xor    al, 0b4h
mds_l2:
    shl    ecx, 1
    jnb    mds_l3
    xor    dl, al
mds_l3:
    shr    al, 1
    jnb    mds_l4
    xor    al, 0b4h
mds_l4:
    shl    ecx, 1
    jnb    mds_l5
    xor    dl, al
mds_l5:
    ror    ebx, 8
    test   cl, cl
    jnz    mds_l1
    ror    edx, 8
    dec    cl
    inc    ecx
    jne    mds_l0
mds_l6:
    mov    [esp+_eax], edx
    popad
    ret

Reed-Solomon

This is another multipermutation function, first published in ‘Polynomial Codes over Certain Finite Fields’ by Irving Reed and Gustave Solomon all the way back in 1960.

You might be able to acquire the original paper if you have a login here: Journal of the Society for Industrial and Applied Mathematics.

The C code here was originally written by Wei Dai. All I did was swap the bytes of the low half to avoid some extra instructions on the x86 CPU, but it’s not translated to asm since Marc had already written a smaller version.

// compute (c * x^4) mod (x^4 + (a + 1/a) * x^3 + a * x^2 + (a + 1/a) * x + 1)
// over GF(256)
uint32_t Mod(uint32_t c)
{
  uint32_t c1, c2;
  
  c2=(c<<1) ^ ((c & 0x80) ? 0x14d : 0);
  c1=c2 ^ (c>>1) ^ ((c & 1) ? (0x14d>>1) : 0);

  return c | (c1 << 8) | (c2 << 16) | (c1 << 24);
}

// compute RS(12,8) code with the above polynomial as generator
// this is equivalent to multiplying by the RS matrix
uint32_t reedsolomon(uint64_t x)
{
  uint32_t i, low, high;
    
  low  = SWAP32(x & 0xFFFFFFFF);
  high = x >> 32;
  
  for (i=0; i<8; i++)
  {
    high = Mod(high >> 24) ^ (high << 8) ^ (low & 255);
    low >>= 8;
  }
  return high;
}

As always, this is optimized for size and is inlined to save more space. It was originally written by Marc.

; in:  edx
; out: eax = result
; uint32_t reedsolomon (uint64_t in)
reedsolomon:
    pushad
    mov    ebx, [edx]
    mov    edx, [edx+4]
    mov    cl, 88h
    jmp    rs_l1
rs_l0:
    xor    edx, ebx
rs_l1:
    rol    edx, 8
    mov    ah, dl
    shr    ah, 1
    jnb    rs_l2
    xor    ah, 0a6h
rs_l2:
    mov    al, dl
    add    al, al
    jnc    rs_l3
    xor    al, 04dh
rs_l3:
    xor    ah, al
    xor    dh, ah
    shl    eax, 16
    xor    edx, eax
    shr    cl, 1
    jnb    rs_l1
    jnz    rs_l0
    mov    [esp+_eax], edx
    popad

Pseudo-Hadamard Transforms

A pseudo-Hadamard transform (PHT) is a simple mixing operation that runs quickly in software. Given two inputs, a and b, the 32-bit PHT is defined as a' = a + b mod 2^32 and b' = a + 2b mod 2^32.

This is just two additions and is inlined in tf_enc. It occurs after the g function, using T0 and T1, as you’ll see below.
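
In C it reduces to two in-place additions, exactly what happens later in tf_enc with T0 and T1; unsigned overflow gives the reduction mod 2^32 for free. A minimal sketch, where pht is only an illustrative helper since the real code inlines it:

#include <stdint.h>

// 32-bit pseudo-Hadamard transform: a' = a + b, b' = a + 2b (mod 2^32)
void pht(uint32_t *a, uint32_t *b)
{
  *a += *b;    // a' = a + b
  *b += *a;    // b' = a + 2b
}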

The Function g

The function g forms the heart of Twofish. The input word X is split into four bytes. Each byte is run through its own key-dependent S-box. Each S-box is bijective, takes 8 bits of input, and produces 8 bits of output.

The four results are interpreted as a vector of length 4 over GF(2^8), and multiplied by the 4×4 MDS matrix (using the field GF(2^8) for the computations). The resulting vector is interpreted as a 32-bit word which is the result of g.

// The G function
uint32_t round_g(tf_ctx *ctx, uint32_t w)
{
  vector x;
  uint32_t i;
  uint8_t *sbp;
  
  x.v32 = w;

  sbp=&ctx->sbox[0];
  
  for (i=0; i<4; i++) {
    x.v8[i] = sbp[x.v8[i]];
    sbp += 256;
  }
  return mds(x.v32);
}

The assembly code is fairly straightforward.

; eax = w
; ebx = ctx->sbox
; ecx = 0 or 1
; uint32_t round_g(tf_ctx *ctx, uint32_t w)
round_g:
    pushad
    mov    cl, 4
    add    ebx, sbox
rg_l1:
    xlatb                   ; sbp[x.v8[i]]
    ror    eax, 8
    add    ebx, 256         ; sbp += 256
    loop   rg_l1
    db     03ch             ; cmp al, xx (mask pushad)

Peter uses a masked instruction here at the end to replace a jump just before the mds function, which is a nice trick: the 3Ch byte turns the pushad at the start of mds into the immediate operand of a CMP AL instruction, so round_g falls straight through into the mds body.

The Function h

This is a function that takes two inputs, a 32-bit word X and a list L = (L0, ..., Lk-1) of k 32-bit words, and produces one word of output. This function works in k stages. In each stage, the four bytes are each passed through a fixed S-box, and xored with a byte derived from the list. Finally, the bytes are once again passed through a fixed S-box, and the four bytes are multiplied by the MDS matrix just as in g.

This was an x64 asm-to-C conversion.

uint32_t round_h(tf_ctx *ctx, uint32_t x_in, uint32_t *L)
{
  int    i, j;
  uint32_t r=0x9C53A000;
  vector x;
  uint8_t *qbp=(uint8_t*)&ctx->qbox[0][0];
  
  // spread the (byte-sized) input into all four byte positions
  x.v32 = x_in * 0x01010101;
  
  for (i=4; i>=0; i--) 
  {
    for (j=0; j<4; j++)
    {
      // each bit of r selects which of the two q-boxes to use
      r=ROTL32(r, 1);
      x.v8[j] = qbp[((r & 1) << 8) + x.v8[j]];
    }
    // xor in a word from the key list between q-box stages
    if (i>0) {
      x.v32 ^= L[(i-1)*2];
    }
  }
  return x.v32;
}

The assembly is loosely based on code by Marc, but IMUL is used here instead of ADD. It may be possible to reduce it further using ADD ecx, 0x01010101 instead of IMUL.

; uint32_t round_h(tf_ctx *ctx, uint8_t x_in, uint32_t *L)
;
; ebx = ctx
; ecx = x_in
; esi = L
round_h:
    pushad
    ; r=0x9C53A000;
    mov    ebp, 09C53A000h
    ; x.v32 = x_in * 0x01010101;
    imul   edx, ecx, 01010101h
    ; i=4
    mov    cl, 16
    lea    edi, [ebx+ecx*8+qbox-128]
rh_l1:
    ; j=0
    push   4
    pop    eax
rh_l2:
    movzx  ebx, dl
    add    ebp, ebp
    adc    bh, bh
    mov    dl, [ebx+edi]
    ror    edx, 8
    dec    eax
    jnz    rh_l2
    
    ; if (i>0)
    jecxz  mds_l6
    sub    ecx, 4
    ; x.v32 ^= L[(i-1)*2];
    xor    edx, [esi+ecx*2]
    jmp    rh_l1

Computing Q-Tables

There are two q-boxes, each 256 bytes in size, used by the h function during key expansion and the generation of the key-dependent S-boxes. While you could precompute these values and store them as 512 bytes, computing them just before key expansion requires less space.

// the 4-bit permutations used to build q0 and q1, packed two nibbles per byte
uint8_t qb[64]=
{ 0x18, 0xd7, 0xf6, 0x23, 0xb0, 0x95, 0xce, 0x4a,
  0xce, 0x8b, 0x21, 0x53, 0x4f, 0x6a, 0x07, 0xd9,
  0xab, 0xe5, 0xd6, 0x09, 0x8c, 0x3f, 0x42, 0x17,
  0x7d, 0x4f, 0x21, 0xe6, 0xb9, 0x03, 0x58, 0xac,
  0x82, 0xdb, 0x7f, 0xe6, 0x13, 0x49, 0xa0, 0x5c,
  0xe1, 0xb2, 0xc4, 0x73, 0xd6, 0x5a, 0x9f, 0x80,
  0xc4, 0x57, 0x61, 0xa9, 0xe0, 0x8d, 0xb2, 0xf3,
  0x9b, 0x15, 0x3c, 0xed, 0x46, 0xf7, 0x02, 0xa8 };

uint8_t gq(uint8_t x, uint8_t *p)
{
  uint8_t a, b, x0, x1, t;
  int8_t i;
  
  for (i=0; i<2; i++)
  {
    a = (x >> 4) ^ (x & 15);
    b = (x >> 4) ^ ((x >> 1) & 15) ^ ((x << 3) & 0x8);
    
    x0 = p[a];
    x1 = p[b+16];
    
    // if first pass, swap
    if (i==0) {
      t = x0; x0 = x1; x1 = t;
    }
    x1 <<= 4;
    x  = x0 | x1;
    p += 32;
  }
  return x;
}
  
/**
 * Computes the Q-tables
 */
void tf_init(tf_ctx *ctx) 
{
  int32_t i, j;
  uint8_t x;
  uint8_t t[256];
  uint8_t *q, *p=(uint8_t*)&t[0];
  
  for (i=0; i<64; i++) {
    x=qb[i];
    *p++ = x & 15;
    *p++ = x >> 4;
  }
  
  for (i=0; i<256; i++) 
  {
    p=(uint8_t*)&t[0];
    q=(uint8_t*)&ctx->qbox[0][0];
    
    for (j=0; j<2; j++) 
    {
      q[i] = gq((uint8_t)i, p);
      p += 64;
      q += 256;
    }
  }
}
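
Since each q-box is meant to be a permutation of 0..255, a quick way to sanity-check tf_init is to verify every value appears exactly once in each table. A small sketch, assuming the qbox member is laid out as two consecutive 256-byte tables the way the code above indexes it:

#include <stdio.h>
#include <stdint.h>

// returns 1 if both q-boxes computed by tf_init are bijective
int check_qboxes(tf_ctx *ctx)
{
  uint8_t *q=(uint8_t*)&ctx->qbox[0][0];
  int     seen[256], i, j;
  
  for (j=0; j<2; j++) {
    for (i=0; i<256; i++) seen[i]=0;
    for (i=0; i<256; i++) seen[q[j*256 + i]]++;
    for (i=0; i<256; i++) {
      if (seen[i]!=1) {
        printf ("q%d is not a permutation\n", j);
        return 0;
      }
    }
  }
  return 1;
}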

In ChaCha and BLAKE2, Peter demonstrated how to perform a double loop using the parity flag (PF), which I will always use from now on whenever applicable.

Here you can see ECX and EDX set to zero. The first DEC sets PF to 1, the second sets PF to 0, ending the loop.
If you wanted to perform a loop 3 times, you should be able to go in the opposite direction using INC instead.

The other thing to mention here is that the q-box nibbles are rearranged to work better with AAM, so we don’t end up using XCHG as was the case with the original array.

; ***********************************************
; void tf_init(tf_ctx *ctx)
; ***********************************************
tf_init:
    pushad
    mov    cl, 64
    enter  128, 0
    mov    edi, esp          ; edi = p = alloc(128)
    call   ld_qb
; qb:
    db 018h, 0d7h, 0f6h, 023h, 0b0h, 095h, 0ceh, 04ah
    db 0ceh, 08bh, 021h, 053h, 04fh, 06ah, 007h, 0d9h
    db 0abh, 0e5h, 0d6h, 009h, 08ch, 03fh, 042h, 017h
    db 07dh, 04fh, 021h, 0e6h, 0b9h, 003h, 058h, 0ach
    db 082h, 0dbh, 07fh, 0e6h, 013h, 049h, 0a0h, 05ch
    db 0e1h, 0b2h, 0c4h, 073h, 0d6h, 05ah, 09fh, 080h
    db 0c4h, 057h, 061h, 0a9h, 0e0h, 08dh, 0b2h, 0f3h
    db 09bh, 015h, 03ch, 0edh, 046h, 0f7h, 002h, 0a8h
  
ld_qb:
    pop    esi
    push   ecx
tfi_l1:
    lodsb                    ; load byte
    aam    16                ; get 2 bytes
    stosw                    ; store as 16-bit word
    loop   tfi_l1            ; do 64-bytes in esi
    pop    eax

tfi_l2:
    mov    esi, esp          ; esi = &t[0][0][0];
    lea    edi, [ebx+eax*2+32]  ; edi = &ctx->qbox[0][0]
    cdq                      ; j=0
tfi_l3:
;;    call   gq                ; gq(i, p);

; uint8_t gq (uint8_t *p, uint8_t x)
; esi = p
; ecx = x
gq:
    pushad
    xchg   eax, ecx
    xor    ecx, ecx
gq_l2:
    mov    bl, al             ; bl = x
    ; a = (x >> 4) ^ (x & 15);
    aam    16
    xor    al, ah
    ; b = (x >> 4) ^ (x >> 1) & 15 ^ (x << 3) & 0x8;
    imul   edx, ebx, 8
    shr    bl, 1
    xor    bl, ah
    xor    bl, dl
    ; ------------
    and    eax, 15
    and    ebx, 15    
    ; x0 = p[a];
    mov    ah, [esi+eax]
    ; x1 = p[b+16];
    mov    al, [esi+ebx+16]
    jecxz  gq_l3  
    xchg   al, ah
gq_l3:
    ; x1 <<= 4
    ; x = x0 | x1
    aad    16
    ; p += 32
    add    esi, 32
    ; i++
    dec    ecx
    jp     gq_l2            ; i < 2
    ; return x
    mov    byte[esp+_edx+1], al
    popad
;;    ret
    
    mov    [edi+ecx], dh     ; q[i] = gq(i, p);
    add    esi, eax          ; p += 64
    lea    edi, [edi+eax*4]  ; q += 256
    dec    edx               ; j++
    jp     tfi_l3            ; j < 2
    
    inc    cl                ; i++
    jnz    tfi_l2
    
    leave                    ; free stack
    popad
    ret

Key Expansion

This is significantly more complex than Rijndael and Serpent, especially when trying to optimize for size.
The main part that takes up space is tf_init.

void tf_setkey(tf_ctx *ctx, void *key)
{
  uint32_t key_copy[8];
  vector x;
  uint8_t *sbp;
  uint32_t *p=key_copy;
  tf_key *mk=(tf_key*)key;
  uint32_t A, B=0, T, i;
  
  tf_init(ctx);

  // copy key to local space
  for (i=0; i<32; i++) {
    ((uint8_t*)&key_copy)[i]=((uint8_t*)key)[i];
  }

  for (i=0; i<40;) 
  {
    p=key_copy;
  calc_mds:
    A = mds(round_h(ctx, i++, p++));
    // swap
    T=A; A=B; B=T;
    if (i & 1) goto calc_mds;
      
    B = ROTL32(B, 8);
    
    A += B;
    B += A;
    
    ctx->keys[i-2] = A;
    ctx->keys[i-1] = ROTL32(B, 9);
  }

  p += 4;

  for (i=0; i<4; i++) {
    *p = reedsolomon(mk->v64[i]);
     p-= 2;
  }
  
  p += 2;
  
  for (i=0; i<256; i++) {
    x.v32 = round_h(ctx, i, p);
    sbp = &ctx->sbox[0];
    do {
      sbp[i] = x.v8[0];
      sbp += 256;
      x.v32 >>= 8;
    } while (x.v32!=0);
  }
}

The asm code:

tf_setkey:
_tf_setkeyx:
    pushad
    push   32
    pop    ecx
    sub    esp, ecx
    mov    edi, esp
    mov    ebx, [edi+64+4]   ; ctx
    mov    esi, [edi+64+8]   ; key
    mov    edx, esi          ; edx=key
    
    rep    movsb
    
    call   tf_init
    
    mov    edi, ebx          ; edi=keys
sk_l1:
    ; ecx/i = 0
    mov    esi, esp          ; esi=p/key_copy
sk_l2:
    call   round_h           ; A = mds(round_h(ctx, i++, p++));
    call   mds
    add    esi, 4            ; p++
    inc    ecx               ; i++
    xchg   eax, ebp          ; swap A and B
    test   cl, 1             ; if (i & 1) goto sk_l1
    jnz    sk_l2
    
    rol    ebp, 8            ; B = ROTL32(B, 8);
    add    eax, ebp          ; A += B;
    add    ebp, eax          ; B += A;
    stosd                    ; ctx->keys[i-2] = A;
    xchg   eax, ebp
    rol    eax, 9
    stosd                    ; ctx->keys[i-1] = ROTL32(B, 9);
    cmp    ecx, 40           ; i < 40
    jnz    sk_l1
    
    add    esi, 16           ; p += 4
    mov    cl, 4             ; for (i=0; i<4; i++) {
sk_l3:
    ; *p = reedsolomon(mk->v64[i]);
;;    call   reedsolomon

; in:  edx
; out: eax = result
; uint32_t reedsolomon (uint64_t in)
reedsolomon:
    pushad
    mov    ebx, [edx]
    mov    edx, [edx+4]
    mov    cl, 88h
    jmp    rs_l1
rs_l0:
    xor    edx, ebx
rs_l1:
    rol    edx, 8
    mov    ah, dl
    shr    ah, 1
    jnb    rs_l2
    xor    ah, 0a6h
rs_l2:
    mov    al, dl
    add    al, al
    jnc    rs_l3
    xor    al, 04dh
rs_l3:
    xor    ah, al
    xor    dh, ah
    shl    eax, 16
    xor    edx, eax
    shr    cl, 1
    jnb    rs_l1
    jnz    rs_l0
    mov    [esp+_eax], edx
    popad
;;    ret

    mov    [esi], eax        ;
    add    edx, 8
    sub    esi, 8            ; p -= 2
    loop   sk_l3
    
    lodsd                    ; p++
    lodsd                    ; p++
sk_l4:                       ; for (i=0; i<256; i++) {
    call   round_h           ;   x.v32 = round_h(ctx, i, p);
    lea    edi, [ebx+sbox]   ;   sbp = &ctx->sbox[0];
sk_l5:                       ;   do {
    mov    [edi+ecx], al     ;     sbp[i] = x.v8[0];    
    add    edi, 256          ;     sbp += 256;
    shr    eax, 8            ;     x.v32 >>= 8;
    jnz    sk_l5             ;   } while (x.v32!=0);
    
    inc    cl                ; 
    jnz    sk_l4             ; }
    
    add    esp, 32
    popad
    ret

Encryption

Finally, the encryption and decryption function, which has the F and PHT functions inlined.

// encrypt/decrypt 128-bits of data
// encryption which inlines F function
void tf_enc(tf_ctx *ctx, tf_blk *data, int enc)
{
  int      i;
  uint32_t A, B, C, D, T0, T1;
  uint32_t *keys;

  whiten (data, &ctx->keys[enc*4]);
  
  keys=(uint32_t*)&ctx->keys[8];
  
  if (enc==TF_DECRYPT) {
    keys += 2*14+3;
  }
  
  // load data
  A=data->v32[0];
  B=data->v32[1];
  C=data->v32[2];
  D=data->v32[3];
  
  for (i=16; i>0; i--) 
  {
    // apply G function
    T0=round_g(ctx, A);
    T1=round_g(ctx, ROTL32(B, 8));
    
    // apply PHT
    T0 += T1;
    T1 += T0;
    
    // apply F function
    if (enc==TF_ENCRYPT)
    {
      C ^= T0 + *keys++;
      C  = ROTR32(C, 1);
      D  = ROTL32(D, 1);
      D ^= T1 + *keys++;
    } else {
      D ^= T1 + *keys--;
      D  = ROTR32(D, 1);
      C  = ROTL32(C, 1);
      C ^= T0 + *keys--;
    }
    // swap
    T0 = C; T1 = D;
    C  = A;  D = B;
    A  = T0; B = T1;
  }

  // save
  data->v32[0]=C;
  data->v32[1]=D;
  data->v32[2]=A;
  data->v32[3]=B;
  
  whiten (data, &ctx->keys[enc==TF_DECRYPT?0:4]);
}
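
For completeness, here’s a minimal round-trip sketch showing how the two C entry points fit together. The tf_ctx/tf_blk types and the TF_ENCRYPT/TF_DECRYPT constants come from the full source; the key and plaintext below are just placeholders, not official test vectors.

#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
  tf_ctx  ctx;
  tf_blk  buf, orig;
  uint8_t key[32]={0};               // placeholder 256-bit key
  
  memset (&buf, 0x41, sizeof(buf));  // placeholder 128-bit plaintext
  orig = buf;
  
  tf_setkey (&ctx, key);
  tf_enc (&ctx, &buf, TF_ENCRYPT);   // encrypt one block in place
  tf_enc (&ctx, &buf, TF_DECRYPT);   // decrypt it again
  
  printf ("round trip %s\n",
      memcmp (&buf, &orig, sizeof(buf))==0 ? "ok" : "failed");
  return 0;
}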

The direction flag (DF) is set if we’re decrypting, and EDI is advanced to the last subkey. The two SCASD instructions then subtract 8 from EDI, whereas when DF is clear they add 8.

%define A [esp+12]
%define B [esp+8]
%define C ebp
%define D esi

%define T0 edx
%define T1 eax

; encrypt or decrypt 128-bits of data
; void tf_enc(tf_ctx *ctx, tf_blk *data, int enc)
_tf_encx:
    pushad
    
    lea    esi, [esp+32+4]
    lodsd                    ; ctx
    xchg   ebx, eax
    lodsd                    ; data
    push   eax
    lodsd                    ; enc
    pop    esi
    cdq                      ; i=0
    xchg   ecx, eax
    
    mov    edi, ebx
    mov    dl, 4*4           ; 16
    jecxz  tf_l1             ; if enc==0 encrypt
    add    edi, edx          ; edi = &ctx->keys[4]
tf_l1:
    call   whiten
tf_l2:
    push   edi               ; save pointer to keys
    mov    edi, esi
    
    lodsd
    push   eax               ; A=data->v32[0];
    lodsd
    push   eax               ; B=data->v32[1];
    lodsd
    xchg   eax, C            ; C=data->v32[2];
    lodsd
    xchg   eax, D            ; D=data->v32[3];

    push   edi
    lea    edi, [ebx+edx*2]  ; edi=&ctx->keys[8]
    jecxz  tf_l3

    std                      ; DF=1 to go backwards
    add    edi, (2*14+3)*4
tf_l3:
    push   edx               ; save i
    ; apply G function
    ; T0=round_g(ctx, A);
    mov    eax, A
    call   round_g
    xchg   eax, T0
    
    ; T1=round_g(ctx, ROTL32(B, 8));
    mov    eax, B
    rol    eax, 8
    call   round_g

    ; apply PHT
    add    T0, T1            ; T0 += T1;
    add    T1, T0            ; T1 += T0;
    
    ; apply F function
    jecxz  tf_l4             ; if (ecx==TF_ENCRYPT) goto tf_l4
    
    rol    C, 1              ; C  = ROTL32(C, 1);
    add    T1, [edi]         ; D ^= T1 + *K1--;
    xor    D, T1             
    add    T0, [edi-4]       ; C ^= T0 + *K1--;
    ror    D, 1              ; D  = ROTR32(D, 1);
    xor    C, T0
    jmp    tf_l5
tf_l4:
    add    T0, [edi]
    add    T1, [edi+4]
    xor    C, T0
    ror    C, 1
    rol    D, 1
    xor    D, T1
tf_l5:
    ; edi += 8 or edi -= 8 depending on DF
    scasd
    scasd
    ; swap
    xchg   A, C
    xchg   B, D

    pop    edx               ; restore i
    dec    edx               ; i--
    jnz    tf_l3

    cld                      ; DF=0 to go forward
    pop    edi               ; restore data
    pop    edx
    pop    eax
    push   edi
    ; save
    xchg   eax, C
    stosd                    ; data->v32[0]=C;
    xchg   eax, D
    stosd                    ; data->v32[1]=D;
    xchg   eax, C
    stosd                    ; data->v32[2]=A;
    xchg   eax, edx
    stosd                    ; data->v32[3]=B;
    pop    esi
    pop    edi               ; edi = &ctx->keys[0]
    
    ; add or subtract 16 depending on enc
    mov    eax, ecx
    neg    eax
    xor    edi, eax
    add    edi, 16
    xor    edi, eax
    db     3ch               ; cmp al, xx (mask pushad)

; edi = keys
; esi = in
; void whiten (uint32_t *in, uint32_t *keys)
whiten:
    pushad
whiten_tail:
    mov    cl, 4
w_l1:
    mov    eax, [edi]
    xor    [esi], eax
    cmpsd
    loop   w_l1
    popad
    ret

Results

It is almost twice the size of the AES-256 code and slightly larger than Serpent-256, but both Twofish and Serpent use hardcoded sbox values.

architecture  size (bytes)
x86           610
x64           n/a

Check out the sources here and here.
