CubeHash 1/1-256 cryptographic hash


CubeHash is a simple cryptographic hash function designed and written in 2008 by mathematician/cryptographer Daniel J. Bernstein. It was submitted to the NIST SHA-3 competition but was not chosen as a finalist for the reasons stated below.

CubeHash is a simple, well-understood design that received extensive third-party analysis. The design is suitable for constrained environments, as it requires a small area in hardware, and little RAM or ROM in software. However, it has poor performance without the use of a vector unit, resulting in a large overhead for short messages. Moreover, no single variant for the 512-bit digest size achieves the required theoretical security with acceptable performance. Another disadvantage of the design is the existence of the symmetric properties that are arguably a potential avenue for future attacks. NIST felt that an additional year of study would not be enough to determine whether or not the symmetric properties pose a threat. Therefore, NIST felt that CubeHash could not be chosen as SHA-3 with confidence, and thus should not be selected as a finalist.


In constrained environments, CubeHash has poor performance, but requires less ROM and RAM than most of the second-round candidates.

Having looked at other algorithms and tweaking reference code to reduce space without compromising much of the security, It initially appeared to require less ROM in software than any other algorithm in round 2 which includes Keccak. I was mostly curious to see how far the code could be reduced in assembly.

Choosing parameters

The author states in specification.

CubeHashr/b-h produces an h-bit digest of a variable-length message. The digest depends on two tunable parameters r, b that allow the selection of a range of security/performance tradeoffs

Choosing high r and low b values result in the best security but with higher r would impact performance since if you were to select r=10, b=1 then the permutation function would be invoked 10 times for each byte which is not very efficient.

To minimize size for this implementation, the parameters chosen are: r=1, b=1, h=256

Be advised, the following implementation will only work with those parameters. Changing these would probably provide incorrect hashes but if you want a look at the reference code, see the authors site. or alternatively look at implementations from the SUPERCOP library.

Permutation Function

Unrolled, this is 12 loops and can be vectorized.

#define ROTL32(a,b) (((a) << (b)) | ((a) >> (32 - b)))

void transform(cube_ctx *c, int cnt)
  int      i, j, k, n;
  uint32_t y[16];

  for (n=cnt; n>0; n--) 
    for (k=7, j=2; j>0; k+=4, j--)
      for (i=15; i>=0; --i) c->x[i + 16] += c->x[i];
      for (i=15; i>=0; --i) y[i ^ (j*4)] = c->x[i];
      for (i=15; i>=0; --i) c->x[i] = ROTL32(y[i], k);
      for (i=15; i>=0; --i) c->x[i] ^= c->x[i + 16];
      for (i=15; i>=0; --i) y[i ^ j] = c->x[i + 16];
      for (i=15; i>=0; --i) c->x[i + 16] = y[i];


The original implementation requires the output digest be provided by the callee. Since I’m only interested in the 256-bit output, the code is fixed to this output length.

void cube_init(cube_ctx *c)
  int i;

  // initialize state
  for (i=0; i<32; ++i) {
    c->x[i] = 0;
  // set our fixed parameters
  c->x[0] = 32;            // 256-bit
  c->x[1] = 1;             // b=1 (block size in bytes)
  c->x[2] = 1;             // r=1 (number of rounds for each block)

  transform(c, 10);        // initial 10 rounds


void cube_update(cube_ctx *c, const void *in, uint32_t inlen)
  uint8_t  *data=(uint8_t*)in;
  uint32_t u;

  while (inlen) {
    c->x[0] ^= *data++;      // absorb byte into state
    if (!inlen--) break;     // don't process last byte
    transform(c, 1);         // transform state with 1 round


void cube_final(void *out, cube_ctx *c)
  int      i;
  uint32_t u;

  // step 1.
  c->x[0] ^= 128;           // absorb length
  transform(c, 1);
  c->x[31] ^= 1;            // absorb final bit
  transform(c, 10);         // apply final rounds
  // return 256-bit hash
  for (i=0; i<32; i++) 
    ((uint8_t*)out)[i] = c->v8[i];

The step 1 probably requires some explanation. If you look at the reference implementation, the code is

// from Final
  crypto_uint32 u;

  u = (128 >> (state->pos % 8));
  u <<= 8 * ((state->pos / 8) % 4);
  state->x[state->pos / 32] ^= u;

Because the blocksize is 1, the length byte will always be 128.


Simplicity and suitability to constrained environments is where this function wins over any other hash algorithm I’ve covered so far. The size of variables are 32-bit and because it only uses ARX instructions, makes it suitable for wide range of 32-bit architectures.

The output assembly size for this implementation is 345 bytes.

This entry was posted in assembly, cryptography, encryption, programming, security and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s