Fast Whirlpool hash in x86 assembly

Implementing the Whirlpool hash function correctly is challenging, and implementing it efficiently is harder still. The design of Whirlpool is very similar to that of the AES cipher, involving byte-wise operations, an S-box, linear algebra, and Galois field arithmetic. So unlike cryptographic functions such as MD5, whose description is straightforward, Whirlpool takes some additional mathematical knowledge to implement.

Optimizing Whirlpool requires familiarity with the basic algorithm, so in the interest of brevity I will assume that you have read and understood the Whirlpool specification paper. (If you haven’t, you should read about AES first, because the tutorials for AES are much friendlier. You can then adapt that knowledge to understand Whirlpool.)

The key to optimizing Whirlpool is to do operations not one byte at a time, but an entire 8-byte row at a time. (Similarly, AES benefits from the same optimization as applied to 4-byte columns.) The three operations SubBytes, ShiftColumns, and MixRows can be combined into a single operation: load a byte from the appropriately shifted location, use it to look up a precomputed 64-bit magic constant that fuses the effects of SubBytes and MixRows, and XOR that constant into the appropriate output row.
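The fused lookup can be sketched in C as follows. This is only an illustration of the idea, not the assembly code from this page: the names `init_tables`, `round_bytewise`, `round_fused`, and `rounds_agree` are made up for the example, the state is assumed to be stored as 64 bytes in row-major order, each row is assumed to be packed into a 64-bit word with byte 0 as the most significant, and AddRoundKey and the key schedule are left out. The sketch derives the S-box from the 4-bit mini-boxes in the specification, builds eight tables of 256 constants each, and checks that the table-based round agrees with the byte-wise one.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit mini-boxes from the Whirlpool specification; the 8-bit S-box is derived from them. */
static const uint8_t E[16] = {0x1,0xB,0x9,0xC,0xD,0x6,0xF,0x3,0xE,0x8,0x7,0x4,0xA,0x2,0x5,0x0};
static const uint8_t R[16] = {0x7,0xC,0xB,0xD,0xE,0x4,0x9,0xF,0x6,0x3,0x8,0xA,0x2,0x5,0x1,0x0};
/* First row of the circulant MixRows matrix. */
static const uint8_t MIX[8] = {1, 1, 4, 1, 8, 5, 2, 9};

static uint8_t sbox[256];
static uint64_t table[8][256];  /* table[k][x]: fused constant for a byte x taken from column k */

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
        b >>= 1;
    }
    return p;
}

/* Hypothetical helper: fill sbox[] and the eight fused tables. */
void init_tables(void) {
    uint8_t Einv[16];
    for (int i = 0; i < 16; i++) Einv[E[i]] = (uint8_t)i;
    for (int x = 0; x < 256; x++) {
        uint8_t u = E[x >> 4], l = Einv[x & 15];
        uint8_t r = R[u ^ l];
        sbox[x] = (uint8_t)(E[u ^ r] << 4 | Einv[l ^ r]);
        uint64_t row = 0;
        for (int j = 0; j < 8; j++)  /* S[x] times each matrix entry, byte 0 most significant */
            row = row << 8 | gf_mul(sbox[x], MIX[j]);
        for (int k = 0; k < 8; k++)  /* table k is table 0 rotated right by k bytes */
            table[k][x] = k == 0 ? row : (row >> (8 * k) | row << (64 - 8 * k));
    }
    assert(sbox[0] == 0x18);  /* sanity check against the first S-box entry */
}

/* Reference round: SubBytes, ShiftColumns, MixRows, one byte at a time. */
void round_bytewise(const uint8_t in[64], uint8_t out[64]) {
    uint8_t t[64];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)  /* column j is shifted down by j rows */
            t[8 * i + j] = sbox[in[8 * ((i + 8 - j) & 7) + j]];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            uint8_t acc = 0;
            for (int k = 0; k < 8; k++)
                acc ^= gf_mul(t[8 * i + k], MIX[(j + 8 - k) & 7]);
            out[8 * i + j] = acc;
        }
}

/* Fused round: each output row is the XOR of eight table lookups. */
void round_fused(const uint8_t in[64], uint64_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint64_t row = 0;
        for (int k = 0; k < 8; k++)
            row ^= table[k][in[8 * ((i + 8 - k) & 7) + k]];
        out[i] = row;
    }
}

/* Returns 1 if both rounds produce the same state for an arbitrary input. */
int rounds_agree(void) {
    uint8_t in[64], ref[64];
    uint64_t fused[8];
    for (int n = 0; n < 64; n++) in[n] = (uint8_t)(n * 37 + 11);
    round_bytewise(in, ref);
    round_fused(in, fused);
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if (ref[8 * i + j] != (uint8_t)(fused[i] >> (56 - 8 * j))) return 0;
    return 1;
}
```

Call `init_tables()` once before `rounds_agree()` or either round function. The byte-wise round performs 64 S-box lookups and 512 field multiplications, whereas the fused round needs only 64 table lookups and 64 XORs on 64-bit words, at the cost of 16 KiB of tables in this layout.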

Source code

The code comes in a number of parts:

Files:

To use this code, compile it on Linux with one of these commands:

Then run the executable with ./whirlpooltest.

Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.

Benchmark results

For the C version, I implemented only the simple byte-wise algorithm, for clarity. I did not try to produce a fast C version because it would spill registers very frequently, which would severely limit its maximum speed.

An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:

x86-64 version

All the C files work correctly without modification on x86-64. In the assembly code, I replaced the MMX registers with the general-purpose registers r8 to r15. The usage instructions are exactly the same. Here are the files:

An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:

Notes

More info
