Fast Whirlpool hash in x86 assembly
Implementing the Whirlpool hash function at all can be challenging. Implementing it efficiently is even more challenging. The design of Whirlpool is very similar to that of the AES cipher, involving byte-wise operations, an S-box, linear algebra, and Galois field arithmetic. So unlike cryptographic functions like MD5 that have a straightforward description, it takes some additional mathematical knowledge to understand how to implement Whirlpool.
Optimizing Whirlpool requires familiarity with the basic algorithm, so in the interest of brevity I will assume that you have read and understood the Whirlpool specification paper. (If you haven’t, you should read about AES first, because the tutorials for AES are much friendlier. Then you’d adapt this knowledge in order to understand Whirlpool.)
The key to optimizing Whirlpool is to do operations not one byte at a time, but an entire 8-byte row at a time. (Similarly, AES benefits from the same optimization as applied to 4-byte columns.) The three operations SubBytes, ShiftColumns, and MixRows can be combined into a single operation that loads a byte from the appropriately shifted location and XORs the appropriate row with a magic constant that fuses the effects of SubBytes and MixRows.
Source code
The code comes in a number of parts:
- Two interchangeable versions of the Whirlpool compression function:
- In x86 assembly, optimized using MMX and SSE.
- In C, provided as a straightforward, naïve reference implementation.
- The full Whirlpool hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
- A runnable main program that checks correctness and performs a speed benchmark.
Files:
To use this code, compile it on Linux with one of these commands:
gcc whirlpooltest.c whirlpool.S -o whirlpooltestgcc whirlpooltest.c whirlpool.c -o whirlpooltest
Then run the executable with ./whirlpooltest.
Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.
Benchmark results
For the C version, I only implemented the simple byte-wise algorithm for clarity. I did not try to produce a fast C version because it would end up having very frequent register spills, thus severely limiting the maximum speed.
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, GCC -O0: 0.77 MiB/s
- C code, GCC -O1: 2.77 MiB/s
- C code, GCC -O2: 2.93 MiB/s
- C code, GCC -O3: 5.57 MiB/s
- C code, GCC -O1 -fomit-frame-pointer: 2.67 MiB/s
- C code, GCC -O2 -fomit-frame-pointer: 3.00 MiB/s
- C code, GCC -O3 -fomit-frame-pointer: 5.84 MiB/s
- x86 code, GCC -O1: 58.7 MiB/s
x86-64 version
All the C files work correctly without modification on x86-64. In the assembly code, I changed the usage of MMX registers to GPRs r8 to r15. The usage instructions are exactly the same. Here are the files:
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, GCC -O0: 0.76 MiB/s
- C code, GCC -O1: 2.43 MiB/s
- C code, GCC -O2: 2.43 MiB/s
- C code, GCC -O3: 8.06 MiB/s
- x86-64 code, GCC -O0: 63.0 MiB/s
Notes
The x86 code uses all eight 64-bit MMX and all eight 128-bit XMM registers, but only 5 general-purpose registers (GPRs).
PEXTRB is a more suitable instruction to use, but it’s fairly new, being introduced in SSE 4.1. So I emulated it using the older PEXTRW instruction with pre- and post-adjustments.
AddRoundKey is performed on 128 bits per operation, which is a nice benefit of using the XMM registers. All the other Whirlpool operations can only be done 64 bits at a time.
For the x86 code, GCC -O1 is used (instead of -O0) because the benchmark loop code is written in C.
While the software implementation of Whirlpool is much slower than (e.g.) SHA-1, Whirlpool will see a much larger parallelization speedup when implemented in logic gates and circuits.
More info
- The Whirlpool Hash Function
- Official paper and reference implementation package
- Wikipedia: Whirlpool (cryptography)