Fast SHA-1 hash implementation in x86 assembly

After doing something for the first time, doing it again is much easier. The MD5 and SHA-1 hash algorithms are quite similar in structure, and having written MD5 in x86 assembly recently, applying that knowledge to implement SHA-1 in x86 was a breeze.

Source code

The code comes in a number of parts:

Four interchangeable versions of the SHA-1 compression function:
- In x86 assembly, naively (sha1-naive-x86.S).
- In x86 assembly, with reordered key schedule computation and instruction-level optimization (sha1-fast-x86.S).
- In C, naively in the manner described in the official specification.
- In C, with the key schedule computation interleaved with the hashing rounds (much like sha1-fast-x86.S).
The full SHA-1 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
A runnable main program that checks correctness and performs a speed benchmark.

Files:

To use this code, compile it on Linux with one of these commands:

cc sha1-test.c sha1-naive.c -o sha1-test (x86, x86-64)
cc sha1-test.c sha1-fast.c -o sha1-test (x86, x86-64)
cc sha1-test.c sha1-naive-x86.S -o sha1-test (x86 only)
cc sha1-test.c sha1-fast-x86.S -o sha1-test (x86 only)
cc sha1-test.c sha1-fast-x8664.S -o sha1-test (x86-64 only)

Then run the executable with ./sha1-test.

Benchmark results

Code	Compilation	Speed on x86	Speed on x86-64
C (naive)	GCC -O0	98 MiB/s	97 MiB/s
C (naive)	GCC -O1	184 MiB/s	212 MiB/s
C (naive)	GCC -O2	178 MiB/s	198 MiB/s
C (naive)	GCC -O3	177 MiB/s	199 MiB/s
C (naive)	GCC -O1 -fomit-frame-pointer	206 MiB/s
C (naive)	GCC -O2 -fomit-frame-pointer	191 MiB/s
C (naive)	GCC -O3 -fomit-frame-pointer	191 MiB/s
C (fast)	GCC -O0	111 MiB/s	111 MiB/s
C (fast)	GCC -O1	253 MiB/s	292 MiB/s
C (fast)	GCC -O2	137 MiB/s	191 MiB/s
C (fast)	GCC -O3	137 MiB/s	191 MiB/s
C (fast)	GCC -O1 -fomit-frame-pointer	269 MiB/s
C (fast)	GCC -O2 -fomit-frame-pointer	182 MiB/s
C (fast)	GCC -O3 -fomit-frame-pointer	182 MiB/s
Assembly (naive)	GCC -O1	270 MiB/s
Assembly (fast)	GCC -O0		313 MiB/s
Assembly (fast)	GCC -O1	327 MiB/s
Assembly (OpenSSL^[0])	GCC -O0	305 MiB/s

On x86, my assembly code is 1.22× as fast as my C code best compiled by GCC. On x86-64, my assembly code is 1.07× as fast as my C code best compiled by GCC.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Notes

In both MD5 and SHA-1 (and some other hashes), the round function computes (b & c) | (~b & d). There are different ways to compute this that give the same result, but the most speed-optimized way is different for MD5 compared to SHA-1.
Once again by great luck, there are just enough registers to keep all the state values, temporary values, and data pointers in registers without spilling.
Because of the key schedule computation, SHA-1 does some more computation and many more load/store operations than MD5, yet it doesn’t run much slower than MD5.
Going beyond GCC -O1 may actually hurt performance. Speed optimization involves a surprising amount of voodoo.
All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same.
[0]: OpenSSL 1.0.1c, generating sha1-586.s on Linux.

Project Nayuki

Fast SHA-1 hash implementation in x86 assembly

Source code

Benchmark results

Notes

More info