Fast SHA-1 hash implementation in x86 assembly
After doing something for the first time, doing it again is much easier. The MD5 and SHA-1 hash algorithms are quite similar in structure, and having written MD5 in x86 assembly recently, applying that knowledge to implement SHA-1 in x86 was a breeze.
Source code
The code comes in a number of parts:
- Four interchangeable versions of the SHA-1 compression function:
- In x86 assembly, naïvely (sha1-naive.S).
- In x86 assembly, with reordered key schedule computation and instruction-level optimization (sha1-fast.S).
- In C, naïvely in the manner described in the official specification.
- In C, with the key schedule computation interleaved with the hashing rounds (much like sha1-fast.S).
- The full SHA-1 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
- A runnable main program that checks correctness and performs a speed benchmark.
Files:
To use this code, compile it on Linux with one of these commands:
gcc sha1test.c sha1-naive.c -o sha1testgcc sha1test.c sha1-fast.c -o sha1testgcc sha1test.c sha1-naive.S -o sha1testgcc sha1test.c sha1-fast.S -o sha1test
Then run the executable with ./sha1test.
Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.
Benchmark results
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, naïve, GCC -O0: 98 MiB/s
- C code, naïve, GCC -O1: 184 MiB/s
- C code, naïve, GCC -O2: 178 MiB/s
- C code, naïve, GCC -O3: 177 MiB/s
- C code, fast, GCC -O0: 111 MiB/s
- C code, fast, GCC -O1: 253 MiB/s
- C code, fast, GCC -O2: 137 MiB/s
- C code, fast, GCC -O3: 137 MiB/s
- C code, naïve, GCC -O1 -fomit-frame-pointer: 206 MiB/s
- C code, naïve, GCC -O2 -fomit-frame-pointer: 191 MiB/s
- C code, naïve, GCC -O3 -fomit-frame-pointer: 191 MiB/s
- C code, fast, GCC -O1 -fomit-frame-pointer: 269 MiB/s
- C code, fast, GCC -O2 -fomit-frame-pointer: 182 MiB/s
- C code, fast, GCC -O3 -fomit-frame-pointer: 182 MiB/s
- x86 code, naïve, GCC -O1: 190 MiB/s
- x86 code, fast, GCC -O1: 327 MiB/s
- OpenSSL x86 code[0], GCC -O0: 305 MiB/s
In conclusion, my best x86 code is 23% faster than my best C code compiled with GCC.
x86-64 version
All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same. Here are the files:
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, naïve, GCC -O0: 97 MiB/s
- C code, naïve, GCC -O1: 212 MiB/s
- C code, naïve, GCC -O2: 198 MiB/s
- C code, naïve, GCC -O3: 199 MiB/s
- C code, fast, GCC -O0: 111 MiB/s
- C code, fast, GCC -O1: 292 MiB/s
- C code, fast, GCC -O2: 191 MiB/s
- C code, fast, GCC -O3: 191 MiB/s
- x86-64 code, fast, GCC -O0: 313 MiB/s
Compared to x86-32 mode, the C code performs better but the assembly code performs worse.
Notes
In both MD5 and SHA-1 (and some other hashes), the round function computes
(b & c) | (~b & d). There are different ways to compute this that give the same result, but the most speed-optimized way is different for MD5 compared to SHA-1.Once again by great luck, there are just enough registers to keep all the state values, temporary values, and data pointers in registers without spilling.
Because of the key schedule computation, SHA-1 does some more computation and many more load/store operations than MD5, yet it doesn’t run much slower than MD5.
Going beyond GCC -O1 may actually hurt performance. Speed optimization involves a surprising amount of voodoo.
[0]: OpenSSL 1.0.1c, generating sha1-586.s on Linux.