Fast MD5 hash implementation in x86 assembly
For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward naïve implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.
Source code
The code comes in a number of parts:
- Three interchangeable versions of the MD5 compression function:
- In x86 assembly, naïvely (md5-naive.S).
- In x86 assembly, after extensive optimization (md5-fast.S).
- In C, for comparison with the x86 version (md5.c).
- The full MD5 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
- A runnable main program that checks correctness and performs a speed benchmark.
Files:
To use this code, compile it on Linux with one of these commands:
gcc md5test.c md5.c -o md5testgcc md5test.c md5-naive.S -o md5testgcc md5test.c md5-fast.S -o md5test
Then run the executable with ./md5test.
Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.
Benchmark results
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, GCC -O0: 122 MiB/s
- C code, GCC -O1: 379 MiB/s
- C code, GCC -O2: 387 MiB/s
- C code, GCC -O3: 387 MiB/s
- C code, GCC -O1 -fomit-frame-pointer: 382 MiB/s
- C code, GCC -O2 -fomit-frame-pointer: 389 MiB/s
- C code, GCC -O3 -fomit-frame-pointer: 390 MiB/s
- x86 code, naïve, GCC -O0: 270 MiB/s
- x86 code, fast, GCC -O1: 430 MiB/s
- OpenSSL x86 code[0], GCC -O0: 410 MiB/s
Overall, my best hand-optimized code is 10% faster than my C code compiled with the best GCC options.
x86-64 version
All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same. Here are the files:
An informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, GCC -O0: 123 MiB/s
- C code, GCC -O1: 390 MiB/s
- C code, GCC -O2: 389 MiB/s
- C code, GCC -O3: 389 MiB/s
- x86-64 code, fast, GCC -O2: 427 MiB/s
The C code has the same speed compared to x86-32 mode.
Notes
My initial naïve code was already loop-unrolled. Since my goal was to optimize for speed, it wouldn’t make sense to start with an implementation that still had loop management overhead.
I was surprised to see that my naïve x86 code ran much slower than GCC’s generated code, and that my final hand-optimized code was not too much faster than GCC’s best code.
Sometimes adding extra instructions and doing more arithmetic computations results in faster code. I think this worked because I made the data dependency graph shallower and took advantage of the superscalar processor.
When there are multiple instruction streams doing independent computations, trying the different ways of interleaving them is very important in extracting performance from the superscalar processor.
Abusing the LEA instruction to add two registers and a constant is a classic optimization trick with predictable improvement.
I had just enough registers to store all the relevant values (4 for the MD5 state, 2 for intermediate computations per round, 1 for the base of the block’s array, and 1 for the stack), which definitely made the code shorter and faster. However, I didn’t use the EBP register in the conventional way.
Both my x86 and C implementations of MD5 use C preprocessor macros to generate the repetitive code.
MD5 is an old hash algorithm with severe cryptanalytic attacks, and it should not be used anymore for any serious purpose. Please use SHA-1/256/512/3 instead.
[0]: OpenSSL 1.0.1c, generating md5-586.s on Linux.