Project Nayuki


Fast MD5 hash implementation in x86 assembly

For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward naive implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.

Source code

The code comes in a number of parts:

  • Three interchangeable versions of the MD5 compression function:
    • In x86 assembly, naively (md5-naive.S).
    • In x86 assembly, after extensive optimization (md5-fast.S).
    • In C, for comparison with the x86 version (md5.c).
  • The full MD5 hash function (in C), which handles the block splitting and tail padding. This is not CPU-intensive, because its job is only to set up the appropriate data for the compression function to process.
  • A runnable main program that checks correctness and performs a speed benchmark.

Files:

To use this code, compile it on Linux with one of these commands:

  • gcc md5test.c md5.c -o md5test (x86, x86-64)
  • gcc md5test.c md5-naive.S -o md5test (x86 only)
  • gcc md5test.c md5-fast.S -o md5test (x86 only)
  • gcc md5test.c md5-fast-64.S -o md5test (x86-64 only)

Then run the executable with ./md5test.

Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.

Benchmark results

Code Compilation Speed on x86 Speed on x86-64
CGCC -O0122 MiB/s123 MiB/s
CGCC -O1379 MiB/s390 MiB/s
CGCC -O2387 MiB/s389 MiB/s
CGCC -O3387 MiB/s389 MiB/s
CGCC -O1 -fomit-frame-pointer382 MiB/s
CGCC -O2 -fomit-frame-pointer389 MiB/s
CGCC -O3 -fomit-frame-pointer390 MiB/s
Assembly (naive)GCC -O0270 MiB/s
Assembly (fast)GCC -O1430 MiB/s
Assembly (fast)GCC -O2427 MiB/s
Assembly (OpenSSL[0])GCC -O0410 MiB/s

On both CPU architectures, my assembly code is about 1.10× as fast as my C code best compiled by GCC. Moreover, the C code and assembly code compiled with the various options have the same speed on both architectures.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Notes

  • My initial naive code was already loop-unrolled. Since my goal was to optimize for speed, it wouldn’t make sense to start with an implementation that still had loop management overhead.

  • I was surprised to see that my naive x86 code ran much slower than GCC’s generated code, and that my final hand-optimized code was not too much faster than GCC’s best code.

  • Sometimes adding extra instructions and doing more arithmetic computations results in faster code. I think this worked because I made the data dependency graph shallower and took advantage of the superscalar processor.

  • When there are multiple instruction streams doing independent computations, trying the different ways of interleaving them is very important in extracting performance from the superscalar processor.

  • Abusing the LEA instruction to add two registers and a constant is a classic optimization trick with predictable improvement.

  • I had just enough registers to store all the relevant values (4 for the MD5 state, 2 for intermediate computations per round, 1 for the base of the block’s array, and 1 for the stack), which definitely made the code shorter and faster. However, I didn’t use the EBP register in the conventional way.

  • Both my x86 and C implementations of MD5 use C preprocessor macros to generate the repetitive code.

  • x86-64 code: All the C files work correctly without modification on x86-64. I made minimal changes to the assembly code only to adapt to the calling convention and change the 32-bit constants to signed numbers. The usage instructions are exactly the same.

  • MD5 is an old hash algorithm with severe cryptanalytic attacks, and it should not be used anymore for any serious purpose. Please use SHA-1/256/512/3 instead.

  • [0]: OpenSSL 1.0.1c, generating md5-586.s on Linux.

More info