Project Nayuki


Fast MD5 hash implementation in x86 assembly

For the fun of experimentation, I wanted to see how much I could optimize my x86 MD5 hash implementation for speed. I started with a fairly straightforward naïve implementation, then reordered instructions and made equivalent logical transformations. Each successful optimization trick added a few MiB/s of speed, but after trying almost a hundred tweaks (of which about 20 succeeded), the overall result was a staggering 59% increase in speed.

Source code

The code comes in a number of parts:

Files:

To use this code, compile it on Linux with one of these commands:

Then run the executable with ./md5test.

Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.

Benchmark results

Code Compilation Speed on x86 Speed on x86-64
CGCC -O0122 MiB/s123 MiB/s
CGCC -O1379 MiB/s390 MiB/s
CGCC -O2387 MiB/s389 MiB/s
CGCC -O3387 MiB/s389 MiB/s
CGCC -O1 -fomit-frame-pointer382 MiB/s
CGCC -O2 -fomit-frame-pointer389 MiB/s
CGCC -O3 -fomit-frame-pointer390 MiB/s
Assembly (naïve)GCC -O0270 MiB/s
Assembly (fast)GCC -O1430 MiB/s
Assembly (fast)GCC -O2427 MiB/s
Assembly (OpenSSL[0])GCC -O0410 MiB/s

On both CPU architectures, my assembly code is about 1.10× as fast as my C code best compiled by GCC. Moreover, the C code and assembly code compiled with the various options have the same speed on both architectures.

All the benchmark results above are based on: CPU = Intel Core 2 Quad Q6600 2.40 GHz (single-threaded), OS = Ubuntu 10.04 (32-bit and 64-bit), compiler = GCC 4.4.3.

Notes

More info