RC4 cipher in x86 assembly

The core of the RC4 stream cipher takes very little code, so for fun I implemented it in x86 assembly language to see how fast I could make it go.

Excluding comments and blank lines, my x86 code is 39 lines long, and the main encryption loop consists of only 13 instructions. Interestingly, for this algorithm there are just enough registers on x86 to hold all the relevant values – but I did have to split the 32-bit register ECX into the two 8-bit registers CL and CH, which is somewhat unorthodox in 32-bit programming.
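To make the register pressure concrete, here is a minimal C sketch of the standard RC4 algorithm. It is not the assembly code from this page, and the names Rc4State, rc4_init and rc4_encrypt are hypothetical, chosen only for illustration. Inside the loop, the live values are the state pointer, the indices i and j, the two swapped bytes, the message pointer and the loop counter, which is roughly the set that has to fit into the x86 registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout for this sketch: two indices plus the
   256-byte permutation. */
typedef struct {
    uint8_t i, j;
    uint8_t s[256];
} Rc4State;

/* Key-scheduling algorithm: start from the identity permutation and
   shuffle it under the key. key_len must be between 1 and 256. */
void rc4_init(Rc4State *st, const uint8_t *key, size_t key_len) {
    for (int k = 0; k < 256; k++)
        st->s[k] = (uint8_t)k;
    uint8_t j = 0;
    for (int k = 0; k < 256; k++) {
        j = (uint8_t)(j + st->s[k] + key[k % key_len]);
        uint8_t tmp = st->s[k];
        st->s[k] = st->s[j];
        st->s[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* Main loop: XOR len bytes of msg in place with the keystream.
   Decryption is the same operation. */
void rc4_encrypt(Rc4State *st, uint8_t *msg, size_t len) {
    uint8_t i = st->i, j = st->j;
    uint8_t *s = st->s;
    for (size_t n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);       /* wraps modulo 256 */
        uint8_t a = s[i];
        j = (uint8_t)(j + a);
        uint8_t b = s[j];
        s[i] = b;                   /* swap s[i] and s[j] */
        s[j] = a;
        msg[n] ^= s[(uint8_t)(a + b)];
    }
    st->i = i;                      /* save indices so calls can be chained */
    st->j = j;
}
```

A caller would run rc4_init once per key and then call rc4_encrypt on each buffer in turn. Because the indices are stored back into the state, encrypting a message in several pieces gives the same result as encrypting it in one call.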

Source code

This code offers a reusable function that performs RC4 encryption, and also a demo main() function that runs a sanity check and speed test.

Two x86 assembly language implementations of the RC4 encryption function are provided. One is byte-oriented and easier for a human reader to follow. The other uses benchmark-guided modifications to make the code faster, by using 32-bit integer processing and adding extra instructions in certain places.
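As a rough C analogue of the difference between the two versions, the sketch below keeps the indices and temporaries in 32-bit integers and masks them to 8 bits explicitly, rather than relying on byte-sized variables. This is only my assumption about what "32-bit integer processing" could look like when expressed in C; it does not reproduce the actual assembly, and the function name rc4_encrypt_wide and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-oriented variant of the RC4 main loop. s is an
   already-initialized 256-byte permutation; *pi and *pj hold the two
   indices and are updated on return. */
void rc4_encrypt_wide(uint8_t s[256], uint32_t *pi, uint32_t *pj,
                      uint8_t *msg, size_t len) {
    uint32_t i = *pi, j = *pj;
    for (size_t n = 0; n < len; n++) {
        i = (i + 1) & 0xFF;         /* explicit reduction modulo 256 */
        uint32_t a = s[i];          /* byte widened to 32 bits on load */
        j = (j + a) & 0xFF;
        uint32_t b = s[j];
        s[i] = (uint8_t)b;          /* swap s[i] and s[j] */
        s[j] = (uint8_t)a;
        msg[n] ^= s[(a + b) & 0xFF];
    }
    *pi = i;
    *pj = j;
}
```

The masking operations are extra work compared with the byte-oriented form, which loosely mirrors the idea of adding instructions in certain places. Whether this form is faster depends on the compiler and CPU, so it would need to be benchmarked rather than assumed.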

To use this code, compile it on Linux with one of these commands:

Then run the executable with ./rc4test.

Benchmark results

A quick, informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:

Therefore, we see that my simple x86 code is slightly slower than the optimized C code, but my fast x86 code is 1.5× as fast as the C code. Manually writing and optimizing assembly code does seem to pay off in this case.

More info