RC4 cipher in x86 assembly
The core of the RC4 stream cipher is a very small amount of code, so I decided to implement it in x86 assembly language for fun to see how fast I could make it go.
Excluding comments and blank lines, my x86 code is 40 lines long, and the main encryption loop consists of only 14 instructions. Interestingly, there are just enough registers on x86 to hold all the relevant values for this algorithm, so no register spills are needed.
Source code
This code offers a reusable function that performs RC4 encryption and also a demo main() function that runs a self-test and a speed benchmark.
To use this code, compile it on Linux with this command:
gcc rc4test.c rc4.s -o rc4test
Then run the executable with ./rc4test.
Licensing: This code is copyrighted and is not open source. Please contact me if you wish to use or copy the code.
Benchmark results
A quick, informal benchmark on Intel Core 2 Quad Q6600 2.40 GHz (using a single core), Ubuntu 10.04, GCC 4.4.3 gives these numbers:
- C code, GCC -O0: 62.7 MiB/s
- C code, GCC -O1: 171 MiB/s
- C code, GCC -O2: 174 MiB/s
- C code, GCC -O3: 219 MiB/s
- C code, GCC -O1 -fomit-frame-pointer: 187 MiB/s
- C code, GCC -O2 -fomit-frame-pointer: 191 MiB/s
- C code, GCC -O3 -fomit-frame-pointer: 219 MiB/s
- x86 code, GCC -O1: 318 MiB/s
Therefore, we see that my x86 code is 45% faster than my C code. Manually writing and optimizing assembly code seems to pay off significantly in this case.
Notes
Since RC4 is a stream cipher, encryption and decryption are the same operation. Hence I am not providing an explicit decrypt() function.
Using 8-bit instructions in 32-bit mode has no intrinsic penalty, but the important thing to watch out for is partial register accesses. For example, if you update AL and then read EAX, the read will sometimes incur a delay that significantly hurts performance. In the code, the two usages of
movbzlfrom the lower 8 bits of a register to the whole 32-bit register itself are functionally equivalent toandl $0xFF, but the latter is much slower because of partial register access.