Huffman coding (Java)
This project is an open source reference implementation of Huffman coding in Java. The code is intended to be used for study, and as a solid basis for modification and extension. As such, it is optimized for clear logic and low complexity, not speed/memory/performance.
Overview
Huffman encoding takes a sequence (stream) of symbols as input and gives a sequence of bits as output. The intent is to produce a short output for the given input. Each input yields a different output, so the process can be reversed, and the output can be decoded to give back the original input.
In this software, a symbol is a non-negative integer. The symbol limit is one plus the highest allowed symbol. For example, a symbol limit of 4 means that the set of allowed symbols is {0, 1, 2, 3}.
- Sample applications
Two pairs of command line programs fully demonstrate how this software package can be used to encode and decode data using Huffman coding. One pair is the classes
HuffmanCompressandHuffmanDecompress, which implements static Huffman coding. The other pair is the classesAdaptiveHuffmanCompressandAdaptiveHuffmanDecompress, which implements adaptive/dynamic Huffman coding.- Encoder/decoder
The classes
HuffmanEncoderandHuffmanDecoderimplement the basic algorithms for encoding and decoding a Huffman-coded stream. The code tree must be set before encoding or decoding. The code tree can be changed after encoding or decoding each symbol, as long as the encoder and decoder have the same code tree at the same time. At any time, the encoder must not attempt to encode a symbol that is not in the code tree.- Huffman code tree
The class
CodeTree, along withNode,InternalNode, andLeaf, represent a Huffman code tree. The leaves represent symbols. The path to a leaf represents the bit string of its Huffman code.- Frequency table
The class
FrequencyTableis a simple integer array wrapper that counts symbol frequencies. It is also responsible for generating a Huffman code tree that is optimal for its current array of frequencies (but not necessarily canonical).- Canonical codes
The class
CanonicalCodeconverts an arbitraryCodeTreeto a canonical code. It can then generate aCodeTreefor the canonical code.- Bitwise I/O streams
The classes
BitInputStreamandBitOutputStreamare I/O streams based on bitwise I/O, analogous to the standard bytewise I/O streams. However, they use an underlying bytewise I/O stream, so they can only work with bitwise streams whose length is a multiple of 8 bits.
Download
Browse the project’s source code is available at GitHub: https://github.com/nayuki/Huffman-Coding
Or download a ZIP of all the files: https://github.com/nayuki/Huffman-Coding/zipball/master
The code is open source under the MIT License. See the readme file for details.
Limitations
All the Huffman-related code only works with alphabets with up to
Integer.MAX_VALUE - 1(i.e. 231 − 2) symbols.FrequencyTablecan only track each symbol’s frequency up toInteger.MAX_VALUE. Trying to increment a symbol’s frequency beyond this limit raises an exception. However, using larger frequencies than 231 should be rare in practice.The code is optimized for understandability, not performance. One consequence is that
CodeTreeuses memory grossly inefficiently to store the code bit string for each symbol. It usesArrayList<Integer>at a cost of at least 4 bytes per represented bit, instead of packing bits into a primitive integral array type such asbyte[].CodeTree,FrequencyTable, andCanonicalCodeexplicitly take a symbol limit as a parameter. This is not strictly required, but the alternative is to use sparse tables, which is more difficult to understand.A couple of methods use recursion to traverse a whole code tree. When using such a method on a deep tree, a
StackOverflowErrormay be thrown.
Suggestions
Here are some suggestions on how to use, modify, or extend this software:
Extract an interface from the bitwise I/O streams, and create other concrete implementation classes.
Improve the speed of Huffman encoding and decoding. One advanced suggestion is to treat the Huffman decoder as a finite state machine (FSM) and decode a whole byte per iteration.
Layer another encoding scheme on top of Huffman coding, such as RLE, LZW, n-gram models, DCT with quantization, etc. This is how most real compressed data formats are structured.