Index: > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Business Industries Finance Tax

Home > Huffman coding


First Prev [ 1 2 3 ] Next Last

In computer science, Huffman coding is an entropy encoding algorithm used for data compression that finds the optimal system of encoding strings based on the relative frequency of each character. It was developed by David A. Huffman as a PhD student at MIT in 1952, and published in A Method for the Construction of Minimum-Redundancy Codes.

Huffman coding uses a specific method for choosing the representations for each symbol, resulting in a prefix-free code (that is, no bit string of any symbol is a prefix of the bit string of any other symbol) that expresses the most common characters in the shortest way possible. It has been proven that Huffman coding is the most effective compression method of this type: no other mapping of source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code. (Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code was not produced by Huffman's algorithm.)

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding.

1 History

In 1951, David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree, and quickly proved this method the most efficient.

In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of Shannon-Fano coding by building the tree from the bottom up instead of from the top down.

2 Problem definition

2.1 Informal description

Given. A set of symbols and their costs..
Find. A prefix binary character code (a sets of codewords) with minimum weighted path length..
Note-1. A code wherein each character is represented by a unique binary string (codeword) is called a binary character code..
Note-2. A prefix code is a code having the property that no codeword is a prefix of any other codeword.
Note-3. In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.

2.2 Formalized description

Input.
Alphabet A = {a[1], a[2], ..., a[n]} which is the symbol alphabet of size n.
Set C = {c[1], c[2], ..., c[n]} which is the set of the symbol costs, i.e., c[i] = cost (a[i]), 1 <= i <= n.

Output.
Code H(A,C) = {h[1], h[2], ..., h[n]} which is the set of (binary) codewords, where h[i] is codeword of a[i], 1 <= i <= n.

Goal.
Let S(H) = sum (c[i] * length (h[i])) (1 <= i <= n) be weighted path length of code H. Must be: S(H) <= S(T) for any code T(A,C).

2.3 Samples

2.3.1 Sample-1

Input.
Symbol alphabet A = {a, b, c, d, e, f, g, h, i },
Set of costs C = {10, 15, 5, 15, 20, 5, 15, 30, 5 }.

Output.
Code H(A,C) = {001, 010, 00000, 011, 101, 00001, 100, 11, 0001}.
Sum S(H) = 10*3 + 15*3 + 5*5 + 15*3 + 20*3 + 5*4 + 15*3 + 30*2 + 5*4 = 355.

2.3.2 Sample-2

Input.
Symbol alphabet A = {a, b, c, d, e, f, g, h, i},
Set of costs C = {3, 21, 2, 1, 8, 34, 1, 13, 5}.


Output.
Code H(A,C) = {000001, 01, 0000001, 00000000, 0001, 1, 00000001, 001, 00001}.
Sum S(H) = 3*6 + 21*2 + 2*7 + 1*8 + 8*4 + 34*1 + 1*8 + 13*3 + 5*5 = 220.

3 Basic technique

The technique works by creating a binary tree of nodes. These can be stored in a regular arrayIn computer programming, an array also known as a vector or list, is one of the simplest data structures. Arrays hold a fixed number of equally-sized data elements, generally of the same data type. Individual elements are accessed by index using a consecu, the size of which depends on the number of symbols(N). A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has N leaf nodes and N−1 internal nodes.

A fast way to create a Huffman tree is to use the heapThis page discusses the heap data structure. For the place where memory is dynamically allocated from, see dynamic memory allocation. In computer science a heap is a specialized tree-based data structure. Its base datatype (used for node keys) must be an data structure, which keeps the nodes in partially sorted order according to a predetermined criterion. In this case, the node with the lowest weight is always kept at the root of the heap for easy access.

Creating the tree:

  1. Start with as many leaves as there are symbols.
  2. Push all leaf nodes into the heap.
  3. While there is more than one node in the heap:
    1. Remove two nodes with the lowest weight from the heap.
    2. Put the two nodes into the tree, noting their location.
    3. If parent links are used, set the children of any internal nodes to point at their parents.
    4. Create a new internal node, using the two nodes as children and their combined weight as the weight.
    5. Push the new node into the heap.
  4. The remaining node is the root node.

Note: it may be beneficial to generate codes with minimum length variance, since in the worst case when several long codewords have to be transmitted in a row it may lead to unwanted side effects(for example if we use a buffer and constant rate transmitter, it increases the mimimum buffer size). To reduce variance every newly generated node must be favored among same weight nodes and placed as high as possible. This will balance the length, keeping same average rate.





Non User