Today I decided to write a Huffman Encoding program to help me learn Haskell. Structure and Interpretation of Computer Programs has a good section that explains Huffman Encoding Trees.
My Huffman Encoding program has three main parts.
makeTreethat turns a list of nodes into a Huffman Encoding Tree.
decodethat takes a Huffman Encoding Tree and a bit string and returns the unencoded text.
encodethat takes a Huffman Encoding Tree and a piece of text and returns a bit string.
Here is the program.
Let's look at each part.
A Huffman Encoding Tree is represented by the Node data type. A Node is either a Leaf that has a Symbol and a Weight or an Interior Node that has a left Node, a right Node, a list of Symbols in the tree, and a cummulative Weight.
You can construct a Huffman Encoding Tree by passing a list of Leaf Nodes that is sorted by the Weight of each Leaf to
makeTree like this.
let tree = makeTree [(Leaf 'A' 1), (Leaf 'B' 1), (Leaf 'C' 2)]
makeTree works by taking the two Nodes with with smallest weights from the list and combining them into a new Node then inserting the new Node into the list in the correct position.
encode works by taking a symbol from the text string and searching for that symbol in the tree. As
encode moves down the tree toward the leaves, it appends a bit to the resulting bit string. When
encode reaches a Leaf,
encode returns to the root of the Huffman Encoding Tree and takes the next symbol from the text string. When
encode is out of symbols it returns.
decode works in a similar way.
decode takes a bit from the bit string and moves to the left subtree if the bit is 0 and moves to the right subtree otherwise. When
decode reaches a Leaf, it pushes the leaf's symbol into the result string, returns to the root of the Huffman Encoding Tree and takes the next bit from the bit string.
Pattern-matching helps keep the definition of
makeTree small. The first argument the definition matches is a list containing only one element. This is the base case. The second argument the definition matches is a list containing at least two elements
second and the
rest of the list. The
x:xs syntax is called a cons and represents a list composed of an element
x and the rest of the elements in the list
rest saves us from having to write code to extract
rest from the list.
There is no
if in this program. It branches using pattern-matching instead. For example, instead of writing
if isLeaf tree, Haskell can tell when
encode is called on a Leaf if we write this
encode codeTree (Leaf s w) (sym:symbols). And instead of having to write code to extract the left and right subtrees from a Node, Haskell will bind the left and right subtrees to l and r respectively if we write this
encode codeTree (Interior l r) (sym:symbols).
Pattern-matching helps keep this code dense. Even though I am not used to reading dense code, I like it because almost everything you need to know about what the program is doing is represented locally as opposed to being scattered around the program in different functions.