Huffman Encoding in Haskell

Today I decided to write a Huffman Encoding program to help me learn Haskell. Structure and Interpretation of Computer Programs has a good section that explains Huffman Encoding Trees.

My Huffman Encoding program has three main parts.

A function makeTree that turns a list of nodes into a Huffman Encoding Tree.
A function decode that takes a Huffman Encoding Tree and a bit string and returns the unencoded text.
A function encode that takes a Huffman Encoding Tree and a piece of text and returns a bit string.

Here is the program.

Let's look at each part.

A Huffman Encoding Tree is represented by the Node data type. A Node is either a Leaf that has a Symbol and a Weight or an Interior Node that has a left Node, a right Node, a list of Symbols in the tree, and a cummulative Weight.

You can construct a Huffman Encoding Tree by passing a list of Leaf Nodes that is sorted by the Weight of each Leaf to makeTree like this.

let tree = makeTree [(Leaf 'A' 1), (Leaf 'B' 1), (Leaf 'C' 2)]

makeTree works by taking the two Nodes with with smallest weights from the list and combining them into a new Node then inserting the new Node into the list in the correct position.

encode works by taking a symbol from the text string and searching for that symbol in the tree. As encode moves down the tree toward the leaves, it appends a bit to the resulting bit string. When encode reaches a Leaf, encode returns to the root of the Huffman Encoding Tree and takes the next symbol from the text string. When encode is out of symbols it returns.

decode works in a similar way. decode takes a bit from the bit string and moves to the left subtree if the bit is 0 and moves to the right subtree otherwise. When decode reaches a Leaf, it pushes the leaf's symbol into the result string, returns to the root of the Huffman Encoding Tree and takes the next bit from the bit string.

Pushing Logic onto your type system

Pattern-matching helps keep the definition of makeTree small. The first argument the definition matches is a list containing only one element. This is the base case. The second argument the definition matches is a list containing at least two elements first and second and the rest of the list. The x:xs syntax is called a cons and represents a list composed of an element x and the rest of the elements in the list xs. Pattern-matching first, second, and rest saves us from having to write code to extract first, second, and rest from the list.

There is no if in this program. It branches using pattern-matching instead. For example, instead of writing if isLeaf tree, Haskell can tell when encode is called on a Leaf if we write this encode codeTree (Leaf s w) (sym:symbols). And instead of having to write code to extract the left and right subtrees from a Node, Haskell will bind the left and right subtrees to l and r respectively if we write this encode codeTree (Interior l r) (sym:symbols).

Pattern-matching helps keep this code dense. Even though I am not used to reading dense code, I like it because almost everything you need to know about what the program is doing is represented locally as opposed to being scattered around the program in different functions.