Today I decided to write a Huffman Encoding program to help me learn Haskell. Structure and Interpretation of Computer Programs has a good section that explains Huffman Encoding Trees.
My Huffman Encoding program has three main parts.
makeTree
that turns a list of nodes into a Huffman Encoding Tree.decode
that takes a Huffman Encoding Tree and a bit string and returns the unencoded text.encode
that takes a Huffman Encoding Tree and a piece of text and returns a bit string.Here is the program.
Let's look at each part.
A Huffman Encoding Tree is represented by the Node data type. A Node is either a Leaf that has a Symbol and a Weight or an Interior Node that has a left Node, a right Node, a list of Symbols in the tree, and a cummulative Weight.
You can construct a Huffman Encoding Tree by passing a list of Leaf Nodes that is sorted by the Weight of each Leaf to makeTree
like this.
let tree = makeTree [(Leaf 'A' 1), (Leaf 'B' 1), (Leaf 'C' 2)]
makeTree
works by taking the two Nodes with with smallest weights from the list and combining them into a new Node then inserting the new Node into the list in the correct position.
encode
works by taking a symbol from the text string and searching for that symbol in the tree. As encode
moves down the tree toward the leaves, it appends a bit to the resulting bit string. When encode
reaches a Leaf, encode
returns to the root of the Huffman Encoding Tree and takes the next symbol from the text string. When encode
is out of symbols it returns.
decode
works in a similar way. decode
takes a bit from the bit string and moves to the left subtree if the bit is 0 and moves to the right subtree otherwise. When decode
reaches a Leaf, it pushes the leaf's symbol into the result string, returns to the root of the Huffman Encoding Tree and takes the next bit from the bit string.
Pattern-matching helps keep the definition of makeTree
small. The first argument the definition matches is a list containing only one element. This is the base case. The second argument the definition matches is a list containing at least two elements first
and second
and the rest
of the list. The x:xs
syntax is called a cons and represents a list composed of an element x
and the rest of the elements in the list xs
. Pattern-matching first
, second
, and rest
saves us from having to write code to extract first
, second
, and rest
from the list.
There is no if
in this program. It branches using pattern-matching instead. For example, instead of writing if isLeaf tree
, Haskell can tell when encode
is called on a Leaf if we write this encode codeTree (Leaf s w) (sym:symbols)
. And instead of having to write code to extract the left and right subtrees from a Node, Haskell will bind the left and right subtrees to l and r respectively if we write this encode codeTree (Interior l r) (sym:symbols)
.
Pattern-matching helps keep this code dense. Even though I am not used to reading dense code, I like it because almost everything you need to know about what the program is doing is represented locally as opposed to being scattered around the program in different functions.