Huffman coding comparison

Fri May 28 15:01:05 PDT 2010

Now and then it's very useful to compare how to write in D2 a small interesting program written in another language, the comparison can help find problems, possible improvements, performance problems, and so on.

Time ago I have tried to convert to D2 a small Python program that finds the Huffman encoding of a given string, but Phobos2 was not mature enough, and I didn't produce a good enough D2 program. I have tried again and this time things are good enough.

The original Python code comes from the rosettacode site (that contains comparisons of many different programs implemented in many different languages), the point of the Python version is not performance, but to show elegant & compact code (there are ways to write this code shorter in Python but this code must has to be readable, it's not a code golf site).

This the Python 2.x version:

from heapq import heappush, heappop, heapify
from collections import defaultdict

def encode(symb2freq):
    """Huffman encode the given dict mapping symbols to weights"""
    heap = [[float(wt), [sym, ""]] for sym, wt in symb2freq.iteritems()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heappop(heap)[1:], key=lambda p: (len(p[1]), p))

txt = "this is an example for huffman encoding"
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
huff = encode(symb2freq)
print "Symbol\tWeight\tHuffman Code"
for p in huff:
    print "%s\t%s\t%s" % (p[0], symb2freq[p[0]], p[1])

It contains a demo string, and it prints (tabs set to 4 spaces):

Symbol  Weight  Huffman Code
    6   101
n   4   010
a   3   1001
e   3   1100
f   3   1101
h   2   0001
i   3   1110
m   2   0010
o   2   0011
s   2   0111
g   1   00000
l   1   00001
p   1   01100
r   1   01101
t   1   10000
u   1   10001
x   1   11110
c   1   111110
d   1   111111

After few experiments I have produced this working D2 version that uses just Phobos2:

import std.stdio, std.algorithm, std.typecons;

/// Huffman encode the given dict mapping symbols to weights
auto encode(TFreq, TSymb)(TFreq[TSymb] symb2freq) {
    alias Tuple!(TSymb, "symb", string, "code") Pair;
    alias Tuple!(TFreq, "freq", Pair[], "pairs") Block;
    Block[] blocks;
    foreach (symb, freq; symb2freq)
        blocks ~= Block(freq, [Pair(symb, "")]);
    auto heap = BinaryHeap!(Block[], "b < a")(blocks);

    while (heap.length > 1) {
        auto lo = heap.pop(1)[0];
        auto hi = heap.pop(1)[0];
        foreach (ref pair; lo.pairs)
            pair.code = '0' ~ pair.code;
        foreach (ref pair; hi.pairs)
            pair.code = '1' ~ pair.code;
        heap.push(Block(lo.freq + hi.freq, lo.pairs ~ hi.pairs));
    }
    auto r = heap.pop(1)[0].pairs;
    schwartzSort!((x){ return x.code.length; })(r);
    return r;
}

void main() {
    auto txt = "this is an example for huffman encoding";
    int[char] symb2freq;
    foreach (c; txt)
        symb2freq[c]++;
    auto huff = encode(symb2freq);
    writeln("Symbol\tWeight\tHuffman Code");
    foreach (p; huff)
        writefln("%s\t%s\t%s", p.symb, symb2freq[p.symb], p.code);
}

That prints:

Symbol  Weight  Huffman Code
n   4   010
    6   101
a   3   1001
e   3   1100
o   2   0011
i   3   1110
h   2   0001
f   3   1101
s   2   0111
m   2   0010
t   1   10000
g   1   00000
r   1   01101
p   1   01100
l   1   00001
u   1   10001
x   1   11110
d   1   111111
c   1   111110

Some notes (later I can send them to Andrei):

The D2 version is longer, but this size difference can be acceptable. The D2 version looks good enough (this is just a small test program. In a real D2 program I write code in a less compressed way, I add comments, preconditions, unittests, and when it's necessary I try to write more efficient code).

It is not fully equivalent to the Python version, to print the same thing the Schwartz sorting has to use  (x){return tuple(x.code.length, x);}.

For me for one thing the D program is better than the Python version: the Python std lib doesn't define a mutable named tuple like that one in D2 (there is only one immutable and dynamically typed one), this forces the Python code to use all those [0] and [1:]. Anonymous structures sometimes make Python code shorter, but they can be less explicit too. Of course it's easy to define in Python a function like Tuple of Phobos2 that accepts optional field names too.

The "std.typecons" module, that contains Tuple and tuple can be named something simpler to remember as "std.tuples".

I will keep missing array comprehensions in D. In the meantime other languages have got some forms of them (but Python ones use the best syntax still).

In translating the program I have not found significant problems. The only bug I have found was caused by the different ordering of the Python/Phobos2 heaps, so I have had to use "b<a" in D2. This is not a bug, but I have to keep it in mind in future usages of heaps across the two languages. I don't know why the two languages have chosen different orderings here, if someone of you knows I'd like to know it.

Another small problem I have found was caused by BinaryHeap.pop(). I'd like it to pop and return a single item, instead of an array of given length of items, because this is done quite more often than popping many items. If the need to pop many items is important, then a different method can be defined, as npop().

Maybe BinaryHeap can be moved to the collections module.

array(map(...)) is so common that an amap(...) can be considered.

A helper function like this one can be useful:
auto binaryHeap(alias less="a < b", Range)(Range items) {
    return BinaryHeap!(Range, less)(items);
}
Note that the 'less' is before the Range, so in that program you can write:
auto heap = binaryHeap!("b < a")(s2w);
Instead of:
auto heap = BinaryHeap!(Block[], "b < a")(s2w);
And in many cases you just write:
auto heap = binaryHeap(items);

I have seen that this doesn't work, but something like this can be useful (but I am not sure, I don't love strings used this way):
schwartzSort!("a.code.length")(r);

In Python there is both list.sort() and sorted(), the second creates a new list. In my dlibs1 I have similar functions. They allow you to write:
return schwartzSorted!((x){ return x.code.length; })(...items);
Instead of:
auto r = ...items;
schwartzSort!((x){ return x.code.length; })(items);
return items;
So I'd like sorted and schwartzSorted in Phobos2.

While debugging this program I have found that array(iota(0, 10)) is an uint[] instead of being an int[]. In most cases I prefer that syntax to produce an array of true ints. In the uncommon situations when I want an array of uints I can ask it with a syntax like array(iota(0U, 10U)). If this specification is not possible then I prefer iota to always yield signed values, because they are much *safer* in D.

I'd like iota(100) to be the same as iota(0, 100), as in the Python range().

To improve a *lot* my debugging in code like this I'd like this line of code:
  writeln(tuple(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]]));
To print (better):
  tuple(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]])
Or even:
  Tuple!(double[char], string, int[][])(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]])
instead of:
  Tuple!(double[char],string,int[][])([x:1, y:2], hello, 1 2 3 4)

Note that the double 1.0 is printed on default with a leading zero to denote it's a floating point, the string if inside a collection is printed with "" on its sides, the char inside '' (just like the way you have to write them in the D code), the arrays with [] and commas, and associative arrays have a bit more spaces to help readability.

>From various D2 programs I have seen that I usually don't need the full type of the tuple in the printout, they clutter the printing too much. For example if you try to print the 'heap' variable inside the encode() function just before the return you get the following text, this is too much noisy and unreadable (notice a funny detail, strings are correctly printed inside "" if they are tuple field names!):

BinaryHeap!(Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")[],"b < a")(1, Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(39, Tuple!(char,"c",string,"code")(g, 00000) Tuple!(char,"c",string,"code")(l, 00001) Tuple!(char,"c",string,"code")(h, 0001) Tuple!(char,"c",string,"code")(m, 0010) Tuple!(char,"c",string,"code")(o, 0011) Tuple!(char,"c",string,"code")(n, 010) Tuple!(char,"c",string,"code")(p, 01100) Tuple!(char,"c",string,"code")(r, 01101) Tuple!(char,"c",string,"code")(s, 0111) Tuple!(char,"c",string,"code")(t, 10000) Tuple!(char,"c",string,"code")(u, 10001) Tuple!(char,"c",string,"code")(a, 1001) Tuple!(char,"c",string,"code")( , 101) Tuple!(char,"c",string,"code")(e, 1100) Tuple!(char,"c",string,"code")(f, 1101) Tuple!(char,"c",string,"code")(i, 1110) Tuple!(char,"c",string,"code")(x, 11110) Tuple!(char,"c",string,"code")(c, 111110) Tuple!(char,"c",string,"code")(d, 111111)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(16, Tuple!(char,"c",string,"code")(g, 00000) Tuple!(char,"c",string,"code")(l, 00001) Tuple!(char,"c",string,"code")(h, 0001) Tuple!(char,"c",string,"code")(m, 0010) Tuple!(char,"c",string,"code")(o, 0011) Tuple!(char,"c",string,"code")(n, 010) Tuple!(char,"c",string,"code")(p, 01100) Tuple!(char,"c",string,"code")(r, 01101) Tuple!(char,"c",string,"code")(s, 0111)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(11, Tuple!(char,"c",string,"code")(t, 0000) Tuple!(char,"c",string,"code")(u, 0001) Tuple!(char,"c",string,"code")(a, 001) Tuple!(char,"c",string,"code")( , 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(8, Tuple!(char,"c",string,"code")(g, 0000) Tuple!(char,"c",string,"code")(l, 0001) Tuple!(char,"c",string,"code")(h, 001) Tuple!(char,"c",string,"code")(m, 010) Tuple!(char,"c",string,"code")(o, 011)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(6, Tuple!(char,"c",string,"code")(e, 00) Tuple!(char,"c",string,"code")(f, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(5, Tuple!(char,"c",string,"code")(t, 000) Tuple!(char,"c",string,"code")(u, 001) Tuple!(char,"c",string,"code")(a, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(4, Tuple!(char,"c",string,"code")(n, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(4, Tuple!(char,"c",string,"code")(g, 000) Tuple!(char,"c",string,"code")(l, 001) Tuple!(char,"c",string,"code")(h, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(3, Tuple!(char,"c",string,"code")(i, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(3, Tuple!(char,"c",string,"code")(e, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(2, Tuple!(char,"c",string,"code")(t, 00) Tuple!(char,"c",string,"code")(u, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(2, Tuple!(char,"c",string,"code")(p, 00) Tuple!(char,"c",string,"code")(r, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(2, Tuple!(char,"c",string,"code")(m, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(2, Tuple!(char,"c",string,"code")(g, 00) Tuple!(char,"c",string,"code")(l, 01)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(1, Tuple!(char,"c",string,"code")(x, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(1, Tuple!(char,"c",string,"code")(t, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(1, Tuple!(char,"c",string,"code")(p, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(1, Tuple!(char,"c",string,"code")(g, 0)) Tuple!(double,"freq",Tuple!(char,"c",string,"code")[],"pairs")(1, Tuple!(char,"c",string,"code")(c, 0)))

This is what Python prints in the same situation, it's more readable for my eyes (but this is not a fair comparison, in D the structures are Tuples with named fields, and so on):

[[39.0, ['g', '00000'], ['l', '00001'], ['h', '0001'], ['m', '0010'], ['o', '0011'], ['n', '010'], ['p', '01100'], ['r', '01101'], ['s', '0111'], ['t', '10000'], ['u', '10001'], ['a', '1001'], [' ', '101'], ['e', '1100'], ['f', '1101'], ['i', '1110'], ['x', '11110'], ['c', '111110'], ['d', '111111']]]

The purpose of this program was to be short and "elegant". To test the performance I have replaced in both programs the input text with something that contains one hundred thousand different symbols.

Python:
txt = range(10000) + range(10000) + range(100000)

D2:
auto txt = array(iota(0, 10000)) ~ array(iota(0, 10000)) ~ array(iota(0, 100000));

The timings, seconds:
  Python:       2.98
  Python+Psyco: 2.84
  D2:           1.97

In this benchmark most of the running time is spent inside the encode(), the creation of symb2freq takes 0.05-0.16 seconds.

Bye,
bearophile