Huffman coding comparison

Mon May 31 05:00:51 PDT 2010

Andrei:

>Thanks for your notes.<

My pleasure, if you like them I will try to do this thing some more times.
There are probably tens or hundreds of small details that can be improved in Phobos. Some of such changes can improve the usage patterns. In past I have put some of them in Bugzilla.
One good way to find such possible improvements is to use Phobos, to write small programs, and keep eyes open.

>Looks like we're having two proposals.<

I am sceptic that this can be done with no compiler/language support to offer the good enough syntax sugar:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=110613

In my dlibs1 (for D1) I have implemented and then later sometimes used an expanded and improved an idea by Henning Hasemann shown in 2007, but this (you are free to use it if you want, code shown in this newsgroup page is free to use, I think):
- is not efficient
- you have to define the iteration variable types before this array comp
- the code that uses this array comp is not so easy to read
- this can't be done (in D1) for lazy comphrensions.

TA[] select(TA, TI, TC)(lazy TA mapper,
                        ref TI iter1, TC items1) {
    static if(is( TC == void[0] )) {
        return null;
    } else {
        auto aux1 = iter1; // save original iteration variable

        static if (HasLength!(TC)) {
            auto result = new TA[items1.length];

            uint i;
            static if (IsAA!(TC)) {
                foreach (k, v; items1) {
                    iter1 = k;
                    result[i] = mapper();
                    i++;
                }
            } else {
                // Then items1 is an iterable with attribute length
                // (an array, xrange, etc)
                // don't use foreach (i,el;items1), it's less general
                foreach (el; items1) {
                    iter1 = el;
                    result[i] = mapper();
                    i++;
                }
            }

            iter1 = aux1; // restore original iteration variable
            return result;
        } else {
            // Then items1 is an iterable object
            // when TA isn't an AA, geometric grow can be used to speed up
            ArrayBuilder!(TA) result;
            foreach (el; items1) {
                iter1 = el;
                result ~= mapper();
            }
            iter1 = aux1; // restore original iteration variable
            return result.toarray;
        }
    }
}

TA[] select(TA, TI, TC, TP)(lazy TA mapper,
                            ref TI iter1, TC items1,
                            lazy TP where) {
...

TA[] select(TA, TI1, TC1, TI2, TC2)(lazy TA mapper,
                                    ref TI1 iter1, TC1 items1,
                                    ref TI2 iter2, lazy TC2 items2) {
...

TA[] select(TA, TI1, TC1, TI2, TC2, TP)(lazy TA mapper,
                                        ref TI1 iter1, TC1 items1,
                                        ref TI2 iter2, lazy TC2 items2,
                                        lazy TP where) {
...

>There exists a pop() function that only pops one element.<

This is how it is implemented:

/**
Pops the largest element (according to the predicate $(D less)).
 */
    void pop()
    {
        enforce(_length);
        if (_length > 1)
        swap(_store.front, _store[_length - 1]);
        --_length;
        percolateDown(_store[0 .. _length]);
    }

I'd like it to also return the popped item, a ElementType!Range, is this possible?
Popping one item out is one of the most common operations I have to perform on an heap.

>>array(map(...)) is so common that an amap(...) can be considered.<<

>I don't know.<

A too much long list of function (that is a too much large API) is bad, but I have found that for the most common higher order functions (map and filter, they are common because array comps aren't present) I like a short cut for the eager version, amap/afilter. But they are not essential, we can survive without them :-)

>I'm not crazy about functions that return large arrays by value. I'd have sorted() return a range (a heap!) that lazily spans the input in sorted order.<

When I need only the few items in sorted order I can use just pop(n), or many pop().
Functional languages return data copies, but they are sometimes lazy (Haskell) or thy try to avoid using arrays and use more functional-friendly data structures that reduce the need for copying lot of data.

sorted() and schwartzSorted() can be handy because they can be used as expressions in a functional style. You can write:

foreach (x; sorted(items)) {...

Instead of:

auto sortedItems = items.dup;
sortedItems.sorted();
foreach (x; sortedItems) {...

If items is long then sorted() is not the top efficiency in both memory and time, but in many situations you don't have many items. Most arrays are short. If you have a 5 items long array and the items are small (like numbers) then using sorted() is not so bad unless it's in the middle of a critical piece of code. And in this case using a standard sort is probably wrong anyway.

So sorted/schwartzSorted are not for every situation, they are more for situations where you prefer handy and short code, and you don't need max performance. You don't have to abuse them, as most other things.

>I've never had a clear view on what the target audience for writeln() is. You seem to want it to output debug strings; I'm not sure that's the best way to purpose it.<

Usages of the printing functions:
- To debug code. For this purpose the text shown has to be human-readable, the writeln has to be as automatic as possible (to reduce time needed to add the printing statements), and the text shown has to be "nice" to show the data types but not too much noisy, otherwise the text can become useless. There are more modern ways to show data structures, even GUI-based, but having a fall-back strategy with a good writeln is good.
- To show output in small script-like programs or medium command line programs. I think this is the same case as the debug code one.
- To print a lot of numbers or simple symbols, for later processing with other programs. In this case printf() is better because it's faster than writeln.
- To print many strings. In this case in D printf()/puts() can be suboptimal or unfit. Some very simple, very fast and not templated function similar to puts() but designed for D strings.
- For (textual) serialization? In this case it's better to use functions more specialized for this purpose, and to avoid the writeln.

So I don't see why it's better for this command:
  writeln(tuple(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]]));

To print:
  Tuple!(double[char],string,int[][])([x:1, y:2], hello, 1 2 3 4)

Instead of something more fitter for humans, that can show the things well, as:
  tuple(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]])
Or:
  Tuple!(double[char], string, int[][])(['x': 1.0, 'y': 2.0], "hello", [[1, 2], [3, 4]])

Bye,
bearophile