Anagrams comparison

Thu Jun 3 16:01:55 PDT 2010

A good way to find parts of D2/Phobos2 that can be improved is to use them, so now and then I'll show a small program and the problems I've found. Small programs are enough for now, they are able to show enough troubles already. Larger programs are for later when the most common bugs are fixed.

I show a comparison with Python again, because I know it well enough, but a similar comparison can be done with Ruby, Clojure, etc.

This time I have used this simple RosettaCode problem:
http://rosettacode.org/wiki/Anagrams

>Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, find the sets of words that share the same characters that contain the most words in them. <

----------------------

This is a procedural version for D1 that works:

import std.stdio: writefln;
import std.stream: BufferedFile;
import std.string: chomp;

void main() {
    string[][string] sign2anags;

    foreach (string piece; new BufferedFile("unixdict.txt")) {
        string word = piece.chomp().dup;
        string key = word.dup.sort;
        sign2anags[key] ~= word;
    }

    int lmax;
    foreach (anags; sign2anags) {
        int len = anags.length;
        lmax = lmax < len ? len : lmax;
    }

    foreach (anags; sign2anags)
        if (anags.length == lmax)
            writefln(anags);
}

Its correct output:
[abel,able,bale,bela,elba]
[evil,levi,live,veil,vile]
[elan,lane,lean,lena,neal]
[caret,carte,cater,crate,trace]
[angel,angle,galen,glean,lange]
[alger,glare,lager,large,regal]

Comments on this D1 version:
- The BufferedFile is read lazily, by line, this is good.
- I have had to add "string" as type of "piece" inside the foreach because BufferedFile doesn't yield strings, I have not understood why BufferedFile is designed like this.
- The piece string inside the foreach body is not a new string each iteration. This in theory can improve performance, but in practice similar code written in Python is faster, despite all strings read in Python are newly allocated. BufferedFile is not optimized for performance. And the optimization used by BufferedFile forces the usage of dup. You must not forget to dup those strings or the program will not work. This makes code even slower. And finding the right spots where to add those dups is not immediate, in practice it's a try-and-error process. This can be awful.
- The append to the string array of the values of sign2anags was slow in D1.
- The std.string.chomp() function of std.string is easy to confuse with std.string.chop().
- The creation of the sorted key is nice, you just sort the string chars in place.
- The final printing is not very nice, but it's readable enough.
- Overall this program is not too hard to understand and read, the most bug-prone part is adding those dups. So this program is acceptable, but I'd like a file line iterator similar to BufferedFile that yields strings that are already duplicated and despite that is overall faster than BufferedFile.
- I don't know how this program will work if the input dictionary of words is unicode and contains not ASCII chars like those requires to write French words.

----------------------

And now a Python version. There are many ways to write that program in Python, here I've used the RosettaCode version, that is more functional-style. This isn't the faster possible version, but this is a small problem, so max performance is not so important. It's more important to have an easy to write and read program that is correct.

import urllib
from collections import defaultdict

words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()
anagram = defaultdict(list) # map sorted chars to anagrams

for word in words:
    anagram[tuple(sorted(word))].append(word)

count = max(len(ana) for ana in anagram.itervalues())

for ana in anagram.itervalues():
    if len(ana) >= count:
        print ana

The output:
['angel', 'angle', 'galen', 'glean', 'lange']
['alger', 'glare', 'lager', 'large', 'regal']
['caret', 'carte', 'cater', 'crate', 'trace']
['evil', 'levi', 'live', 'veil', 'vile']
['elan', 'lane', 'lean', 'lena', 'neal']
['abel', 'able', 'bale', 'bela', 'elba']

Some comments:
- This program doesn't read the dictionary of words from disk, it fetches it from internet. A similar easy to use lib can be added to Phobos2 too.
- This Python2.x program will not work well with a nonASCII input dictionary. Python3 can work better but you can't print accented chars that naively in the Windows shell.

----------------------

I have then tried to write a more functional-style D2 version that uses the algorithms and ranges of Phobos2, but this program fails in many different ways:

import std.stdio, std.stream, std.string, std.algorithm, std.range;

void main() {
    string[][string] sign2anags;
    foreach (string word; new BufferedFile("unixdict.txt"))
        sign2anags[word.chomp().dup.sort] ~= word.chomp();
    int maxCount = max(map!(walkLength)(sign2anags.byValue()));
    auto result = filter!((anags){ anags.length == maxCount; })(sign2anags.byValue());
    writeln(array(xresult));
}

Some note:
- map is probably currently buggy.
- BufferedFile doesn't seem to yield D2 strings.
- The need to find the max of a lazy sequence is natural and very common, but I am not sure it's currently supported.
- I have used walkLength to find the length of the string arrays.
- I am not sure that filter works, even if generally filter of Phobos looks less buggy than map.
- The printing is not improved compared to the D1 version, it's worse.
- Overall not good.

----------------------

To show that it's not impossible to write a nice functional-style version in D I have used my dlibs1, this is D1 code:

import std.string: chomp;
import d.string: putr;
import d.xio: xfile;
import d.func: maxs, len, array;

void main() {
    string[][string] sign2anags;
    foreach (word; xfile("unixdict.txt"))
        sign2anags[word.chomp().dup.sort] ~= word.chomp();
    auto result = maxs(sign2anags.values, &len!(string[]));
    putr(result);
}

Its correct printed output:

[["abel", "able", "bale", "bela", "elba"], ["evil", "levi", "live", "veil", "vile"], ["elan", "lane", "lean", "lena", "neal"], ["caret", "carte", "cater", "crate", "trace"], ["angel", "angle", "galen", "glean", "lange"], ["alger", "glare", "lager", "large", "regal"]]

Notes:
- The program is not as good as the Python one, but to me it looks acceptable.
- putr prints with a newline.
- xfiles is faster than BufferedFile and yields strings. But as in the D1 version such strings are shared, so you have to dup them. A version that doesn't need a dup can be added...
- max accepts 2, 3, 4 items, or a generic iterable of strings.
- By default in dlibs1 associative arrays are iterated on their keys first because this is more useful, so I have used the values there. This can be slow. If you prefer lazyness you can import d.func.xvalues, and then you can give sign2anags.xvalues() to maxs.
- There are functions max, min, mins, maxs, maxPos and minPos. All such functions accept a second optional argument that is a mapping "key", similar to the key of schwartzSort of std.algorithm. maxs finds all the max items.
- You can say that this program is short because I have a maxs() in my dlibs1 that is casually fit for this program/problem. But the truth is different, there is nothing random in that, maxs is present because it does a very common operation, so it is able to shorten many other programs different from this one. It's a very common "micropattern".
- The len is a templated function similar to walkLength.

Overall I think this time Phobos2 was not good enough (but of course it can be improved and fixed, no changes to the D2 language are necessary here).

Bye,
bearophile