maketrans and translate

Mon Feb 13 11:08:16 PST 2012

Jonathan M Davis:

> Do you have data to backup that there is a significant speed difference?

I have just written a small benchmark, it contains both the URL to the test data and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me.

version3 uses the old translate. version4 uses the new translate. version1InPlace is a reference point, it works in-place. version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache).

If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string).

import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    bool[256] deltab = false;
    foreach (char c; delchars)
        deltab[c] = true;

    int count;
    foreach (char c; text)
        if (!deltab[c]) {
            text[count] = transtab[c];
            count++;
        }

    return text[0 .. count];
}

char[] version2InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    int count;
    outer: foreach (char c; text) {
        foreach (char d; delchars)
            if (c == d)
                continue outer;
        text[count] = transtab[c];
        count++;
    }

    return text[0 .. count];
}

string version3(char[] text) {
    auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                         "0111222333445566677788899901112223334455666777888999");
    return text.translate(tab, "\"");
}

string version4(string text) {
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    return text.translate(aa, "\"");
}

void main() {
    // http://patriot.net/~bmcgin/kjv12.txt
    auto txt = cast(char[])read("kjv12.txt");

    assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
    assert(version1InPlace(txt.dup) == version3(txt.dup));
    assert(version1InPlace(txt.dup) == version4(txt.idup));

    static if (1) version1InPlace(txt); // 0.16 seconds
    static if (0) version2InPlace(txt); // 0.18 seconds
    static if (0) version3(txt); // 0.20 seconds
    static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}

> We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good.

Bye,
bearophile