[Issue 7515] New: The new std.string.translate is slow for ASCII text

d-bugmail at puremagic.com d-bugmail at puremagic.com
Wed Feb 15 19:23:54 PST 2012


http://d.puremagic.com/issues/show_bug.cgi?id=7515

           Summary: The new std.string.translate is slow for ASCII text
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody at puremagic.com
        ReportedBy: bearophile_hugs at eml.cc


--- Comment #0 from bearophile_hugs at eml.cc 2012-02-15 19:23:51 PST ---
std.string.maketrans() is going to be deprecated. This is a demo program, it's
meant to work on a good amount of pure 7-bit ASCII text.

The conversion table is the same as the one in this page:
http://digitalmars.com/d/1.0/lisp-java-d.html


import std.stdio, std.string;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    auto tab =
maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                        
"0111222333445566677788899901112223334455666777888999");
    auto text2 = text.translate(tab, "\"");
    writeln(text2);
}


Using the new std.string.maketrans the code becomes something like:


import std.stdio, std.string, std.range;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    auto text3 = text.translate(aa, "\"");
    writeln(text3);
}


The code with the associative array is _uglier_, _longer_ to write and read,
and every char in the original string requires one or two associative array
lookups. This is quite inefficient for sizeable ASCII strings.



Jonathan M Davis has asked me a benchmark:

> Do you have data to backup that there is a significant speed difference?

So I have written the following small benchmark, it contains both the URL to
the test data (bible) and the timings I am seeing on a slow PC. If your PC is
faster feel free to use it on more data (creating a larger input file on disk,
concatenating many copies of the same test data). If you see bugs in the
benchmark please tell me.

Notes:
- version3 uses the old translate;
- version4 uses the new translate;
- version1InPlace is a reference point, it works in-place;
- version2InPlace is similar, uses a linear scan for the chars to remove, and
it seems a bit slower than version1InPlace (probably because the small deltab
array fits well in the L1 cache).

If both the benchmark and the timings are correct, the old translate seems
almost 5 times faster than the new one (and this is not strange, the new
translate performs two associative array lookups for each char of the input
string).


import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    bool[256] deltab = false;
    foreach (char c; delchars)
        deltab[c] = true;

    int count;
    foreach (char c; text)
        if (!deltab[c]) {
            text[count] = transtab[c];
            count++;
        }

    return text[0 .. count];
}


char[] version2InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    int count;
    outer: foreach (char c; text) {
        foreach (char d; delchars)
            if (c == d)
                continue outer;
        text[count] = transtab[c];
        count++;
    }

    return text[0 .. count];
}


string version3(char[] text) {
    auto tab =
maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                        
"0111222333445566677788899901112223334455666777888999");
    return text.translate(tab, "\"");
}


string version4(string text) {
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    return text.translate(aa, "\"");
}


void main() {
    // http://patriot.net/~bmcgin/kjv12.txt
    auto txt = cast(char[])read("kjv12.txt");

    assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
    assert(version1InPlace(txt.dup) == version3(txt.dup));
    assert(version1InPlace(txt.dup) == version4(txt.idup));

    static if (1) version1InPlace(txt); // 0.16 seconds
    static if (0) version2InPlace(txt); // 0.18 seconds
    static if (0) version3(txt); // 0.20 seconds
    static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}



Jonathan M Davis:
> We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's
ASCII, it's not Unicode. Like biological data, text data coming out of
instruments, etc. Handling it as binary data is not good.


A possibility is to keep both the old maketrans and translate functions, and
un-deprecate them (maybe moving them in std.ascii?). Other ideas are welcome.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------


More information about the Digitalmars-d-bugs mailing list