Andrei's Google Talk

bearophile bearophileHUGS at lycos.com
Tue Aug 3 16:24:51 PDT 2010


Andrei Alexandrescu:

> > 57.36: The Wikipedia Levenshtein entry can be improved a bit, to
> > remove the random access requirement, if not already done :-) Often
> > in practice you have to pay a certain performance price for
> > genericity. So an interesting question is, with the current DMD
> > compiler how much slower than an array-based Levenshtein function is
> > the generic Phobos one when the input is for example a couple of
> > ubyte arrays? If it's only about two or three times slower then it's
> > probably good enough.
> 
> The implementation using only forward access is not slower in theory 
> because it never needs to search to an index. It is a bit slower on 
> ASCII strings because it needs one extra test per character to assess 
> its width. But on UTF strings it saves time by avoiding copying and by 
> working correctly :o).

I have done three benchmarks, because I am curious. This is compiled with dmd 2.047 with -O -release -inline, the strings used are longhish, about 1200 chars long. To test what you have said I have cast the strings to immutable(ubyte)[] before calling levenshteinDistance, this can be done when you are sure the strings are 7 bit ASCII, and indeed the running times are different, about 1.66 seconds against 2.57 seconds on a slow 32 bit Windows PC:


import std.algorithm: levenshteinDistance;
import std.c.stdio: printf;

void main() {
    string s1 = "Returns the Levenshtein distance between s and t." ~
                "The Levenshtein distance computes the minimal amount of edit operations" ~
                "necessary to transform s into t.";
    string s2 = "Returns the Lovenshtein distonce between p and q." ~
                "The Lovenshtein distonce computes the minimal amount of odit operations" ~
                "necessary to transform p into q.";
    for (int i; i < 3; i++) {
        s1 ~= s1;
        s2 ~= s2;
    }

    int d;
    for (int i; i < 40; i++) {
        static if (1) {
            // 2.57 seconds
            d += levenshteinDistance(s1, s2);
        } else {
            // 1.66 seconds
            alias immutable(ubyte)[] ImmutableubytesArray;
            d += levenshteinDistance(cast(ImmutableubytesArray)s1, cast(ImmutableubytesArray)s2);
        }
    }

    printf("%d\n", d);
}


Then I have tried the same strings with the Levenshtein distance functions of the dlibs1, with dmd v1.042 (I have not yet ported the dlibs1 to D2, and probably there will be no need to port lot of the stuff, because Phobos2 contains more stuff) with -O -release -inline, using the same strings, the running time is about 0.25 seconds:


import std.c.stdio: printf;
import d.func: editDistanceFast;

void main() {
    string s1 = "Returns the Levenshtein distance between s and t." ~
                "The Levenshtein distance computes the minimal amount of edit operations" ~
                "necessary to transform s into t.";
    string s2 = "Returns the Lovenshtein distonce between p and q." ~
                "The Lovenshtein distonce computes the minimal amount of odit operations" ~
                "necessary to transform p into q.";
    for (int i; i < 3; i++) {
        s1 ~= s1;
        s2 ~= s2;
    }

    int d;
    for (int i; i < 40; i++) {
        // 0.25 seconds
        d += editDistanceFast(s1, s2);
    }

    printf("%d\n", d);
}


Unfortunately my editDistanceFast can't be used in Phobos2 for licence issues, because this pure-D1 (no asm used) routine is an improvement of C code written by another person that is not compatible with the Boost licence :-(

Bye,
bearophile


More information about the Digitalmars-d mailing list