std.regex performance

Wed Feb 8 18:43:15 PST 2012

On Wed, 08 Feb 2012 22:44:25 +0100, Jesse Phillips  
<jessekphillips+D at gmail.com> wrote:

> I've finally moved to the new regex for some real code. I'm seeing a  
> major change in performance when checking if a large number of words  
> contain a digit.
>
> The english.dic file contains 134,950 entries.
>
> With
> 2.056: 0.22sec
> 2.058: 7.65sec
>
> I don't expect a correction for this would make it in 2.058 as it is  
> likely an issue in 2.057.
>
> --------
> import std.file;
> import std.string;
> import std.datetime;
> import std.regex;
>
> private int[string] model;
>
> void main() {
>    auto name = "english.dic";
>    foreach(w; std.file.readText(name).toLower.splitLines)
>       model[w] += 1;
>
>    foreach(w; std.string.split(readText(name)))
>       if(!match(w, regex(r"\d")).empty)
>       {}
> }
>

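There are some more performance issues in that code: the regex is
rebuilt for every word, and the file is read and split twice.
D has a nice built-in profiler for finding such issues (compile with
dmd -profile, run the program, and look at the generated trace.log).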

----------
import std.algorithm, std.stdio, std.string, std.path, std.regex;

private int[string] model;

int main(string[] args)
{
     if (args.length != 2)
     {
         std.stdio.stderr.writefln("usage: %s <file>",
             std.path.baseName(args[0]));
         return 1;
     }

     auto re = std.regex.regex(r"\d");
     foreach(line; std.stdio.File(args[1], "r").byLine())
     {
         // Bug 6791: splitter is UTF-8 unsafe
         foreach(w; std.algorithm.splitter(line))
         {
             if(!std.regex.match(w, re).empty)
             {
             }
         }

         std.string.toLowerInPlace(line);
         model[line.idup] += 1;
     }

     return 0;
}
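
To check how much hoisting the regex out of the loop matters on a given
compiler, a rough timing sketch like the one below may help. It assumes a
recent Phobos that ships std.datetime.stopwatch (older releases kept
StopWatch in std.datetime). The function names countRebuilding and
countHoisted and the generated word list are made up for illustration and
only stand in for english.dic. No particular timings are implied; newer
std.regex versions may cache compiled patterns, which would narrow the gap.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex : match, regex;
import std.stdio : writefln;

// Hypothetical helper: rebuilds the pattern for every word,
// as the quoted program does.
size_t countRebuilding(string[] words)
{
    size_t hits;
    foreach (w; words)
        if (!match(w, regex(r"\d")).empty)
            ++hits;
    return hits;
}

// Hypothetical helper: compiles the pattern once and reuses it.
size_t countHoisted(string[] words)
{
    auto re = regex(r"\d");
    size_t hits;
    foreach (w; words)
        if (!match(w, re).empty)
            ++hits;
    return hits;
}

void main()
{
    // Made-up sample data: every tenth word contains a digit.
    string[] words;
    foreach (i; 0 .. 20_000)
        words ~= (i % 10 == 0) ? "word1" : "word";

    auto sw = StopWatch(AutoStart.yes);
    auto a = countRebuilding(words);
    auto rebuilding = sw.peek();

    sw.reset(); // resets the elapsed time; the stopwatch keeps running
    auto b = countHoisted(words);
    auto hoisted = sw.peek();

    // Both variants must agree on the number of matching words.
    assert(a == b);
    writefln("rebuilding: %s, hoisted: %s", rebuilding, hoisted);
}
```

Building the same file with dmd -profile should show where the time goes
in each variant.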