Is str ~ regex the root of all evil, or the leaf of all good?

Benji Smith dlanguage at benjismith.net
Thu Feb 19 20:09:40 PST 2009


Some of the things I'd like to see in the regex implementation:

All functions accepting a compiled regex object/struct should also 
accept a string version of the pattern (and vice versa). Some 
implementations (Java) only accept the compiled version in some places 
and the string pattern in other places. That's annoying.

Just like with ordinary string-searching functions, you should be able 
to specify a start position (and maybe an end position) for the search. 
Even if the match exists somewhere in the string, it fails if not found 
within the target slice. Something like this:

    auto text = "ABCDEFG";
    auto pattern = regex("[ABCEFG]");

    // returns false, because the char at position 3 does not match
    auto result = match(text, 3);

    // this should be exactly equivalent (but the previous version
    // uses less memory, and ought to work with infinite ranges, whereas
    // the slice version wouldn't make any sense)
    auto equivalent = match(text[3..$]);

I've needed to use this technique in a few cases to implement a simple 
lexical scanner, and it's a godsend, if the regex engine supports it 
(though most don't).

Finally, it'd be extremely cool if the regex compiler automatically 
eliminated redundant nodes from its NFA, converting as much of it as 
possible to a DFA. I did some work on this a few years ago, and it's 
actually remarkably simple to implement using prefix trees.

    // These two expressions produce an identical set of matches,
    // but the first one is functionally an NFA, while the second
    // one is a DFA.
    auto a = regex("(cat|car|cry|dog|door|dry)");
    auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry)");

In cases where the expression can only be partially simplified, you can 
leave some NFA nodes deep within the tree, while still DFA-ifying the 
rest of it:

    auto a = regex("(attitude|attribute|att.+ion");
    auto b = regex("(att(?:itude|ribute|.+ion)");

It's a very simple transformation, increases speed (dramatically) for 
complex regular expressions (especially those produced dynamically at 
runtime by combining large sets of unrelated target expressions), and it 
reliably produces equivalent results with the inefficient version.

The only really tricky part is if the subexpressions have their own 
capturing groups, in which case the DFA transformation screws up the 
ordinal-numbering of the resultant captures.

Anyhoo...

I don't have any strong feelings about the function names (though I'd 
rather have functions that operators, like "~", for searching and matching).

And I don't have any strong feelings about whether the compiled regex is 
an object or a struct (though I prefer reference semantics over value 
semantics for regexen, and right now, I think that makes objects the 
(slightly) better choice).

Thanks for your hard work! I've implemented a small regex engine before, 
so I know it's no small chunk of effort. Regular expressions are my 
personal favorite "tiny language", and I'm glad to see them get some 
special attention in phobos2.

--benji



More information about the Digitalmars-d mailing list