Is str ~ regex the root of all evil, or the leaf of all good?
Benji Smith
dlanguage at benjismith.net
Thu Feb 19 20:09:40 PST 2009
Some of the things I'd like to see in the regex implementation:
All functions accepting a compiled regex object/struct should also
accept a string version of the pattern (and vice versa). Some
implementations (Java) only accept the compiled version in some places
and the string pattern in other places. That's annoying.
Just like with ordinary string-searching functions, you should be able
to specify a start position (and maybe an end position) for the search.
Even if the match exists somewhere in the string, it fails if not found
within the target slice. Something like this:
auto text = "ABCDEFG";
auto pattern = regex("[ABCEFG]");
// returns false, because the char at position 3 does not match
auto result = match(text, 3);
// this should be exactly equivalent (but the previous version
// uses less memory, and ought to work with infinite ranges, whereas
// the slice version wouldn't make any sense)
auto equivalent = match(text[3..$]);
I've needed to use this technique in a few cases to implement a simple
lexical scanner, and it's a godsend, if the regex engine supports it
(though most don't).
Finally, it'd be extremely cool if the regex compiler automatically
eliminated redundant nodes from its NFA, converting as much of it as
possible to a DFA. I did some work on this a few years ago, and it's
actually remarkably simple to implement using prefix trees.
// These two expressions produce an identical set of matches,
// but the first one is functionally an NFA, while the second
// one is a DFA.
auto a = regex("(cat|car|cry|dog|door|dry)");
auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry)");
In cases where the expression can only be partially simplified, you can
leave some NFA nodes deep within the tree, while still DFA-ifying the
rest of it:
auto a = regex("(attitude|attribute|att.+ion");
auto b = regex("(att(?:itude|ribute|.+ion)");
It's a very simple transformation, increases speed (dramatically) for
complex regular expressions (especially those produced dynamically at
runtime by combining large sets of unrelated target expressions), and it
reliably produces equivalent results with the inefficient version.
The only really tricky part is if the subexpressions have their own
capturing groups, in which case the DFA transformation screws up the
ordinal-numbering of the resultant captures.
Anyhoo...
I don't have any strong feelings about the function names (though I'd
rather have functions that operators, like "~", for searching and matching).
And I don't have any strong feelings about whether the compiled regex is
an object or a struct (though I prefer reference semantics over value
semantics for regexen, and right now, I think that makes objects the
(slightly) better choice).
Thanks for your hard work! I've implemented a small regex engine before,
so I know it's no small chunk of effort. Regular expressions are my
personal favorite "tiny language", and I'm glad to see them get some
special attention in phobos2.
--benji
More information about the Digitalmars-d
mailing list