Fixing the API of std.regex
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Mar 12 02:41:06 PDT 2013
*Spoiler*: let's slowly deprecate the "g" option in std.regex over a few
years, or with any luck a bit faster. A better replacement is proposed below.
For better or worse, the current API has retained a (high) level of
compatibility with the old API. That means I missed the chance to fix
it when I could, and here is the prime problem (the hardest) I have with it:
foreach(m; match("bleh-blah", "bl[ea]h"))
{
    writeln(m.hit);
}
The "quiz" is: how many lines will this print?
The current answer is 1. The correct way to get all matches is:
foreach(m; match("bleh-blah", regex("bl[ea]h","g")))
{
    writeln(m.hit);
}
This not only looks unsightly but also confuses an option of the operation
(find _all_ vs find _first_) with a property of the pattern (as
case-insensitivity is). And if the regex pattern is defined elsewhere, it
could easily introduce a bug (albeit one that's easy to track, "usually").
To underline the point: std.regex.splitter doesn't take the "g" flag into
account at all (it makes no sense there).
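To make the quiz concrete, here is a minimal sketch that counts the matches with and without "g". The helper name countMatches is made up for illustration; the rest uses only std.regex.match and std.regex.regex.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: counts the matches yielded by the range
// that match() returns for the given pattern and flags.
size_t countMatches(string input, string pattern, string flags = "")
{
    size_t n = 0;
    foreach (m; match(input, regex(pattern, flags)))
        ++n;
    return n;
}

void main()
{
    // No "g": the range stops after the first match.
    writeln(countMatches("bleh-blah", "bl[ea]h"));
    // With "g": the range walks every match in the input.
    writeln(countMatches("bleh-blah", "bl[ea]h", "g"));
}
```

The two calls differ only in a flag baked into the pattern, which is exactly the coupling complained about above.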
I've pondered a couple of solutions in a bug report by bearophile:
http://d.puremagic.com/issues/show_bug.cgi?id=7260
After all of these ideas were born and discarded, here is what I believe is
the way forward out of this mess:
Make "g" indicate only the intended _default_ search mode of the
pattern (global vs first match).
The user is free to override this default explicitly, and is in fact
encouraged to do so. The idea of a default search mode attached to the regex
pattern is marked as discouraged.
The overrides have to be convenient and backwards compatible.
Thus I propose the following:
match and replace become structs (types, oh my!) with the following
"interface":
struct match //ditto for replace
{
    //current behavior
    static auto opCall(.....);
    //get the first match / replace only first occurrence
    static auto first();
    // force to find all matches (still lazy range) and
    static auto all();
}
OT: C++ folks call this a namespace, but they don't have static opCall -
suckers ;) And I actually proposed (twice) to kill static opCall. Sweet
irony.
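A rough sketch of how such a struct-as-namespace could look, to show the call shapes only. Assumptions: the struct is named search (a hypothetical stand-in, so it does not clash with std.regex.match), it is fixed to string input and Regex!char for brevity, and first/all simply forward to matchFirst and matchAll, which requires a Phobos release that provides those two functions. This is not the proposed implementation, just one way the interface might be wired up.

```d
import std.regex;
import std.stdio;

// `search` is a hypothetical name standing in for the proposed `match` struct.
struct search
{
    // Current behaviour: the flags stored in the pattern decide.
    static auto opCall(string input, Regex!char re)
    {
        return match(input, re);
    }

    // First match only: returns its captures, not a range of matches.
    static auto first(string input, Regex!char re)
    {
        return matchFirst(input, re);
    }

    // Every match as a lazy range, whatever flags the pattern carries.
    static auto all(string input, Regex!char re)
    {
        return matchAll(input, re);
    }
}

void main()
{
    auto re = regex("bl[ea]h"); // note: no "g" here

    // One line per match, even though the pattern has no "g".
    foreach (m; search.all("bleh-blah", re))
        writeln(m.hit);

    // Iterating the first match yields its submatches.
    foreach (sub; search.first("bleh-blah", re))
        writeln(sub);

    // Plain call syntax goes through static opCall.
    writeln(search("bleh-blah", re).hit);
}
```

The search mode is now chosen at the call site, and the pattern stays a plain pattern.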
Then the motivating example would be:
foreach(m; match.all("bleh-blah", "bl[ea]h"))
{
    writeln(m.hit);
}
and:
//prints all submatches of the first match:
foreach(m; match.first("bleh-blah", "bl[ea]h"))
{
    // doesn't compile: match.first returns the first match itself,
    // so m is one of its submatches and has no .hit
    // that should make it harder to confuse
    // "first match" with "all matches"
    //writeln(m.hit);
    writeln(m);
}
We can go further and introduce the enhancement I have long dreamed of:
//'any' or 'test' are also the names to choose from
if(match.anywhere(input, "[0-9]+"))
{
    //there is at least 1 match (no need for other info)
    ...
}
The reason I want this "shorthand" is that the regex engine can cut a bunch
of corners and serve up this "is there a match somewhere?" request much,
MUCH faster than "where is the first match and all of its submatches?".
And many use cases only need this yes/no thing anyway.
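For illustration only, the yes/no query can be emulated with a small helper built on match(...).empty. The free function anywhere below is hypothetical. It shows the intended call shape, not the speedup: it still runs a full first-match search under the hood.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: true if the pattern matches anywhere in the input.
// It still performs a full first-match search; a dedicated engine path
// could skip recording the match position and submatches.
bool anywhere(string input, string pattern)
{
    return !match(input, regex(pattern)).empty;
}

void main()
{
    writeln(anywhere("order 66", "[0-9]+"));  // true
    writeln(anywhere("no digits", "[0-9]+")); // false
}
```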
... that got a bit lengthy. Any thoughts, criticism, opinions?
--
Dmitry Olshansky