Fixing the API of std.regex

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Mar 12 02:41:06 PDT 2013


*Spoiler*: let's slowly deprecate the "g" option in std.regex over a few 
years, or with any luck a bit faster. A better replacement is proposed below.

For better or worse the current API has retained a (high) level of 
compatibility with the old API. That means I've missed the chance to fix 
it when I could, and here is the prime problem (the hardest) I have with it:

foreach(m; match("bleh-blah", "bl[ea]h"))
{
	writeln(m.hit);
}

The "quiz" is - how many lines will this print?

The current answer is 1. And the right way to get all matches is:

foreach(m; match("bleh-blah", regex("bl[ea]h","g")))
{
	writeln(m.hit);
}

Which not only looks unsightly but also confuses an operation option 
(find _all_ vs find _first_) with a property of the pattern (like 
case-insensitivity is). And if the regex pattern is defined elsewhere it 
could easily introduce a bug (albeit one that's easy to track, "usually").

To underline the point: std.regex.splitter doesn't take the "g" flag into 
account at all (it makes no sense there).
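
A small sketch of that point, assuming a reasonably recent Phobos; the input string and variable names are made up for illustration. The same separator pattern is compiled with and without "g", and splitter is expected to produce the same pieces either way:

```d
import std.regex : regex, splitter;
import std.stdio : writeln;

void main()
{
    // Illustrative patterns: identical except for the "g" flag.
    auto once   = regex("-");
    auto global = regex("-", "g");

    // Splitting always walks the whole input, so the flag makes no difference.
    foreach (piece; splitter("a-b-c", once))
        writeln(piece);
    foreach (piece; splitter("a-b-c", global))
        writeln(piece);
}
```

Both loops should print the three pieces a, b and c, which is exactly why a per-pattern "find all" switch is the wrong place for this decision.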

I've pondered a couple of solutions in a bug report by bearophile:
http://d.puremagic.com/issues/show_bug.cgi?id=7260

After all of these ideas were born and discarded, here is what I believe is 
the way forward out of this mess:

Make "g" indicate only the intended _default_ search mode of this 
pattern (global vs. first match).

The user is free to override this default explicitly and is in fact 
encouraged to do so. The idea of a default search mode attached to the 
regex pattern is marked as discouraged.

The overrides have to be convenient and backwards compatible.
Thus I propose the following:

match and replace become structs (types, oh my!) with the following 
"interface":

struct match //ditto  for replace
{
	//current behavior
	static auto opCall(.....);
	//get the first match / replace only the first occurrence
	static auto first();
	//force finding all matches (still a lazy range) / replace all occurrences
	static auto all();
}
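
To make the shape more concrete, here is a minimal sketch of such a struct that simply forwards to the existing functions. Everything beyond the names match, first and all is an assumption for illustration: the stdMatch alias, the parameter lists, and the choice to build first and all on top of the current match with and without "g". It is not a committed design.

```d
import std.regex : regex, stdMatch = match; // stdMatch: local alias for std.regex.match
import std.stdio : writeln;

// Hypothetical "namespace" struct; it has no fields and is never instantiated.
struct match
{
    // Current behavior: the search mode comes from the pattern's own flags.
    static auto opCall(R, RegEx)(R input, RegEx re)
    {
        return stdMatch(input, re);
    }

    // Only the first match: returns its captures (whole match plus submatches).
    static auto first(R)(R input, string pattern)
    {
        return stdMatch(input, regex(pattern)).captures;
    }

    // All matches, as a lazy range, regardless of any default.
    static auto all(R)(R input, string pattern)
    {
        return stdMatch(input, regex(pattern, "g"));
    }
}

void main()
{
    // Old call syntax keeps working through static opCall.
    writeln(match("bleh-blah", "bl[ea]h").hit);

    // Explicit search mode at the call site.
    foreach (m; match.all("bleh-blah", "bl[ea]h"))
        writeln(m.hit);
    writeln(match.first("bleh-blah", "bl[ea]h").hit);
}
```

The point of the sketch is only that static opCall lets match(...) keep compiling while match.first and match.all hang off the same name.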

OT: C++ folks call this namespace, but they don't have static opCall - 
suckers ;)  And I actually proposed (twice) to kill static opCall, sweet 
irony.

Then the motivating example would be :

foreach(m; match.all("bleh-blah", "bl[ea]h"))
{
	writeln(m.hit);
}

and :

//prints all submatches of the first match:
foreach(m; match.first("bleh-blah", "bl[ea]h"))
{
	// wouldn't compile: we iterate over the first match itself, so m has no .hit
	// that should make it harder to confuse
	// "first match" with "all matches"
	//writeln(m.hit);
	writeln(m);
}

We can go further and introduce the enhancement I long dreamed of:

//'any' or 'test' are also the names to choose from
if(match.anywhere(input, "[0-9]+"))
{
	//there is at least 1 match (no need for other info)
	...
}

The reason I want this "shorthand" is that the regex engine can cut a bunch 
of corners and serve up this "is there a match somewhere?" request much, 
MUCH faster than "where is the first match and all of its submatches?". 
And many use cases only need this yes/no thing anyway.
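
For the call-site shape only, a rough stand-in can be written against the current API. The free function anywhere and the stdMatch alias below are hypothetical names, and this version still runs the full matcher, so it has none of the speed benefit described above:

```d
import std.regex : regex, stdMatch = match; // stdMatch: local alias for std.regex.match
import std.stdio : writeln;

// Hypothetical helper: answers only "is there a match somewhere?".
// A real implementation would skip submatch tracking; this one does not.
bool anywhere(R)(R input, string pattern)
{
    return !stdMatch(input, regex(pattern)).empty;
}

void main()
{
    if (anywhere("order 66", "[0-9]+"))
        writeln("found a number");
}
```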

... that got a bit lengthy - any thoughts, criticism, opinions ?

-- 
Dmitry Olshansky

