Is str ~ regex the root of all evil, or the leaf of all good?

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Feb 19 07:01:56 PST 2009


bearophile wrote:
> Andrei Alexandrescu:
> 
>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
> 
> I like the following syntaxes (the one with .match() too):
> 
> import std.re: regex;
> 
> foreach (e; regex("a[b-e]", "g") in "abracazoo")
>      writeln(e);
> 
> foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>      writeln(e);
> 
> auto re1 = regex("a[b-e]", "g");
> foreach (e; re1.match("abracazoo"))
>      writeln(e);
> 
> auto re1 = regex("a[b-e]", "g");
> foreach (e; re1 in "abracazoo")
>      writeln(e);

These all put the regex before the string, something many people would 
find unsavory.

> ----------------
> 
> I also like support for verbose regular expressions, which ignore whitespace and allow comments (for example with //...) inside the regex itself. This simple thing is able to turn the messy world of regexes into programming again.
> 
> This is an example of usual RE in Python:
> 
> finder = re.compile(r"^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
> 
> 
> This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax):
> 
> finder = re.compile(r"""
>     ^ \s*             # start at beginning+ opt spaces
>     ( [\[\]] )        # Group 1: opening bracket
>         \s*           # optional spaces
>         ( [-+]? \d+ ) # Group 2: first number
>         \s* , \s*     # opt spaces+ comma+ opt spaces
>         ( [-+]? \d+ ) # Group 3: second number
>         \s*           # opt spaces
>     ( [\[\]] )        # Group 4: closing bracket
>     \s* $             # opt spaces+ end at the end
>     """, flags=re.VERBOSE)
> 
> As you can see, it often helps to indent those lines logically, just like code.

Yah, I saw that ECMA introduced comments in regexes too. At some point 
we'll implement that.
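
For anyone who wants to check that the two quoted forms really are the
same pattern, here is a minimal Python sketch (Python only because the
quoted example uses it; this is not the D API under discussion). It
compiles the compact and the verbose form and compares their captures.
The sample inputs and the helper name groups_or_none are made up for
illustration.

```python
import re

# Compact form (raw string, so the backslashes reach the regex engine as-is).
compact = re.compile(r"^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")

# Verbose form: whitespace outside character classes and "#" comments are ignored.
verbose = re.compile(r"""
    ^ \s*             # start + optional spaces
    ( [\[\]] )        # group 1: opening bracket
    \s*               # optional spaces
    ( [-+]? \d+ )     # group 2: first number
    \s* , \s*         # optional spaces + comma + optional spaces
    ( [-+]? \d+ )     # group 3: second number
    \s*               # optional spaces
    ( [\[\]] )        # group 4: closing bracket
    \s* $             # optional spaces + end
    """, flags=re.VERBOSE)

# Hypothetical sample inputs: two that should match, two that should not.
samples = ["[1, 2]", " ] -3 ,+4 [ ", "[1 2]", "(1, 2)"]

def groups_or_none(pattern, text):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.groups() if m else None

# One (compact, verbose) pair of results per sample.
results = [(groups_or_none(compact, s), groups_or_none(verbose, s))
           for s in samples]

for sample, (c, v) in zip(samples, results):
    print(repr(sample), c, v)
```

Each pair in results should hold two identical entries, which is the
point: the verbose form changes the layout, not the pattern.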

> ----------------
> 
> Like other people here, I don't much like the following; it's a misleading overload of the ~ operator:
> 
> "abracazoo" ~ regex("a[b-e]", "g")
> 
> ----------------
> 
> I don't like that "g" argument much, my suggestions:
> 
> RE attributes:
> "repeat", "r": Repeat over the whole input string
> "ignorecase", "i": case insensitive
> "multiline", "m": treat as multiple lines separated by newlines
> "verbose", "v": ignores space outside [] and allows comments

And how do you combine them? "repeat, ignorecase"? Writing and parsing 
such options becomes a little adventure in itself. I think the "g", "i", 
and "m" flags are familiar enough to anyone who has done any amount of 
regex programming. If you haven't, you'll look them up in the manual 
regardless.
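
To make the point about combining concrete, here is a rough Python
sketch (not the D implementation) of how single-letter flags compose:
the flag string is scanned one character at a time, so "gi" needs no
separators or tokenizing. The names FLAG_BITS, parse_flags and matches
are hypothetical, and treating "g" as "return every match" is an
assumption borrowed from the ECMAScript convention.

```python
import re

# Hypothetical mapping from single-letter flags to Python's re flags.
# "g" has no re equivalent, so it is tracked separately as "find all matches".
FLAG_BITS = {"i": re.IGNORECASE, "m": re.MULTILINE}

def parse_flags(flags):
    """Return (re_flags, find_all) for a flag string such as "gi"."""
    bits, find_all = 0, False
    for ch in flags:
        if ch == "g":
            find_all = True
        elif ch in FLAG_BITS:
            bits |= FLAG_BITS[ch]
        else:
            raise ValueError("unknown regex flag: %r" % ch)
    return bits, find_all

def matches(pattern, flags, text):
    """Return the matched substrings: all of them with "g", else at most one."""
    bits, find_all = parse_flags(flags)
    compiled = re.compile(pattern, bits)
    if find_all:
        return [m.group() for m in compiled.finditer(text)]
    m = compiled.search(text)
    return [m.group()] if m else []

print(matches("a[b-e]", "g", "abracazoo"))
print(matches("a[b-e]", "gi", "ABRACAZOO"))
```

With word-style options the parser would also have to split on
separators, trim whitespace and decide what to do about abbreviations,
which is the little adventure mentioned above.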

> If it doesn't already, I'd like sub() to accept either a string or a callable as the replacement.

It does, I haven't mentioned it yet. Pass-by-alias of course :o).
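
For comparison, this is roughly what the string-or-callable behavior
looks like in Python's re module, which is presumably the model for the
request. It is only an analogy: the D version takes the callable by
alias, and its exact signature is not shown in this thread. The
variable names are made up for illustration.

```python
import re

text = "abracazoo"
pattern = re.compile(r"a[b-e]")

# Replacement given as a plain string: every match becomes "X".
with_string = pattern.sub("X", text)

# Replacement given as a callable: it receives the match object and
# returns the replacement text, here the matched text in upper case.
with_callable = pattern.sub(lambda m: m.group().upper(), text)

print(with_string)
print(with_callable)
```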


Andrei


