Is str ~ regex the root of all evil, or the leaf of all good?

Thu Feb 19 07:34:09 PST 2009

On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> bearophile wrote:
>> Andrei Alexandrescu:
>>
>>> but most regex code I've seen mentions the string first and the regex  
>>> second. So I dropped that idea.<
>> I like the following syntaxes (the one with .match() too):
>>
>> import std.re: regex;
>>
>> foreach (e; regex("a[b-e]", "g") in "abracazoo")
>>     writeln(e);
>>
>> foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>>     writeln(e);
>>
>> auto re1 = regex("a[b-e]", "g");
>> foreach (e; re1.match("abracazoo"))
>>     writeln(e);
>>
>> auto re1 = regex("a[b-e]", "g");
>> foreach (e; re1 in "abracazoo")
>>     writeln(e);
>
> These all put the regex before the string, something many people would  
> find unsavory.
>
>> ----------------
>>  I like the support of verbose regular expressions too, that ignore  
>> whitespace and comments (for example with //...) inserted into the  
>> regex itself. This simple thing is able to turn the messy world of  
>> regexes into programming again.
>> Here is a typical RE in Python:
>>
>> finder = re.compile(r"^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>>
>> This is the same RE in verbose mode, still in Python (# is the Python
>> single-line comment syntax):
>>  finder = re.compile(r"""
>>     ^ \s*             # start at beginning+ opt spaces
>>     ( [\[\]] )        # Group 1: opening bracket
>>         \s*           # optional spaces
>>         ( [-+]? \d+ ) # Group 2: first number
>>         \s* , \s*     # opt spaces+ comma+ opt spaces
>>         ( [-+]? \d+ ) # Group 3: second number
>>         \s*           # opt spaces
>>     ( [\[\]] )        # Group 4: closing bracket
>>     \s* $             # opt spaces+ end at the end
>>     """, flags=re.VERBOSE)
>> As you can see, it often helps a lot to indent those lines logically,
>> just like code.
>
> Yah, I saw that ECMA introduced comments in regexes too. At some point  
> we'll implement that.
>
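
To make the verbose-mode point concrete, here is a minimal Python sketch
checking that the compact and verbose spellings quoted above accept the
same input. The names COMPACT, VERBOSE and the sample string are made up
for illustration; this is Python's re module, not the proposed D API.

```python
import re

# Compact form of the pattern from the quoted message (raw string).
COMPACT = re.compile(r"^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")

# Same pattern in verbose mode: whitespace outside character classes is
# ignored and "#" starts a comment that runs to the end of the line.
VERBOSE = re.compile(r"""
    ^ \s*
    ( [\[\]] )        # Group 1: opening bracket
    \s* ( [-+]? \d+ ) # Group 2: first number
    \s* , \s*
    ( [-+]? \d+ )     # Group 3: second number
    \s* ( [\[\]] )    # Group 4: closing bracket
    \s* $
    """, flags=re.VERBOSE)

# Hypothetical sample input, chosen only to exercise all four groups.
sample = " [ -3, +17 ] "
compact_groups = COMPACT.match(sample).groups()
verbose_groups = VERBOSE.match(sample).groups()
```

Both calls yield the same four groups, so the layout and comments cost
nothing in behaviour.
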
>> ----------------
>> Like other people here, I don't much like the following; it's a
>> misleading overload of the ~ operator:
>>  "abracazoo" ~ regex("a[b-e]", "g")
>>  ----------------
>>  I don't like that "g" argument much. My suggestions:
>>  RE attributes:
>> "repeat", "r": Repeat over the whole input string
>> "ignorecase", "i": case insensitive
>> "multiline", "m": treat as multiple lines separated by newlines
>> "verbose", "v": ignores space outside [] and allows comments
>
> And how do you combine them? "repeat, ignorecase"? Writing and parsing  
> such options becomes a little adventure in itself. I think the "g", "i",  
> and "m" flags are popular enough if you've done any amount of regex  
> programming. If not, you'll look up the manual regardless.
>

Perhaps string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); would be
better? I find "gmi" neither immediately clear nor self-documenting.
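
For comparison, Python's re module already combines options this way,
with named constants joined by |. A rough sketch follows; FLAGS and
find_all are names invented for this example, and nothing here is meant
to describe what the D library will actually look like.

```python
import re

# Named flags combined with bitwise OR, instead of a string like "gmi".
# These constants are Python's own; the Regex.Repeat / Regex.IgnoreCase
# names in the suggestion above are hypothetical.
FLAGS = re.IGNORECASE | re.MULTILINE

def find_all(pattern, text, flags=0):
    """Return the text of every non-overlapping match of pattern."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# With both flags, "^" anchors at each line start and case is ignored.
matches = find_all(r"^a[b-e]", "Abra\nacid\nzoo", FLAGS)
```

Here matches holds "Ab" and "ac"; without the flags the same call finds
nothing, and the call site says why without a trip to the manual.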

>> If not already so, I'd like sub() to take as replacement a string or a  
>> callable.
>
> It does, I haven't mentioned it yet. Pass-by-alias of course :o).
>
>
> Andrei
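
On sub() taking either a string or a callable: as a point of reference,
Python's re.sub accepts both forms. A small sketch, with the helper name
shout made up for the example, and no claim that the D version will have
the same signature:

```python
import re

pattern = re.compile(r"a[b-e]")

# Replacement given as a plain string.
with_string = pattern.sub("_", "abracazoo")

# Replacement given as a callable: it receives each match object and
# returns the text to substitute for that match.
def shout(m):
    return m.group(0).upper()

with_callable = pattern.sub(shout, "abracazoo")
```

The string form turns "abracazoo" into "_r_azoo", and the callable form
gives "ABrACazoo".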