Is str ~ regex the root of all evil, or the leaf of all good?

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Feb 19 07:50:21 PST 2009


Denis Koroskin wrote:
> On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> wrote:
> 
>> bearophile wrote:
>>> Andrei Alexandrescu:
>>>
>>>> but most regex code I've seen mentions the string first and the 
>>>> regex second. So I dropped that idea.
>>>  I like the following syntaxes (the one with .match() too):
>>>  import std.re: regex;
>>>  foreach (e; regex("a[b-e]", "g") in "abracazoo")
>>>      writeln(e);
>>>  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>>>      writeln(e);
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1.match("abracazoo"))
>>>      writeln(e);
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1 in "abracazoo")
>>>      writeln(e);
>>
>> These all put the regex before the string, something many people would 
>> find unsavory.
>>
>>> ----------------
>>>  I like the support of verbose regular expressions too, that ignore 
>>> whitespace and comments (for example with //...) inserted into the 
>>> regex itself. This simple thing is able to turn the messy world of 
>>> regexes into programming again.
>>>  This is an example of usual RE in Python:
>>>  finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>>>   This is the same RE in verbose mode, in Python still (# is the 
>>> Python single-line comment syntax):
>>>  finder = re.compile(r"""
>>>     ^ \s*             # start at beginning+ opt spaces
>>>     ( [\[\]] )        # Group 1: opening bracket
>>>         \s*           # optional spaces
>>>         ( [-+]? \d+ ) # Group 2: first number
>>>         \s* , \s*     # opt spaces+ comma+ opt spaces
>>>         ( [-+]? \d+ ) # Group 3: second number
>>>         \s*           # opt spaces
>>>     ( [\[\]] )        # Group 4: closing bracket
>>>     \s* $             # opt spaces+ end at the end
>>>     """, flags=re.VERBOSE)
>>>  As you can see it's often very positive to indent logically those 
>>> lines just like code.
>>
>> Yah, I saw that ECMA introduced comments in regexes too. At some point 
>> we'll implement that.
>>
>>> ----------------
>>>  As the other people here, I don't like the following much, it's a 
>>> misleading overload of the ~ operator:
>>>  "abracazoo" ~ regex("a[b-e]", "g")
>>>  ----------------
>>>  I don't like that "g" argument much, my suggestions:
>>>  RE attributes:
>>> "repeat", "r": Repeat over the whole input string
>>> "ignorecase", "i": case insensitive
>>> "multiline", "m": treat as multiple lines separated by newlines
>>> "verbose", "v": ignores space outside [] and allows comments
>>
>> And how do you combine them? "repeat, ignorecase"? Writing and parsing 
>> such options becomes a little adventure in itself. I think the "g", 
>> "i", and "m" flags are popular enough if you've done any amount of 
>> regex programming. If not, you'll look up the manual regardless.
>>
> 
> Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might 
> be better? I don't find "gmi" immediately clear nor self-documenting.

I got disabused a very long time ago of the notion that everything about 
regexes is clear or self-documenting. Really. You just get to a level of 
understanding that's appropriate for your needs. On that scale, getting 
used to "gmi" is so low, it's not even worth discussing.
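
To make the comparison concrete, here is a minimal sketch in Python (chosen only because the verbose example above is already in Python). It assumes a hypothetical RegexOption flag set in the spirit of Regex.Repeat | Regex.IgnoreCase, plus a hypothetical parse_flags helper that maps the letters "g", "i", and "m" onto it. None of these names belong to std.regex or to any actual proposal; they only illustrate that the two spellings carry the same information.

```python
import re
from enum import Flag, auto


class RegexOption(Flag):
    """Hypothetical named options (illustration only, not a real library API)."""
    NONE = 0
    REPEAT = auto()       # "g": find every match, not just the first
    IGNORE_CASE = auto()  # "i": case-insensitive matching
    MULTILINE = auto()    # "m": ^ and $ also match at line boundaries


# One letter per option, following the "g", "i", "m" convention.
LETTERS = {
    "g": RegexOption.REPEAT,
    "i": RegexOption.IGNORE_CASE,
    "m": RegexOption.MULTILINE,
}


def parse_flags(letters):
    """Turn a string such as "gmi" into a combined RegexOption value."""
    options = RegexOption.NONE
    for ch in letters:
        if ch not in LETTERS:
            raise ValueError("unknown regex flag: %r" % ch)
        options |= LETTERS[ch]
    return options


def find(pattern, text, options):
    """Return all matches if REPEAT is set, otherwise at most the first one."""
    re_flags = 0
    if options & RegexOption.IGNORE_CASE:
        re_flags |= re.IGNORECASE
    if options & RegexOption.MULTILINE:
        re_flags |= re.MULTILINE
    compiled = re.compile(pattern, re_flags)
    if options & RegexOption.REPEAT:
        return [m.group(0) for m in compiled.finditer(text)]
    m = compiled.search(text)
    return [m.group(0)] if m else []


# Both spellings select the same behavior.
by_letters = find("a[b-e]", "abracazoo", parse_flags("gi"))
by_names = find("a[b-e]", "abracazoo",
                RegexOption.REPEAT | RegexOption.IGNORE_CASE)
```

Under these assumptions the parsing step is a short loop over single characters, so the letter form costs little to support; whether it reads better than the named form is the matter of taste under discussion.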


Andrei


