compile-time regex redux

Wed Feb 7 17:11:39 PST 2007

Andrei Alexandrescu (See Website For Email) wrote:
> Bill Baxter wrote:
>> Walter Bright wrote:
>>> String mixins, in order to be useful, need an ability to manipulate 
>>> strings at compile time. Currently, the core operations on strings 
>>> that can be done are:
>>>
>>> 1) indexed access
>>> 2) slicing
>>> 3) comparison
>>> 4) getting the length
>>> 5) concatenation
>>>
>>> Any other functionality can be built up from these using template 
>>> metaprogramming.
>>>
>>> The problem is that parsing strings using templates generates a large 
>>> number of template instantiations, is (relatively) very slow, and 
>>> consumes a lot of memory (at compile time, not runtime). For example, 
>>> ParseInteger would need 4 template instantiations to parse 5678, and 
>>> each template instantiation would also include the rest of the input 
>>> as part of the template instantiation's mangled name.
>>>
>>> At some point, this will prove a barrier to large scale use of this 
>>> feature.
>>>
>>> Andrei suggested using compile time regular expressions to shoulder 
>>> much of the burden, reducing parsing of any particular token to one 
>>> instantiation.
>>
>> That would help I suppose, but at the same time regexps themselves 
>> have a tendency to end up being 'write-only' code.  The heavy use of 
>> them in perl is I think a large part of what gives it a rep as a 
>> write-only language.   Heh heh.  I just found this regexp for matching 
>> RFC 822 email addresses:
>>     http://www.regular-expressions.info/email.html
>> (the one at the bottom of the page)
> 
> I think this must be qualified and understood in context. First, much of 
> Perl's reputation of write-only code has much to do with the implicit 
> variables and the generous syntax. The Perl regexps are a standard that 
> all other regexp packages emulate and compare against.

Agreed.  Implicit variables also make things tough to follow.  But 
regexps contribute to Perl's reputation for looking like line noise 
too.  I actually like Perl, and regular expressions are OK, but I don't 
think they're optimal for writing maintainable code.  They're dense, 
they're difficult to comment effectively, and they're not suited to 
every task: if you use them for something they're not particularly good 
at, they get very messy.

Unfortunately, a lot of what they're not good at is exactly the kind of 
thing you *need* them to be good at for parsing/generating code: 
parenthesis balancing, nested comment parsing, or quoted string munching.
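
To make the nesting point concrete, here is a minimal sketch in Python 
(not D, and not anything proposed in this thread).  The names 
parens_balanced, ONE_LEVEL and one_level_only are made up for the 
illustration.  A depth counter handles any nesting, while a classic 
regexp has to fix the maximum depth in advance:

```python
import re


def parens_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' at any nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


# Hypothetical pattern for illustration: it accepts parentheses only when
# they are not nested. Supporting depth 2, 3, ... means writing the
# pattern out again for each extra level.
ONE_LEVEL = re.compile(r"^[^()]*(\([^()]*\)[^()]*)*$")


def one_level_only(text: str) -> bool:
    """Return True if text has balanced, non-nested parentheses."""
    return ONE_LEVEL.match(text) is not None
```

With this, parens_balanced("f(a, g(b))") accepts the nested call, but 
one_level_only rejects it because the pattern knows nothing about the 
second level.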

They can be a good tool, but if they're the only tool, or even the main 
tool, I think we're in trouble.

> Showcasing the raw RFC 822 email parsing regexp is not very telling. 
> Notice there's a lot of repetition. With symbols, the grammar is very 
> easy to implement with readable regular expressions - and this is how 
> anyone in their right mind would do it.

True, it's not a realistic example.  The page says as much, and includes 
several versions that are more realistic.
Here's the recommended one:
   \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
OK, that's not so bad, but throw in a few sets of capturing parentheses 
here and there, and it starts to look pretty messy.
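
As a rough illustration in Python (an assumption on my part, since the 
page gives only the bare pattern): the character classes list only A-Z, 
so the sketch passes re.IGNORECASE to match lower-case addresses.  
EMAIL_PARTS and split_email are hypothetical names, and the placement 
of the groups is just one possible choice:

```python
import re

# The recommended pattern as quoted above. IGNORECASE is an assumption,
# needed because the classes only list upper-case letters.
EMAIL = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# The same pattern with capturing groups for the local part, the full
# domain, the host labels and the top-level domain.
EMAIL_PARTS = re.compile(
    r"\b([A-Z0-9._%-]+)@(([A-Z0-9.-]+)\.([A-Z]{2,4}))\b", re.IGNORECASE
)


def split_email(text):
    """Return (local, host, tld) for the first address in text, or None."""
    m = EMAIL_PARTS.search(text)
    if m is None:
        return None
    return m.group(1), m.group(3), m.group(4)
```

The pattern with groups does the same matching, but it is already 
noticeably harder to read than the one-liner.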

My opinion about regexps is that they're too dense and full of 
abbreviations.  And the typical methods for creating them don't 
encourage encapsulation and abstraction, which are the foundations of 
software.  For instance, every time you look at the above you have to 
re-interpret what [A-Z0-9._%-] really means.  When I'm writing regular 
expressions I always have to have that chart next to me to remember all 
those \s \b \w \S \W \ codes, and then again when trying to figure out 
what the code does later.  There has to be a better way.  Apparently the 
Perl guys think so too, because they're redoing regular expressions 
completely for Perl 6.
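
For what it's worth, some of the encapsulation can be had today by 
naming the pieces and composing them.  A minimal sketch in Python, 
using its re.VERBOSE flag and named groups; LOCAL_PART, DOMAIN_LABELS 
and TOP_LEVEL are names I made up, and re.IGNORECASE is again an 
assumption:

```python
import re

# Hypothetical named building blocks, so [A-Z0-9._%-] is interpreted once.
LOCAL_PART = r"[A-Z0-9._%-]+"
DOMAIN_LABELS = r"[A-Z0-9.-]+"
TOP_LEVEL = r"[A-Z]{2,4}"

# VERBOSE lets the pattern carry whitespace and comments.
EMAIL = re.compile(
    rf"""
    \b
    (?P<local>{LOCAL_PART})                    # text before the @
    @
    (?P<domain>{DOMAIN_LABELS} \. {TOP_LEVEL})  # host labels, dot, TLD
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = EMAIL.search("ask bb@example.com")
local, domain = m.group("local"), m.group("domain")
```

It matches the same strings as the one-liner, but each piece has a name 
and the groups are read back by name instead of by counting parentheses.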

--bb