Template regexes, version 2

Tue Feb 21 03:55:47 PST 2006

I've had another go at doing compile-time regexps. With all the bugfixes 
in recent releases, it's now possible to do it with mixins, generating 
local functions instead of global ones. This provides _considerably_ 
more flexibility.

The ultimately generated functions can look like:

char [] firstmatch(char [] s)
{
    char [][] results;
    int someOtherResult;

    void func(char [] s) {
       // this parses the string, using the original regexp string.
       // intermediate results (eg expressions in parentheses) are
       // in the local variables: results[][], someOtherResult, etc.
    }
    func(s);
    return results[0];
}

Usage is like:
char [] x = firstmatch!("ab+")(someStr);

It seems to me that there are 3 types of regexes:
* pure static -- where the regex string is a string literal, known at 
compile time
* pure dynamic -- eg, as used in a grep utility.
* pseudo-static. By this I mean regexps where the structure is constant, 
but some strings are replaced with variables.

Perl-ish languages deal with the last case by allowing embedded 
variables in strings. The template regexes Ive been developing can cope 
with them by using additional parameters and extended syntax, eg @1 for 
the string value of the first parameter.
eg

char [] var1 = "here";
char [] input = " a56aherea";
char [] x = firstmatch!("a at 1a")(input, var1);
var1 = "56";
char [] y = firstmatch!("a at 1a")(input, var1);

would give x == "aherea", y=="aherea".

As far as runtime efficiency goes, it's almost ideal, except that 
consecutive terms result in heavily nested trivial functions:

eg for the trivial "x\dy":
bool func(char [] s)
{
     bool a(char [] s) {
     	bool a(char [] s) {
             return s[0]=='y';
     	}
         return isdigit(s[0]) && a(s[1..$]);
     }
     return s[0]=='x' && a(s[1..$]);
}

and we need to rely on the optimiser to turn this into:
s[0]=='x' && isdigit(s[1]) && s[2]=='y';
How good is DMD at performing this optimisation? A nested function that 
is only called once _should_ always be inlined, no matter how deeply 
nested it is, but reality could be quite different...

Many optimisations are possible with this scheme (for example, the inner 
functions could just use int parameters, so that they don't need to pass 
s[] repeatedly; almost all of the optimisations from the runtime 
std.regexp could be applied). It should also be possible to implement 
Perl 6-style regexps.

...And then I return to the D newsgroups after a week and find the 
goalposts have moved: regexps are now built into the language.

The mixin regexps are only at an early stage of development, but given 
the current discussions I think it's important to know what can be done
with templates (probably more than most people expect). In the case of 
what I've called "pseudo-static" regexps, they are arguably more 
powerful than the built-in regexps of DMD 0.147.

I don't know where to go from here. There are a few possibilities:
1. use template regexps as a demonstration of the power of D templates.
--> Implement reduced syntax, keep the code simple and not well 
optimised; more of a proof-of-concept; go for "maximum coolness".
2. Like 1, but with optimisations; try to be the fastest regexp of any 
language. (cleaner-faster-better ... but we use the built-in regexps 
anyway <g>).
3. Create a standard library. This is what I was aiming for, but I don't
think it makes sense any more.
4. potential integration into the language. Is this even possible?

Probably the most sensible is to go with 1, to wake up the C++/boost 
crowd. Hopefully this should also provide some direction for the 
built-in syntax.
Thoughts?