Template regexes, version 2

Fri Feb 24 00:27:23 PST 2006

pragma wrote:
> Don, I love this rendition.

Thanks! I'm just amazed at how this stuff has progressed, it's so easy now.
I'm a little uncomfortable that people seem to be credited me for the 
regexp parser which you wrote. I want to make sure that you receive 
proper acknowledgment.

> In article <dtjv60$1tc9$1 at digitaldaemon.com>, Don Clugston says...
>> I should mention that the code I posted doesn't do real regexps, because 
>> it has no backtracking. It doesn't do *, +, ? properly, they are always 
>> greedy. It can never match ".*a", for example.
>> I was too focused on getting access to variables.
> 
> I'm scratching my head as to how to fix the greedy-matching problem.  (my
> apologies if you've already mulled this over)

> 
> One way to go is provide a switch for greedy and non-greedy behaviors at the
> interface level, such that a programmer can select the behavior.  Another is to
> adopt a different token sequence in the expression itself (I think some engines
> use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  
> 
> A third option, is to redesign the expression engine itself to work with
> multiple matches, such that each predicate is understood to match the left-hand
> side against n right-hand sides (as returned from other predicates).  This way,
> the expression ".*a" will match "a" in each sucessive substring following each
> successive attempt at matching "."; the result would be a "char[][]" containing
> zero or more matches of the "*" predicate expression.
> 
> 
>> However, the mixin technique means that the regexp is actually turned 
>> into a binary tree of nested functions.
>>
>> eg (ab.*|c)a
>> becomes something like:
>>
>> bool test(char [] searchstr)
>> {
>>   int p=0;
>>
>>   bool seq() {
>>     bool union() {
>>        bool seq() {
>>            bool char_a {}
>>            bool seq() {
>>              bool char_b {}
>>              bool star() {
>>                  bool dot()
>>              }
>>            }
>>         }
>>         bool char_c() {}
>>     }
>>     bool char_a() {}
>>   }
>>   return seq();
>> }
>>
>> so that in every case (including unions), the next function that needs 
>> to be called is always in scope. Intriguingly, it also has access to the 
>> local variables all the way down the tree. (Since of these levels can 
>> contain its own stack as an array, I'm pretty sure that a Turing machine 
>> can be created, it's likely to be more powerful than the state machines 
>> normally used for regexps).
> 
> Also, I'm curious to see what DMD would do to optimize that code.  Would some of
> those internals dissapear, or would they have to be anonymous functions in order
> for that to happen?  It might be worth the trouble to go the latter, given there
> are no function arguments used anywhere, as the resulting code might be pretty
> darn close to using raw goto statements (is DMD smart enough to *not* create a
> stack frame without use of the 'naked' keyword?).

I suspect it's not very optimised, since nested functions are not 
relevant for C++. But there's a lot of potential. If it can convert 
CALL; RET; sequences into JMP, the whole thing will collapse into an 
optimal mass of goto spaghetti, something like

     xxx
     call part2
     xxx
part2:
     xxx
     call part3
     xxx
part3:
     xxx
     ret

> 
>> Obviously this can be used to do regexps properly. I was hoping to use a 
>> clever naming scheme to avoid the alias issues from pragma's version, 
>> but it seems to be inevitable.
> 
> Heh... finding that problem was the worst part. 
> 
> Don't forget how my solution has a tendency to hammer and pollute the symbol
> namespace for large expressions.  It looks like you managed to work around some
> of that as well.  Again, I'm floored by how much more can be gained via nested
> functions rather than a more "traditional" approach.

Me too. It's just awesome how many language features are coming together 
in this: real compile time constants, template string parameters, nested 
functions, having ~ concatenation built into the language, mixins, alias 
template parameters, static if, constant folding, the !() notation for 
templates, array slices...
And C++ can't do any of those things. It's incredible.

Update: I got the aliases to work in about 20 minutes, without 
difficulties. Now it does proper non-greedy matches. Amazingly, the code 
is barely any more complicated. It still doesn't need explicit backtracking!

Greedy matching can be implemented by repeatedly attempting non-greedy 
matches, and finally redoing the last one which was successful. Because 
each step in the process has its own local variables, the only 
backtracking is restoring a copy of the current position.
I think that greedy matching is an inherently inefficient operation, it 
would be much nicer to have non-greedy as the default.

Consider the case
(expr1)*(expr2)
I think that what most engines do is keep matching (expr1) until they 
get a failure. Then try (expr2). If it fails, wind back (expr1) and try 
(expr2) again. Keep going until you've wound all the way back to the 
beginning. This gives only one successful match to expr2, but many 
unnecessary matches to expr1.

So it's good for cases like "b*(really_complex_regexp)"
But it's bad for cases like "(really_complex_regexp)*b"
Also, backtracking expr1 is a real pain.

My method is the reverse, it has no unnecessary matches to (expr1), but 
may have many for (expr2). I'm not sure that it's really inferior.

Ideally, at compile time you'd determine which of the expressions was 
more complex (perhaps just by counting the number of *,+, ? in each 
string) and choose between the two methods that way.

if something that matches (expr1) can ever be a partial match for 
(expr2); and if it can't, then you know you can just keep matching 
(expr1) until it fails, and only then evaluate (expr2). But I suspect 
that comparing two regexps is one of those NP-Hard problems from CompSci 
that's just not feasible to solve.

>> It seems hard to do this properly without creating a highly optimised 
>> regexp parser!
> 
> Hehe.. so the optimal solution is also the most complete? ;)

It certainly looks that way. And it seems to be the simplest, too. For 
quantifiers "{m,n}, *,+, ?", I used the (I thought) quick-and-dirty 
method of dealing with all of them at once. But a side-effect is that 
it's more optimal.

Also, I used functions with no parameters because it was quick and 
dirty. But it's optimal, too, and it's really easy to add more features.
It's a bizarre experience.