Template regexes, version 2

Thu Feb 23 08:53:23 PST 2006

Don, I love this rendition.

In article <dtjv60$1tc9$1 at digitaldaemon.com>, Don Clugston says...
>
>I should mention that the code I posted doesn't do real regexps, because 
>it has no backtracking. It doesn't do *, +, ? properly, they are always 
>greedy. It can never match ".*a", for example.
>I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem.  (my
apologies if you've already mulled this over)

One way to go is provide a switch for greedy and non-greedy behaviors at the
interface level, such that a programmer can select the behavior.  Another is to
adopt a different token sequence in the expression itself (I think some engines
use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  

A third option, is to redesign the expression engine itself to work with
multiple matches, such that each predicate is understood to match the left-hand
side against n right-hand sides (as returned from other predicates).  This way,
the expression ".*a" will match "a" in each sucessive substring following each
successive attempt at matching "."; the result would be a "char[][]" containing
zero or more matches of the "*" predicate expression.

>However, the mixin technique means that the regexp is actually turned 
>into a binary tree of nested functions.
>
>eg (ab.*|c)a
>becomes something like:
>
>bool test(char [] searchstr)
>{
>   int p=0;
>
>   bool seq() {
>     bool union() {
>        bool seq() {
>            bool char_a {}
>            bool seq() {
>              bool char_b {}
>              bool star() {
>                  bool dot()
>              }
>            }
>         }
>         bool char_c() {}
>     }
>     bool char_a() {}
>   }
>   return seq();
>}
>
>so that in every case (including unions), the next function that needs 
>to be called is always in scope. Intriguingly, it also has access to the 
>local variables all the way down the tree. (Since of these levels can 
>contain its own stack as an array, I'm pretty sure that a Turing machine 
>can be created, it's likely to be more powerful than the state machines 
>normally used for regexps).

Also, I'm curious to see what DMD would do to optimize that code.  Would some of
those internals dissapear, or would they have to be anonymous functions in order
for that to happen?  It might be worth the trouble to go the latter, given there
are no function arguments used anywhere, as the resulting code might be pretty
darn close to using raw goto statements (is DMD smart enough to *not* create a
stack frame without use of the 'naked' keyword?).

>
>Obviously this can be used to do regexps properly. I was hoping to use a 
>clever naming scheme to avoid the alias issues from pragma's version, 
>but it seems to be inevitable.

Heh... finding that problem was the worst part. 

Don't forget how my solution has a tendency to hammer and pollute the symbol
namespace for large expressions.  It looks like you managed to work around some
of that as well.  Again, I'm floored by how much more can be gained via nested
functions rather than a more "traditional" approach.

>
>It seems hard to do this properly without creating a highly optimised 
>regexp parser!

Hehe.. so the optimal solution is also the most complete? ;)

- Eric Anderton at yahoo