Template regexes, version 2

Don Clugston dac at nospam.com.au
Thu Feb 23 01:27:25 PST 2006


I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.
However, the mixin technique means that the regexp is actually turned 
into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
}

so that in every case (including unions), the next function that needs 
to be called is always in scope. Intriguingly, it also has access to the 
local variables all the way down the tree. (Since of these levels can 
contain its own stack as an array, I'm pretty sure that a Turing machine 
can be created, it's likely to be more powerful than the state machines 
normally used for regexps).

Obviously this can be used to do regexps properly. I was hoping to use a 
clever naming scheme to avoid the alias issues from pragma's version, 
but it seems to be inevitable.

It seems hard to do this properly without creating a highly optimised 
regexp parser!

Don Clugston wrote:
> Walter Bright wrote:
>> "Don Clugston" <dac at nospam.com.au> wrote in message 
>> news:dtev42$1g98$1 at digitaldaemon.com...
>>> Perl-ish languages deal with the last case by allowing embedded 
>>> variables in strings. The template regexes Ive been developing can 
>>> cope with them by using additional parameters and extended syntax, eg 
>>> @1 for the string value of the first parameter.
>>> eg
>>>
>>> char [] var1 = "here";
>>> char [] input = " a56aherea";
>>> char [] x = firstmatch!("a at 1a")(input, var1);
>>> var1 = "56";
>>> char [] y = firstmatch!("a at 1a")(input, var1);
>>>
>>> would give x == "aherea", y=="aherea".
>>
>> Great! But can I suggest using the format used by 
>> std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.
> 
> OK. Except ... I think replace() doesn't have to worry about $ colliding 
> with the end of line marker? Anyway, I've used $n for now.
> (Doesn't SQL use @ for parameters in stored procedures? Can't quite 
> remember.)
> 
>>> How good is DMD at performing this optimisation?
>>
>> Not good enough. But I wouldn't worry about that for the moment.
> 
> I'm not really surprised -- I imagine the back-end doesn't have much 
> that's specific to D.
> 
> I've attached what I've done. It's not complete -- doesn't do ranges, or 
> any of the escape characters, or beginning and end of line.
> But it does do the full set of repetition operators, parentheses,
> and unions (arbitrarily nestable). It also does the $n parameters.
> Most of the remaining stuff is straightforward, and not very exciting.
> 
> I feel as though I've finally worked out what mixins are for. They are a 
> perfect match for this application.
> 
> In any case, this should give you a good idea for what you'll able to 
> use in your presentation.
> 



More information about the Digitalmars-d mailing list