Improving std.regex(p)

Fri Jun 18 07:32:01 PDT 2010

On 06/18/2010 06:36 AM, Ben Hanson wrote:
> Reading your original post again and thinking about this a bit more...
>
> If someone can help me get up to speed with TMP in D, I could probably put
> together a proof of concept pretty quickly. Aside from the D syntax, it is all
> in the design (which is why I would like to discuss it in more detail). For
> anyone who is interested, here is how I go about DFA production currently:
>
> - Bog standard tokenisation of regex tokens
> - Normalise each regex charset and enumerate in a map
> - build a regex syntax tree with leaf nodes storing the enumerated charset id
> - Intersect all the charsets and build a vector of equivalent charsets
> - Perform the followset to DFA transform as described in the Dragon book
>
> For my lexer generator, I build a lookup table for each character to an
> equivalence class. For my wild card matcher, I leave the representation as
> characters and do a linear lookup of chars to 'next' state. The lexer
> equivalence class method (a la flex) is more efficient as you just have two
> lookups per character with no linear searches.
>
> I'm thinking I could look at getting as far as the regex syntax tree and then
> we could discuss DFA representation (it sounds like you want to generate code
> directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
> - I imagine there is some overlap there, but I don't know enough to be sure.
> Can you output source code at compilation time and have D compile it, or do you
> use a load of recursion like C++ TMP?
>
> And I forgot to mention - boost.spirit uses my lexertl library.
>
> Regards,
>
> Ben

I've always understood that dfas have an exponential lower bound 
relative to .. something .. but I don't believe I have ever actually 
used a dfa generator (or written one, sadly enough). What has your 
experience been on dfa compile time?

I suppose RE2's syntax is a good starting point

http://code.google.com/p/re2/wiki/Syntax

I wonder about those unicode character classes, though. Looks like an 
imminent head-on collision with dmd's issues regarding memory management 
at compile time.