Improving std.regex(p)

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Jun 18 08:43:23 PDT 2010


Ellery Newcomer wrote:
> On 06/18/2010 06:36 AM, Ben Hanson wrote:
>> Reading your original post again and thinking about this a bit more...
>>
>> If someone can help me get up to speed with TMP in D, I could probably
>> put together a proof of concept pretty quickly. Aside from the D syntax,
>> it is all in the design (which is why I would like to discuss it in more
>> detail). For anyone who is interested, here is how I go about DFA
>> production currently:
>>
>> - Bog standard tokenisation of regex tokens
>> - Normalise each regex charset and enumerate in a map
>> - Build a regex syntax tree with leaf nodes storing the enumerated charset id
>> - Intersect all the charsets and build a vector of equivalent charsets
>> - Perform the followset to DFA transform as described in the Dragon book
>>
>> For my lexer generator, I build a lookup table from each character to an
>> equivalence class. For my wildcard matcher, I leave the representation as
>> characters and do a linear lookup of chars to the 'next' state. The lexer
>> equivalence class method (a la flex) is more efficient, as you just have
>> two lookups per character and no linear searches.
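
For concreteness, the two-lookup scheme described above boils down to a
table-driven matcher roughly like this sketch (illustrative only, not
lexertl's actual layout or names):

// Byte-oriented, flex-style DFA: matching costs two array lookups per
// input character and no linear search. Table contents are placeholders;
// a generator would fill them in from the constructed DFA.
struct Dfa
{
    ubyte[256] equivClass;   // character -> equivalence class id
    int[][] next;            // next[state][class] -> next state, -1 = dead
    bool[] accepting;        // accepting[state] -> is this a final state?

    bool match(const(char)[] input) const
    {
        int state = 0;
        foreach (char c; input)
        {
            state = next[state][equivClass[c]];   // lookup 1: class, lookup 2: transition
            if (state < 0)
                return false;
        }
        return accepting[state];
    }
}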
>>
>> I'm thinking I could look at getting as far as the regex syntax tree, and
>> then we could discuss DFA representation. It sounds like you want to
>> generate code directly - how do you do that?! This is why I mentioned the
>> .NET bytecode stuff - I imagine there is some overlap there, but I don't
>> know enough to be sure. Can you output source code at compilation time and
>> have D compile it, or do you use a load of recursion like C++ TMP?
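
As an aside on that last question: in D you can indeed build source text at
compile time with an ordinary function (CTFE) and have the compiler compile
it via a string mixin, rather than piling up recursion as in C++ TMP. A
minimal sketch, with a made-up generator function:

import std.stdio;

// makeMatcher is a toy generator: it returns D source text as a string.
// A real regex engine would emit DFA transition code here instead of a
// single comparison.
string makeMatcher(string name, string literal)
{
    return "bool " ~ name ~ "(string s) { return s == \"" ~ literal ~ "\"; }";
}

// The string is built at compile time (CTFE) and compiled in place.
mixin(makeMatcher("matchesFoo", "foo"));

void main()
{
    writeln(matchesFoo("foo"));   // true
    writeln(matchesFoo("bar"));   // false
}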
>>
>> And I forgot to mention - boost.spirit uses my lexertl library.
>>
>> Regards,
>>
>> Ben
> 
> I've always understood that DFAs have an exponential lower bound
> relative to .. something .. but I don't believe I have ever actually
> used a DFA generator (or written one, sadly enough). What has your
> experience been on DFA compile time?

I think only backtracking engines (not DFAs) have the exponential run-time 
problem. The regular expression grammar is itself context-free (so it's 
parsable efficiently), the NFA construction is linear, and the generated 
DFA makes one transition per input character. The exponential bound you're 
thinking of is on DFA size: the NFA->DFA subset construction can produce 
up to 2^n states in the worst case, though typical patterns come nowhere 
near that.
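
To make the contrast concrete, here is a toy sketch (not std.regex code): a
naive backtracking matcher for the pattern (a?){n}a{n} explores up to 2^n
branches on a string of n a's, while the DFA view of the same pattern needs
just one pass with one transition per character.

import std.array : replicate;
import std.stdio;

// Toy backtracking matcher for (a?){optionals} a{required}: each 'a?'
// either consumes an 'a' or matches empty, so the search tree has up to
// 2^optionals leaves before the only successful combination is found.
bool backtrack(string s, size_t optionals, size_t required)
{
    if (optionals == 0)
    {
        if (s.length != required) return false;
        foreach (c; s) if (c != 'a') return false;
        return true;
    }
    // Branch 1: 'a?' consumes one character.
    if (s.length && s[0] == 'a' && backtrack(s[1 .. $], optionals - 1, required))
        return true;
    // Branch 2: 'a?' matches the empty string.
    return backtrack(s, optionals - 1, required);
}

// The DFA view: the pattern just means "between n and 2n a's", so one
// left-to-right pass, one transition per character, decides it.
bool onePass(string s, size_t n)
{
    foreach (c; s) if (c != 'a') return false;
    return s.length >= n && s.length <= 2 * n;
}

void main()
{
    enum n = 26;
    string input = replicate("a", n);
    writeln(onePass(input, n));      // instant
    writeln(backtrack(input, n, n)); // crawls through roughly 2^26 branches first
}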


Andrei

