Improving std.regex(p)
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Jun 18 08:43:23 PDT 2010
Ellery Newcomer wrote:
> On 06/18/2010 06:36 AM, Ben Hanson wrote:
>> Reading your original post again and thinking about this a bit more...
>>
>> If someone can help me get up to speed with TMP in D, I could probably
>> put
>> together a proof of concept pretty quickly. Aside from the D syntax,
>> it is all
>> in the design (which is why I would like to discuss it in more
>> detail). For
>> anyone who is interested, here is how I go about DFA production
>> currently:
>>
>> - Bog standard tokenisation of regex tokens
>> - Normalise each regex charset and enumerate in a map
>> - build a regex syntax tree with leaf nodes storing the enumerated
>> charset id
>> - Intersect all the charsets and build a vector of equivalent charsets
>> - Perform the followset to DFA transform as described in the Dragon book
>>
>> For my lexer generator, I build a lookup table for each character to an
>> equivalence class. For my wild card matcher, I leave the
>> representation as
>> characters and do a linear lookup of chars to 'next' state. The lexer
>> equivalence class method (a la flex) is more efficient as you just
>> have two
>> lookups per character with no linear searches.
>>
>> I'm thinking I could look at getting as far as the regex syntax tree
>> and then
>> we could discuss DFA representation (it sounds like you want to
>> generate code
>> directly - how do you do that?! This is why I mentioned the .NET
>> bytecode stuff
>> - I imagine there is some overlap there, but I don't know enough to be
>> sure.
>> Can you output source code at compilation time and have D compile it,
>> or do you
>> use a load of recursion like C++ TMP?
>>
>> And I forgot to mention - boost.spirit uses my lexertl library.
>>
>> Regards,
>>
>> Ben
>
> I've always understood that dfas have an exponential lower bound
> relative to .. something .. but I don't believe I have ever actually
> used a dfa generator (or written one, sadly enough). What has your
> experience been on dfa compile time?
I think only backtracking engines (not DFAs) have the exponential run
time problem. Regular expression grammar is itself context free (so it's
parsable efficiently), the NFA and NFA->DFA algorithms are polynomial,
and the generated DFA makes one transition per input character.
Andrei
More information about the Digitalmars-d
mailing list