Improving std.regex(p)

Ben Hanson Ben.Hanson at tfbplc.co.uk
Fri Jun 18 04:36:48 PDT 2010


Reading your original post again and thinking about this a bit more...

If someone can help me get up to speed with TMP in D, I could probably put
together a proof of concept pretty quickly. Aside from the D syntax, it is all
in the design (which is why I would like to discuss it in more detail). For
anyone who is interested, here is how I go about DFA production currently:

- Bog standard tokenisation of regex tokens
- Normalise each regex charset and enumerate in a map
- build a regex syntax tree with leaf nodes storing the enumerated charset id
- Intersect all the charsets and build a vector of equivalent charsets
- Perform the followset to DFA transform as described in the Dragon book

For my lexer generator, I build a lookup table for each character to an
equivalence class. For my wild card matcher, I leave the representation as
characters and do a linear lookup of chars to 'next' state. The lexer
equivalence class method (a la flex) is more efficient as you just have two
lookups per character with no linear searches.

I'm thinking I could look at getting as far as the regex syntax tree and then
we could discuss DFA representation (it sounds like you want to generate code
directly - how do you do that?! This is why I mentioned the .NET bytecode stuff
- I imagine there is some overlap there, but I don't know enough to be sure.
Can you output source code at compilation time and have D compile it, or do you
use a load of recursion like C++ TMP?

And I forgot to mention - boost.spirit uses my lexertl library.

Regards,

Ben


More information about the Digitalmars-d mailing list