Is str ~ regex the root of all evil, or the leaf of all good?

Leandro Lucarella llucax at gmail.com
Thu Feb 19 18:10:21 PST 2009


Andrei Alexandrescu, el 18 de febrero a las 21:35 me escribiste:
> I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
> 
> First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some 
> particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled 
> form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by 
> the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
> 
> In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it 
> for all matches (if the "g" option was passed to the engine).
> 
> The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges 
> as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - 
> handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point 
> numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular 
> expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that 
> generalization without adding complexity and there is nothing in principle opposed to it.
> 
> The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
> 
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

BTW, why are the flags passed as string and not as an integer mask? For
example:
auto s = regex.sub("abracazoo", regex.regex("a([b-e])", regex.G), "A$1");

This way you can catch a few errors at compile-time.


-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
When I was a child I had a fever
My hands felt just like two balloons.
Now I've got that feeling once again
I can't explain you would not understand
This is not how I am.
I have become comfortably numb.



More information about the Digitalmars-d mailing list