Template regexes, version 2
Don Clugston
dac at nospam.com.au
Fri Feb 24 00:27:23 PST 2006
pragma wrote:
> Don, I love this rendition.
Thanks! I'm just amazed at how this stuff has progressed; it's so easy now.
I'm a little uncomfortable that people seem to be crediting me for the
regexp parser, which you wrote. I want to make sure that you receive
proper acknowledgment.
> In article <dtjv60$1tc9$1 at digitaldaemon.com>, Don Clugston says...
>> I should mention that the code I posted doesn't do real regexps, because
>> it has no backtracking. It doesn't do *, +, ? properly, they are always
>> greedy. It can never match ".*a", for example.
>> I was too focused on getting access to variables.
>
> I'm scratching my head as to how to fix the greedy-matching problem. (my
> apologies if you've already mulled this over)
>
> One way to go is provide a switch for greedy and non-greedy behaviors at the
> interface level, such that a programmer can select the behavior. Another is to
> adopt a different token sequence in the expression itself (I think some engines
> use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).
>
> A third option, is to redesign the expression engine itself to work with
> multiple matches, such that each predicate is understood to match the left-hand
> side against n right-hand sides (as returned from other predicates). This way,
> the expression ".*a" will match "a" in each successive substring following each
> successive attempt at matching "."; the result would be a "char[][]" containing
> zero or more matches of the "*" predicate expression.
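That third option amounts to a breadth-first, all-matches evaluation. A rough Python sketch of the idea (the names and structure here are mine, not the D template code from this thread): each predicate maps a set of candidate positions to the set of positions it can reach, so ".*a" succeeds at every position just past an 'a'.

```python
# Each predicate: (string, set of start positions) -> set of end positions.

def char_pred(c):
    """Match a literal character at each candidate position."""
    def pred(s, positions):
        return {p + 1 for p in positions if p < len(s) and s[p] == c}
    return pred

def dot(s, positions):
    """'.' advances every position by one (if not past the end)."""
    return {p + 1 for p in positions if p < len(s)}

def star(inner):
    """'x*': every position reachable by 0..n repetitions of x."""
    def pred(s, positions):
        result = set(positions)
        frontier = set(positions)
        while frontier:
            frontier = inner(s, frontier) - result
            result |= frontier
        return result
    return pred

def seq(*preds):
    """Sequence: feed each predicate's output positions into the next."""
    def pred(s, positions):
        for p in preds:
            positions = p(s, positions)
        return positions
    return pred

# ".*a" from position 0: one result per 'a' in the string.
dot_star_a = seq(star(dot), char_pred('a'))
print(dot_star_a("xxaya", {0}))  # positions just past each 'a': 3 and 5
```

Note there is no backtracking anywhere: the whole set of partial matches is carried forward at once, much like a Thompson NFA simulation.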
>
>
>> However, the mixin technique means that the regexp is actually turned
>> into a binary tree of nested functions.
>>
>> eg (ab.*|c)a
>> becomes something like:
>>
>> bool test(char[] searchstr)
>> {
>>     int p = 0;
>>
>>     bool seq() {
>>         bool union() {
>>             bool seq() {
>>                 bool char_a() {}
>>                 bool seq() {
>>                     bool char_b() {}
>>                     bool star() {
>>                         bool dot() {}
>>                     }
>>                 }
>>             }
>>             bool char_c() {}
>>         }
>>         bool char_a() {}
>>     }
>>     return seq();
>> }
>>
>> so that in every case (including unions), the next function that needs
>> to be called is always in scope. Intriguingly, it also has access to the
>> local variables all the way down the tree. (Since each of these levels
>> can contain its own stack as an array, I'm pretty sure that a Turing
>> machine can be created; it's likely to be more powerful than the state
>> machines normally used for regexps.)
>
> Also, I'm curious to see what DMD would do to optimize that code. Would some of
> those internals disappear, or would they have to be anonymous functions in order
> for that to happen? It might be worth the trouble to go the latter, given there
> are no function arguments used anywhere, as the resulting code might be pretty
> darn close to using raw goto statements (is DMD smart enough to *not* create a
> stack frame without use of the 'naked' keyword?).
I suspect it's not very optimised, since nested functions are not
relevant for C++. But there's a lot of potential. If it can convert
CALL; RET; sequences into JMP, the whole thing will collapse into an
optimal mass of goto spaghetti, something like
    xxx
    call part2
    xxx
part2:
    xxx
    call part3
    xxx
part3:
    xxx
    ret
>
>> Obviously this can be used to do regexps properly. I was hoping to use a
>> clever naming scheme to avoid the alias issues from pragma's version,
>> but it seems to be inevitable.
>
> Heh... finding that problem was the worst part.
>
> Don't forget how my solution has a tendency to hammer and pollute the symbol
> namespace for large expressions. It looks like you managed to work around some
> of that as well. Again, I'm floored by how much more can be gained via nested
> functions rather than a more "traditional" approach.
Me too. It's just awesome how many language features are coming together
in this: real compile time constants, template string parameters, nested
functions, having ~ concatenation built into the language, mixins, alias
template parameters, static if, constant folding, the !() notation for
templates, array slices...
And C++ can't do any of those things. It's incredible.
Update: I got the aliases to work in about 20 minutes, without
difficulties. Now it does proper non-greedy matches. Amazingly, the code
is barely any more complicated. It still doesn't need explicit backtracking!
Greedy matching can be implemented by repeatedly attempting non-greedy
matches, and finally redoing the last one which was successful. Because
each step in the process has its own local variables, the only
backtracking is restoring a copy of the current position.
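That scheme can be sketched in Python (hypothetical helper names; the real implementation in this thread is D template code): try the rest of the pattern after each successive repetition, remember the last position that worked, and the only state you ever restore is that saved position.

```python
def greedy_star(item, rest, s, pos):
    """Match item* followed by rest, preferring the longest item* run.

    item(s, pos) and rest(s, pos) each return the new position, or
    None on failure. Returns the position after rest, or None.
    """
    last_success = rest(s, pos)      # try rest after 0 repetitions
    while True:
        nxt = item(s, pos)
        if nxt is None:              # item* can't be extended further
            break
        pos = nxt
        r = rest(s, pos)
        if r is not None:
            last_success = r         # remember the last success only
    return last_success

def char(c):
    """A predicate matching one literal character."""
    def m(s, pos):
        return pos + 1 if pos < len(s) and s[pos] == c else None
    return m

# "b*a" against "bbba": all three b's are consumed, then the 'a'.
print(greedy_star(char('b'), char('a'), "bbba", 0))  # -> 4
```

There is no match-state stack to unwind: each failed attempt at `rest` is simply discarded, which is the point the post is making.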
I think that greedy matching is an inherently inefficient operation; it
would be much nicer to have non-greedy as the default.
Consider the case
(expr1)*(expr2)
I think that what most engines do is keep matching (expr1) until they
get a failure. Then try (expr2). If it fails, wind back (expr1) and try
(expr2) again. Keep going until you've wound all the way back to the
beginning. This gives only one successful match to expr2, but many
unnecessary matches to expr1.
So it's good for cases like "b*(really_complex_regexp)"
But it's bad for cases like "(really_complex_regexp)*b"
Also, backtracking expr1 is a real pain.
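For contrast, that conventional greedy-with-backtracking strategy might be sketched like this in Python (again, hypothetical names, not code from the thread): consume (expr1) to exhaustion, then unwind one repetition at a time, retrying (expr2) at each boundary.

```python
def backtracking_star(expr1, expr2, s, pos):
    """Match expr1* then expr2, greedily, by explicit backtracking.

    expr1(s, pos) and expr2(s, pos) return the new position or None.
    """
    stack = [pos]                    # positions after 0, 1, 2, ... matches
    while True:
        nxt = expr1(s, stack[-1])
        if nxt is None or nxt == stack[-1]:
            break                    # stop on failure or zero-width match
        stack.append(nxt)
    while stack:                     # wind back, longest attempt first
        p = stack.pop()
        r = expr2(s, p)
        if r is not None:
            return r
    return None

def char(c):
    """A predicate matching one literal character."""
    def m(s, pos):
        return pos + 1 if pos < len(s) and s[pos] == c else None
    return m

# "b*a" against "bba": expr1 runs to position 2, expr2 matches there.
print(backtracking_star(char('b'), char('a'), "bba", 0))  # -> 3
```

The `stack` of intermediate positions is exactly the backtracking state the post calls "a real pain" when (expr1) is itself a complex expression.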
My method is the reverse: it has no unnecessary matches to (expr1), but
may have many for (expr2). I'm not sure that it's really inferior.
Ideally, at compile time you'd determine which of the expressions was
more complex (perhaps just by counting the number of *, +, ? in each
string) and choose between the two methods that way. Even better would be
to determine whether something that matches (expr1) can ever be a partial
match for (expr2); if it can't, then you know you can just keep matching
(expr1) until it fails, and only then evaluate (expr2). But I suspect
that comparing two regexps is one of those NP-hard problems from CompSci
that's just not feasible to solve.
>> It seems hard to do this properly without creating a highly optimised
>> regexp parser!
>
> Hehe.. so the optimal solution is also the most complete? ;)
It certainly looks that way. And it seems to be the simplest, too. For
the quantifiers {m,n}, *, +, and ?, I used the (I thought)
quick-and-dirty method of dealing with all of them at once. But a
side-effect is that it's actually more efficient.
Also, I used functions with no parameters because it was quick and
dirty. But it's optimal, too, and it's really easy to add more features.
It's a bizarre experience.