"Regular Expression Matching Can Be Simple And Fast (but...) "

Thu Feb 24 14:43:57 PST 2011

"spir" <denis.spir at gmail.com> wrote in message 
news:mailman.1923.1298557650.4748.digitalmars-d at puremagic.com...
> Hello,
>
> "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, 
> Perl, PHP, Python, Ruby, ...)"
> A *very* interesting and well written article about slow & fast regex 
> engines, why and how:
> http://swtch.com/~rsc/regexp/regexp1.html
>

Yea, I've come across that before. There's definitely good stuff in there, 
and it's much better-written than a lot of things I've seen. But one thing I 
really don't like about it is that there's a fundamental relationship going 
on that he merely implies throughout the article and never clearly states 
right up-front:

---------------------------
A regex engine first needs to convert the regex to an NFA, but then there's 
three ways it can proceed:

1. Execute the NFA directly, which involves making guesses at certain points 
and then backing up if the guesses turn out wrong. This is what Perl does. 
Unfortunately, all this backing up and retrying makes it scale very poorly 
on certain complex regexes.

2. Convert the NFA to a DFA and execute the DFA. Optionally, the resulting 
DFA can then be optimized further. There's the extra upfront overhead of the 
conversion, but executing the DFA (even if it isn't optimized) never 
involves any guessing or backing up to retry, so it scales well.

3. Thompson NFA: Execute the NFA as if it were the equivalent DFA. You're 
essentially executing the DFA *as* you generate it from the NFA. (Although 
the resultant DFA doesn't actually get stored.)

Note, of course, that all this applies to any use of an NFA, not just regex.
---------------------------

Those few little paragraphs are the whole damn *point* of the article, but 
the only way to get that take-away is to stitch it together by reading the 
entire multi-page text. Without that "big picture" I had a much harder time 
than necessary understanding it the first time I read it, even though I was 
still aware that NFAs can be converted to DFAs.

Actually, that's one of the things I really hate about academic writings in 
general (not that this article is particularly academic, as far as 
academic-themed stuff goes): There's rarely a decent overview. You're 
expected to implicitly piece together the big picture *as* you learn the 
details. Which, of course, is stupid because details mean practically 
nothing without proper context. Context *helps* you understand the details, 
plus it's less to be trying to learn all at once.