Writing a really fast lexer

Fri Dec 11 20:19:49 UTC 2020

On Fri, Dec 11, 2020 at 07:49:12PM +0000, vnr via Digitalmars-d-learn wrote:
> For a project with good performance, I would need to be able to
> analyse text. To do so, I would write a parser by hand using the
> recursive descent algorithm, based on a stream of tokens. I started
> writing a lexer with the d-lex package
> (https://code.dlang.org/packages/d-lex), it works really well,
> unfortunately, it's quite slow for the number of lines I'm aiming to
> analyse (I did a test, for a million lines, it lasted about 3
> minutes). As the parser will only have to manipulate tokens, I think
> that the performance of the lexer will be more important to consider.
> Therefore, I wonder what resources there are, in D, for writing an
> efficient lexer. I could of course write it by hand, but I would be
> interested to know what already exists, so as not to reinvent the
> wheel. Of course, if the standard library (or its derivatives, such as
> mir) has features that could be of interest to me for this lexer, I am
> interested. Thanks to you :)

If you want a *really* fast lexer, I recommend using GNU Flex
(https://github.com/westes/flex/).  Unfortunately, AFAIK it does not
support D directly; it generates a lexer in C that you then compile.
Fortunately, you can interface with the generated C code quite easily
from D.

I took a quick glance at d-lex, and immediately noticed that every rule
match incurs a GC allocation, which is sure to cause a performance
slowdown.  It does have a nice API, but for a truly fast lexer at the
very least you need to compile the lexing rules down into a fast IR,
which d-lex does not do.  So it's unsurprising that performance isn't
the best.

Some things to watch out for when writing high-performance
string-processing in D:

1) Avoid GC allocations as much as possible.  If some object needs to be
created frequently (e.g., tokens), try your best to make it a struct
rather than a class, to avoid incurring allocation overhead and to
decrease GC pressure. (This is a common mistake by people who write
lexers/parsers in D.)

2) Use slices of the input (e.g. for returning tokens) as much as
possible, avoid .dup/.idup as much as you can.  This may not always be
possible if you're reading from a file; if that's the case, consider
using std.mmfile to memory-map the input into memory so that you can
freely slice over the entire input without needing to handle
caching/paging yourself.  Ideally, your lexer should simply return
slices over the input, wrapped in a struct with an enum tag to indicate
what type of token it is.  Don't bother with things like converting
integers/floats in the input until semantic analysis, or when your
parser needs to create AST nodes that store the actual values. This
eliminates the need for polymorphic token types, which tends to slow
lexers down.

3) Avoid autodecoding unless you need it. Use .byCodeUnit when
processing strings with Phobos algorithms, unless the correctness of
what you need to do happens to depend on decoding Unicode code points.

T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.