std.d.lexer : voting thread

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Oct 11 02:44:56 PDT 2013


08-Oct-2013 04:16, Andrei Alexandrescu wrote:
> On 10/4/13 5:24 PM, Andrei Alexandrescu wrote:
> To put my money where my mouth is, I have a proof-of-concept tokenizer
> for C++ in working state.
>
> http://dpaste.dzfl.pl/d07dd46d
>
> It contains some rather unsavory bits (I'm sure a ctRegex would be nicer
> for parsing numbers etc), but it works on a lot of code just swell.

No - ctRegex as it stands right now is too generic and conservative with 
the code it generates, so "\d+" would:
a) use full Unicode semantics for "Number";
b) keep tabs on where to return, as if there were ambiguity about how 
many '\d' it may eat (no maximal munch etc.). The reason is that said 
'\d+' may sit in the middle of some pattern (e.g. 9\d+0), and the fact 
that it's unambiguous on its own is not exploited.

Both are quite suboptimal, and there is a long road I'm going to take to 
reach a _general_ solution for both points. One day we will get there. 
ATM just hack your way through if the pattern is that simple.

> Most importantly, there's a clear distinction between the generic core
> and the C++-specific part. It should be obvious how to use the generic
> matcher for defining a D tokenizer.



> Token representation is minimalistic and expressive. Just write tk!"<<"
> for left shift, tk!"int" for int etc. Typos will be detected during
> compilation. One does NOT need to define and use TK_LEFTSHIFT or TK_INT;
> all needed by the generic tokenizer is the list of tokens. In return, it
> offers an efficient trie-based matcher for all tokens.
>
> (Keyword matching is unusual in that keywords are first found by the
> trie matcher, and then a simple check figures whether more characters
> follow, e.g. "if" vs. "iffy".

+1

> Given that many tokenizers use a hashtable
> anyway to look up all symbols, there's no net loss of speed with this
> approach.)

Yup. The only benefit is a slimmer giant switch.
Another "hybrid" option is, instead of a hash table, to use a generated 
keyword trie searcher as a separate function and just test each 
identifier with it. This is what std.d.lexer does, and it's quite fast 
(awaiting the latest benchmarks).

> The lexer generator compiles fast and should run fast. If not, it should
> be easy to improve at the matcher level.
>
> Now, what I'm asking for is that std.d.lexer builds on this design
> instead of the traditional one. At a slight delay, we get the proverbial
> fishing rod IN ADDITION TO the equally proverbial fish, FOR FREE. It
> is quite evident there's a bunch of code sharing going on already
> between std.d.lexer and the proposed design, so it shouldn't be hard to
> effect the adaptation.

Agreed. Let us take a moment to incorporate a better design.

> So with this I'm leaving it all within the hands of the submitter and
> the review manager. I didn't count the votes, but we may have a "yes"
> majority built up. Since additional evidence has been introduced, I
> suggest at least a revote. Ideally, there would be enough motivation for
> Brian to suspend the review and integrate the proposed design within
> std.d.lexer.
>
>
> Andrei
>


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list