Formal review of std.lexer

Joseph Cassman jc7919 at outlook.com
Fri Feb 21 08:50:13 PST 2014


On Friday, 21 February 2014 at 12:12:17 UTC, Dicebot wrote:
> http://wiki.dlang.org/Review/std.lexer
>
> This is follow-up by Brian to his earlier proposal 
> (http://wiki.dlang.org/Review/std.d.lexer). This time proposed 
> module focuses instead on generic lexer generation as discussed 
> in matching voting thread.
>
> Docs: 
> http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
> Code: 
> https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d

Thanks for all the work, Brian. I read through the previous 
threads about the development of this code (links at the bottom) 
and can see a lot of effort has gone into it. Even so, the 
following comments may come across as uninformed, but hopefully 
they will be helpful.

1. StringCache is a custom hash table. It looks like its primary 
role is to reduce some sort of duplication. Hash tables, though, 
are difficult to get right. Could a benchmark comparison be made 
against the built-in associative array to show what savings it 
brings? Since it is in the public interface, should its payload 
also be public? It is built using GC.malloc; how about using the 
in-the-works std.allocator module instead? Perhaps version 1 
could use GC.malloc, and a later PR could make it possible to 
plug in a custom allocator. That would be nice.
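
To make the benchmark request concrete, a baseline might look 
like the rough sketch below. It covers only the built-in 
associative array side. AAInterner and timeInterning are names I 
made up for illustration; StringCache would need an equivalent 
loop written against its own interface, which I have not tried to 
reproduce here.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Hypothetical baseline: deduplicates strings with the built-in
/// associative array, so equal inputs share one stored copy.
struct AAInterner
{
    private string[string] table;

    /// Returns the canonical stored copy of `s`, copying it only
    /// the first time that text is seen.
    string intern(const(char)[] s)
    {
        if (auto p = s in table)
            return *p;
        string copy = s.idup;
        table[copy] = copy;
        return copy;
    }
}

/// Times `rounds` passes of interning every word in `words`.
Duration timeInterning(const string[] words, size_t rounds)
{
    AAInterner interner;
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. rounds)
        foreach (w; words)
            interner.intern(w);
    sw.stop();
    return sw.peek;
}
```

Running the same word list through both tables and comparing the 
two durations (and the memory used) would show whether the custom 
table pays for itself.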

2. I like the fact that a range interface is provided. I realize 
that the previous discussions stipulated the use of ubyte to 
avoid encoding work during scanning. The reasoning about 
performance makes sense to me. That being the case, could a code 
example be provided showing how to use this module to scan a 
UTF-8 encoded string? Even if this is going to focus only on 
scanning code files, the D language spec allows for arbitrary 
Unicode in a code file. How is that handled when the lexer works 
on ubyte? (I have a general idea, just looking for some explicit 
code sample help.)
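
For reference, my general idea is roughly the sketch below. It is 
a hand-rolled scanner, not the module's API: identifierLength and 
leadingIdentifier are hypothetical names, and the rule "any byte 
>= 0x80 belongs to the identifier" is a simplifying assumption 
that accepts every non-ASCII character without checking whether 
it is a letter.

```d
import std.string : representation;

/// Returns the length in bytes of the identifier-like run at the
/// start of `input`, without decoding UTF-8.
size_t identifierLength(const(ubyte)[] input)
{
    size_t i = 0;
    while (i < input.length)
    {
        immutable b = input[i];
        immutable isAsciiIdent = (b >= 'a' && b <= 'z')
            || (b >= 'A' && b <= 'Z')
            || b == '_'
            || (i > 0 && b >= '0' && b <= '9');
        // Bytes >= 0x80 are lead or continuation bytes of a
        // multi-byte UTF-8 sequence; accept them undecoded.
        if (!isAsciiIdent && b < 0x80)
            break;
        ++i;
    }
    return i;
}

/// Views a UTF-8 string as bytes, scans it, and slices the
/// original string at the byte offset that was found.
string leadingIdentifier(string source)
{
    immutable(ubyte)[] bytes = source.representation; // no copy
    return source[0 .. identifierLength(bytes)];
}
```

Because the scan only ever stops on an ASCII byte or at the end 
of the input, the slice never cuts a multi-byte sequence in half, 
provided the input was valid UTF-8 to begin with.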

3. I tried to understand the reason for and usage of the 
"extraFields" parameter in "TokenStructure" but couldn't figure 
it out. Could some more explanation of its intent and usage be 
provided?

4. Do you want the commented-out pragma statement left over on 
line 601?

5. Should the template "TokenId" perhaps be something like 
"generateTokenId" instead? I am not sure what an "Id" for a token 
means. Is it an integral hash value? I had difficulty seeing how 
it ties in with the concept of "value" in the header 
documentation. If this is a numerical hash of a string token, why 
is the string still stored and used in 
"tokenStringRepresentation"? I am probably missing something big, 
but couldn't the number be used to represent the string 
everywhere, saving on time and space?
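
To make the question concrete, here is a sketch of one way an 
"Id" could be a table index rather than a hash. I do not know 
whether this matches what the module does. tokenNames, 
indexOfToken, tokenId (lower case) and tokenText are all names 
invented for this example.

```d
/// Hypothetical token table; a token's id is its position here.
static immutable string[] tokenNames =
    ["", "if", "else", "+", "identifier"];

/// Linear search that works both at compile time and at run time.
int indexOfToken(string name)
{
    foreach (i, candidate; tokenNames)
        if (candidate == name)
            return cast(int) i;
    return -1;
}

/// Compile-time mapping from token text to a small integer id.
/// Unknown names are rejected during compilation.
template tokenId(string name)
{
    static assert(indexOfToken(name) >= 0,
        "unknown token: " ~ name);
    enum ubyte tokenId = cast(ubyte) indexOfToken(name);
}

/// Reverse mapping, needed whenever the text must be shown again
/// (error messages, pretty printing).
string tokenText(ubyte id)
{
    return tokenNames[id];
}
```

Under this reading the number is used everywhere during lexing, 
and the string table survives only for the reverse lookup. If the 
module works differently, a sentence in the docs saying so would 
help.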

6. I tried but had difficulty understanding the difference 
between the four token types -- "staticTokens", "dynamicTokens", 
"possibleDefaultTokens", "tokenHandlers" -- provided as arguments 
to "Lexer". What distinguishes a token whose value changes from 
one whose value does not? I am not sure where to put my token 
definitions.

7. I would like to use the module to make a scanner for XML, 
JSON, CSV, C/C++, etc., but I wasn't able to figure out how to do 
so. The initial code example is nice, but could some additional 
guidance be provided? Also, I wasn't sure how to make use of a 
lexer once created. The documentation covers how to initialize a 
"Lexer" well, but could some guidance also be provided on how to 
use one past initialization?

8. Andrei's trie search 
(http://forum.dlang.org/thread/eeenynxifropasqcufdg@forum.dlang.org?page=4#post-l2nm7m:2416e1:241:40digitalmars.com) 
seemed like a really interesting idea. And I saw in that thread 
you continued with his ideas. Does this module incorporate that 
work? Or was it less performant in the end?
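
For anyone following along, the general shape of a trie search 
over bytes is sketched below. This is a run-time table version of 
the idea, written from my own understanding; it is not the code 
from that thread and not necessarily what the module does. 
ByteTrie and its members are hypothetical names.

```d
/// Hypothetical byte-indexed trie for longest-prefix keyword
/// matching.
struct ByteTrie
{
    private static struct Node
    {
        int[256] next;  // 0 means "no edge"; the root is node 0
                        // and is never the target of an edge
        int token = -1; // keyword index, or -1 if not accepting
    }

    private Node[] nodes;

    /// Builds a trie in which keyword `i` is reported as token `i`.
    static ByteTrie build(const string[] keywords)
    {
        ByteTrie t;
        t.nodes ~= Node.init; // root
        foreach (i, kw; keywords)
        {
            int cur = 0;
            foreach (char c; kw)
            {
                if (t.nodes[cur].next[c] == 0)
                {
                    t.nodes ~= Node.init;
                    t.nodes[cur].next[c] =
                        cast(int) t.nodes.length - 1;
                }
                cur = t.nodes[cur].next[c];
            }
            t.nodes[cur].token = cast(int) i;
        }
        return t;
    }

    /// Returns the index of the longest keyword that is a prefix
    /// of `input`, or -1 if none is. `matched` receives the
    /// number of bytes consumed by that keyword.
    int longestMatch(const(ubyte)[] input, out size_t matched) const
    {
        int cur = 0;
        int best = -1;
        foreach (i, b; input)
        {
            cur = nodes[cur].next[b];
            if (cur == 0)
                break;
            if (nodes[cur].token >= 0)
            {
                best = nodes[cur].token;
                matched = i + 1;
            }
        }
        return best;
    }
}
```

Each input byte costs one array lookup, so "=" and "==" are told 
apart in a single pass without comparing against every keyword.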

9. I ran "dmd -cov" against the module and got zero percent unit 
test coverage. Perhaps adding some test code will help clarify 
usage patterns?
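
As a small illustration of the kind of test I mean, the sketch 
below pairs a function with a unittest block. isIdentifierStart 
is a hypothetical helper standing in for any function in the 
module. Note that unittest blocks are only compiled in when 
-unittest is passed, so something like "dmd -unittest -cov -main" 
is needed for them to show up in the coverage listing.

```d
/// Hypothetical helper standing in for any function in the module.
/// Treats ASCII letters, '_' and any byte >= 0x80 as a possible
/// start of an identifier.
bool isIdentifierStart(ubyte b) pure nothrow @safe
{
    return (b >= 'a' && b <= 'z')
        || (b >= 'A' && b <= 'Z')
        || b == '_'
        || b >= 0x80;
}

unittest
{
    assert(isIdentifierStart('a'));
    assert(isIdentifierStart('_'));
    assert(!isIdentifierStart('1'));
    // 0xCF is the lead byte of a two-byte UTF-8 sequence.
    assert(isIdentifierStart(0xCF));
}
```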

You have put a lot of work into this code so I apologize if the 
above comes across as picking it apart. Just some questions I had 
in trying to make use of the code. Hopefully some of it is 
helpful.

Joseph

Other related posts
http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe@forum.dlang.org
http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf@forum.dlang.org
http://forum.dlang.org/thread/eeenynxifropasqcufdg@forum.dlang.org


