Let's stop parser Hell

Christophe Travert travert at phare.normalesup.org
Thu Aug 2 00:06:25 PDT 2012


"Jonathan M Davis" , dans le message (digitalmars.D:173942), a écrit :
> It may very well be a good idea to templatize Token on range type. It would be 
> nice not to have to templatize it, but that may be the best route to go. The 
> main question is whether str is _always_ a slice (or the result of 
> takeExactly) of the orignal range. I _think_ that it is, but I'd have to make 
> sure of that. If it's not and can't be for whatever reason, then that poses a 
> problem.

It can't if it is a simple input range, like a file read with most 
'lazy' methods. Then you need either to transform the input range into 
a forward range using a range adapter that performs buffering, or to 
perform your own buffering internally. You also have to decide how long 
the token will remain valid (if you want lexing to be blazing fast, I 
believe you don't want to allocate for each token).
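To make the buffering idea concrete, here is a hypothetical sketch (the `Buffered` type and its `slice` helper are my invention, not a Phobos API): it pulls elements from an input range into an internal array on demand, so tokens can hold stable slices of that buffer instead of allocating a copy per token.

```d
import std.range.primitives : isInputRange, ElementType;

/// Hypothetical sketch, not a Phobos type: buffer an input range's
/// elements into an array so that tokens can refer to stable slices
/// of the buffer instead of allocating per token.
struct Buffered(R) if (isInputRange!R)
{
    private R source;
    private ElementType!R[] buf;   // everything read so far
    private size_t pos;            // current read position in buf

    this(R source) { this.source = source; }

    @property bool empty() { return pos == buf.length && source.empty; }

    @property ElementType!R front()
    {
        if (pos == buf.length)     // pull one more element on demand
        {
            buf ~= source.front;
            source.popFront();
        }
        return buf[pos];
    }

    void popFront() { cast(void) front; ++pos; }

    /// A token's text: a slice of the buffer, valid as long as buf lives.
    ElementType!R[] slice(size_t start, size_t end) { return buf[start .. end]; }
}
```

Whether tokens stay valid forever or only until the buffer is recycled is exactly the lifetime decision mentioned above.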

Maybe you want your lexer to work with ranges of strings too, like 
File.byLine or File.byChunk (the latter requires buffering if a chunk 
boundary falls in the middle of a token...). But that may wait until a 
nice API for files, streams, etc. is found.

> If Token _does_ get templatized, then I believe that R will end up 
> being the original type in the case of the various string types or a range 
> which has slicing, but it'll be the result of takeExactly(range, len) for 
> everything else.

A range which has slicing doesn't necessarily return its own type when 
opSlice is used, according to hasSlicing. I'm pretty sure parts of 
Phobos don't take that into account. However, the result of takeExactly 
will always be the right type, since it uses opSlice when it can, so 
you can just use that.
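For instance, on a range that does support slicing, takeExactly collapses to a true slice of the original:

```d
import std.range : takeExactly;

void main()
{
    int[] a = [1, 2, 3, 4, 5];
    auto firstThree = a.takeExactly(3);

    // int[] has slicing, so takeExactly returns a[0 .. 3] directly,
    // and the result's type is int[] itself, not a wrapper.
    static assert(is(typeof(firstThree) == int[]));
    assert(firstThree == [1, 2, 3]);
}
```

For forward ranges without slicing, the result is a Take wrapper instead, but either way typeof(takeExactly(range, len)) is a single well-defined type to put in the Token.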

Making a generic lexer that works with any forward range of dchar and 
returns a range of tokens, without performing decoding of literals, 
seems to be a good first step.
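The shape of such a lexer might look like the following skeleton. This is purely illustrative: the whitespace-splitting rule stands in for real D token rules, and the type is my own placeholder, not the lexer under discussion.

```d
import std.range.primitives : isForwardRange, ElementType;
import std.range : takeExactly;
import std.uni : isWhite;

/// Hypothetical skeleton: a lazy range of whitespace-separated "tokens"
/// over any forward range of dchar. Real token rules would replace
/// isWhite; no decoding of literals is performed.
struct SimpleLexer(R) if (isForwardRange!R && is(ElementType!R == dchar))
{
    private R input;

    this(R input) { this.input = input; skipWhite(); }

    @property bool empty() { return input.empty; }

    @property auto front()
    {
        // Measure the token with a saved copy, then slice it off
        // with takeExactly so the token type is uniform.
        size_t len;
        for (auto r = input.save; !r.empty && !isWhite(r.front); r.popFront())
            ++len;
        return input.save.takeExactly(len);
    }

    void popFront()
    {
        while (!input.empty && !isWhite(input.front)) input.popFront();
        skipWhite();
    }

    private void skipWhite()
    {
        while (!input.empty && isWhite(input.front)) input.popFront();
    }
}
```

Because front returns takeExactly's result, the same code works on strings, buffered file ranges, and anything else satisfying the constraint.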

> I just made str a string to begin with, since it was simple, and I was still 
> working on a lot of the initial design and how I was going to go about things. 
> If it makes more sense for it to be templated, then it'll be changed so that
> it's templated.

string may not be where you want to start, because it is a 
specialization for which you need optimized UTF-8 decoding.

Also, you said in this thread that you only need to consider ASCII 
characters in the lexer because non-ASCII characters only appear in 
non-keyword identifiers. That is not entirely true: EndOfLine defines 2 
non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR. 
  http://dlang.org/lex.html#EndOfLine
Maybe they should be dropped, since other non-ASCII whitespace is not 
supported. You may want the line count to be consistent with other 
programs; I don't know what text-processing programs usually consider 
an end of line.
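Those two code points are U+2028 and U+2029, so a line-counting check has to look beyond the ASCII range. A minimal sketch (the function name is mine; EndOfLine in the spec also covers \r\n pairs and end of file, which a real lexer would handle separately):

```d
/// Sketch of an EndOfLine test covering the non-ASCII cases from
/// http://dlang.org/lex.html#EndOfLine (CR/LF pairing and EOF omitted).
bool isDEndOfLine(dchar c)
{
    return c == '\r'
        || c == '\n'
        || c == '\u2028'   // LINE SEPARATOR
        || c == '\u2029';  // PARAGRAPH SEPARATOR
}

void main()
{
    assert(isDEndOfLine('\u2028'));
    assert(!isDEndOfLine('\t'));
}
```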

-- 
Christophe
