Let's stop parser Hell

Thu Aug 2 00:37:36 PDT 2012

On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:
> "Jonathan M Davis", in message (digitalmars.D:173942), wrote:
> > It may very well be a good idea to templatize Token on range type. It
> > would be nice not to have to templatize it, but that may be the best
> > route to go. The main question is whether str is _always_ a slice (or the
> > result of takeExactly) of the original range. I _think_ that it is, but
> > I'd have to make sure of that. If it's not and can't be for whatever
> > reason, then that poses a problem.
> 
> It can't if it is a simple input range! Like a file read with most
> 'lazy' methods. Then you need either to transform the input range into a
> forward range using a range adapter that performs buffering, or perform
> your own buffering internally. You also have to decide how long the
> token will be valid (I believe if you want lexing to be blazing fast,
> you don't want to allocate for each token).

My lexer specifically requires a forward range. The more that I deal with input
ranges, the more that I'm convinced that they're nearly useless. If you need
even _one_ character of lookahead, then an input range doesn't fly at all, and
considering the performance gains from using slicing (or takeExactly), I just
don't think that it makes sense to operate on an input range. Besides, if
anyone wants full performance, they'll probably need to use one of the built-in
string types. Any range of dchar which needs to decode on the call to front
or popFront will take a serious performance hit. It'll work, but it's not
really advisable if you need performance.
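
To make the lookahead and slicing points concrete, here is a minimal sketch.
It is not code from the lexer under discussion: nextIs and takeIdentifier are
hypothetical names, and takeIdentifier assumes ASCII-only identifiers purely to
keep the example short.

```d
import std.ascii : isAlphaNum;
import std.range;

/// Hypothetical helper: reports whether the element _after_ the current
/// front equals `expected`, without consuming anything from the caller's
/// range. It relies on `save`, so a plain input range is rejected at
/// compile time by the template constraint.
bool nextIs(R, E)(R r, E expected)
    if (isForwardRange!R)
{
    auto copy = r.save;          // independent position in the range
    if (copy.empty)
        return false;
    copy.popFront();             // step past the current element
    return !copy.empty && copy.front == expected;
}

/// Hypothetical helper: splits a leading identifier off a string by
/// scanning code units and slicing. No decoding and no allocation;
/// the returned token is a slice of the original input.
/// Assumption: identifiers are ASCII letters, digits and '_' only.
string takeIdentifier(ref string src)
{
    size_t i = 0;
    while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_'))
        ++i;
    auto tok = src[0 .. i];      // slice, not a copy
    src = src[i .. $];           // advance past the token
    return tok;
}
```

With a string input, nextIs("/*x", '*') answers the "is this the start of a
block comment" kind of question without consuming the '/', and takeIdentifier
returns a slice of the source rather than a freshly allocated string.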

> Also, you said in this thread that you only need to consider ANSI
> characters in the lexer because non-ANSI characters are only used in
> non-keyword identifiers. That is not entirely true: EndOfLine defines 2
> non-ANSI characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
>   http://dlang.org/lex.html#EndOfLine
>   Maybe they should be dropped, since other non-ANSI whitespace characters
> are not supported. You may want the line count to be consistent with other
> programs. I don't know what text-processing programs usually consider an
> end of line.

I'm well aware of all of that. Almost all of the lexer can operate entirely on
ASCII, with a few exceptions, and even in some of those cases, decoding isn't
required (e.g. lineSep and paraSep can be dealt with as code units rather than
having to decode to compare against them). The lexer that I'm writing will
follow the spec. And any lexer that wants to get into Phobos will need to do 
the same. So, stuff like lineSep and the end of file characters that the spec 
has will be supported.
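
As an illustration of handling lineSep and paraSep as code units, here is a
rough sketch for UTF-8 input. It is not the lexer's actual code:
endOfLineLength is a hypothetical name, and the end-of-file cases from the spec
are left out. In UTF-8, U+2028 encodes as E2 80 A8 and U+2029 as E2 80 A9, so
three byte comparisons are enough.

```d
/// Hypothetical helper: returns the length in UTF-8 code units of the
/// line terminator starting at src[i], or 0 if there is none there.
/// Covers \n, \r, \r\n, U+2028 (lineSep) and U+2029 (paraSep).
/// The end-of-file cases are deliberately not handled in this sketch.
size_t endOfLineLength(const(char)[] src, size_t i)
{
    if (i >= src.length)
        return 0;

    immutable c = src[i];
    if (c == '\n')
        return 1;
    if (c == '\r')
        return (i + 1 < src.length && src[i + 1] == '\n') ? 2 : 1;

    // lineSep and paraSep share the two leading code units E2 80,
    // so only the third one tells them apart from other characters.
    if (c == 0xE2 && i + 2 < src.length && src[i + 1] == 0x80
        && (src[i + 2] == 0xA8 || src[i + 2] == 0xA9))
        return 3;

    return 0;
}
```

The common path stays a single comparison against ASCII values; the multi-byte
check only runs when the lead byte 0xE2 shows up.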

- Jonathan M Davis