Let's stop parser Hell
Christophe Travert
travert at phare.normalesup.org
Thu Aug 2 00:06:25 PDT 2012
"Jonathan M Davis", in message (digitalmars.D:173942), wrote:
> It may very well be a good idea to templatize Token on range type. It would be
> nice not to have to templatize it, but that may be the best route to go. The
> main question is whether str is _always_ a slice (or the result of
> takeExactly) of the orignal range. I _think_ that it is, but I'd have to make
> sure of that. If it's not and can't be for whatever reason, then that poses a
> problem.
It can't if it is a simple input range, like a file read with most
'lazy' methods! Then you either need to transform the input range into
a forward range using a range adapter that performs buffering, or
perform your own buffering internally. You also have to decide how long
the token will remain valid (I believe that if you want lexing to be
blazing fast, you don't want to allocate for each token).
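As a naive illustration of the first option (the helper name `buffered` is made up, and it reads the whole input eagerly, which a real adapter would not do), any input range can be turned into a forward range by copying it into an array:

```d
import std.array : array;
import std.range : isForwardRange, isInputRange;

// Naive buffering adapter (hypothetical): copies the whole input
// range into a dynamic array, which is a forward (indeed random-
// access) range. A real adapter would buffer lazily, in bounded
// chunks.
auto buffered(R)(R input)
    if (isInputRange!R)
{
    return input.array;
}

void main()
{
    auto r = buffered("int x;"d);
    // Dynamic arrays support save(), slicing, length, etc.
    static assert(isForwardRange!(typeof(r)));
    assert(r == "int x;"d);
}
```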
Maybe you want your lexer to work with ranges of strings too, like
File.byLine or File.byChunk (the latter requires buffering if a chunk
boundary falls in the middle of a token...). But that may wait until a
nice API for files, streams, etc. is found.
> If Token _does_ get templatized, then I believe that R will end up
> being the original type in the case of the various string types or a range
> which has slicing, but it'll be the result of takeExactly(range, len) for
> everything else.
A range which has slicing doesn't necessarily return its own type when
opSlice is used, according to hasSlicing. I'm pretty sure parts of
Phobos don't take that into account. However, the result of takeExactly
will always be the right type, since it uses opSlice when it can, so
you can just use that.
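A sketch of that idea, assuming the token type is templated on the range (this `Token` struct is illustrative, not the actual implementation under discussion):

```d
import std.range : takeExactly;

// Illustrative token: its str field uses the type that takeExactly
// yields for R, which is R itself whenever R supports slicing, and
// a thin wrapper type otherwise.
struct Token(R)
{
    alias Slice = typeof(takeExactly(R.init, 0));
    Slice str;   // the lexeme, sliced out of the original input
}

void main()
{
    // int[] has slicing, so takeExactly returns int[] itself
    // and no wrapper cost is paid.
    static assert(is(Token!(int[]).Slice == int[]));
}
```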
Making a generic lexer that works with any forward range of dchar and
returns a range of tokens without performing decoding of literals seems
to be a good first step.
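As a toy stand-in for that first step (splitting on whitespace instead of real D lexing; the name `tokenize` is made up), such a front end could look like:

```d
import std.algorithm : filter, splitter;
import std.range : ElementType, isForwardRange;
import std.uni : isWhite;

// Toy "lexer": accepts any forward range of dchar and lazily yields
// whitespace-separated slices. A real D lexer would classify
// keywords, operators, literals, etc.
auto tokenize(R)(R input)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    return input.splitter!isWhite.filter!(t => !t.empty);
}

void main()
{
    import std.algorithm : equal;
    assert(equal(tokenize("int x ;"d), ["int"d, "x"d, ";"d]));
}
```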
> I just made str a string to begin with, since it was simple, and I was still
> working on a lot of the initial design and how I was going to go about things.
> If it makes more sense for it to be templated, then it'll be changed so that
> it's templated.
string may not be where you want to start, because it is a
specialization for which you need to optimize UTF-8 decoding.
Also, you said in this thread that you only need to consider ASCII
characters in the lexer because non-ASCII characters are only used in
non-keyword identifiers. That is not entirely true: EndOfLine defines
two non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
http://dlang.org/lex.html#EndOfLine
Maybe they should be dropped, since other non-ASCII whitespace is not
supported. You may want the line count to be consistent with other
programs; I don't know what text-processing programs usually consider
an end of line.
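For illustration, an end-of-line test that follows the current spec would have to include those two code points (a minimal sketch, ignoring \r\n pairing and line counting):

```d
// Per http://dlang.org/lex.html#EndOfLine, end of line includes two
// non-ASCII code points: U+2028 LINE SEPARATOR and U+2029 PARAGRAPH
// SEPARATOR, so an ASCII-only fast path needs a fallback for them.
bool isEndOfLine(dchar c)
{
    return c == '\r' || c == '\n'
        || c == '\u2028' || c == '\u2029';
}

void main()
{
    assert(isEndOfLine('\n'));
    assert(isEndOfLine('\u2028'));  // LINE SEPARATOR
    assert(!isEndOfLine(' '));
}
```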
--
Christophe