Let's stop parser Hell

Wed Aug 1 13:10:15 PDT 2012

On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:
> But if you read a source file which is encoded using UTF-16 you would
> need to re-encode that to store it in the "str" field in your Token struct?

Currently, yes.

> If that's the case, wouldn't it be better to make Token a template to be
> able to store all Unicode encodings without re-encoding? Although I
> don't know if that will complicate the rest of the lexer.

It may very well be a good idea to templatize Token on range type. It would be 
nice not to have to templatize it, but that may be the best route to go. The 
main question is whether str is _always_ a slice (or the result of 
takeExactly) of the original range. I _think_ that it is, but I'd have to make 
sure of that. If it's not and can't be for whatever reason, then that poses a 
problem. If Token _does_ get templatized, then I believe that R (the type that
it's templated on) will end up being the original range type for the various
string types and for ranges which have slicing, but it'll be the result type
of takeExactly(range, len) for everything else.
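
As a rough sketch of what that could look like (this is not the lexer's
actual code; the helper name StrTypeOf is made up for illustration, and
Token's other fields are left out), the type of str could be chosen from the
source range type something like this:

```d
import std.range : hasSlicing, takeExactly;
import std.traits : isSomeString;

// Hypothetical sketch: Token templated on the type R used to store its text.
struct Token(R)
{
    R str; // the token's text, kept in the source's own encoding
}

// Hypothetical helper which picks R for a given source range type.
template StrTypeOf(Source)
{
    // Strings and ranges with slicing can hand back a slice of themselves.
    static if (isSomeString!Source || hasSlicing!Source)
        alias StrTypeOf = Source;
    // Everything else stores whatever takeExactly(range, len) returns.
    else
        alias StrTypeOf = typeof(takeExactly(Source.init, size_t.init));
}

// A UTF-16 source would then get tokens holding wstring slices.
static assert(is(Token!(StrTypeOf!wstring) == Token!wstring));
static assert(is(Token!(StrTypeOf!string) == Token!string));
```

With something along those lines, lexing a wstring would never need to
re-encode anything just to fill in str. Whether it actually works out depends
on str always being a slice (or takeExactly result) of the original range.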

I just made str a string to begin with, since it was simple, and I was still 
working on a lot of the initial design and how I was going to go about things. 
If it makes more sense for it to be templated, then I'll change it.

- Jonathan M Davis