Let's stop parser Hell

Jonathan M Davis jmdavisProg at gmx.com
Wed Aug 1 01:40:41 PDT 2012


On Wednesday, August 01, 2012 10:25:18 Jacob Carlborg wrote:
> On 2012-08-01 00:38, Jonathan M Davis wrote:
> > I don't have the code with me at the moment, but I believe that the token
> > type looks something like
> > 
> > struct Token
> > {
> >   TokenType type;
> >   string str;
> >   LiteralValue value;
> >   SourcePos pos;
> > }
> > 
> > struct SourcePos
> > {
> >   size_t line;
> >   size_t col;
> >   size_t tabWidth = 8;
> > }
> 
> What about the end/length of a token? Token.str.length would give the
> number of bytes (code units?) instead of the number of characters (code
> points?). I'm not entirely sure what's needed when doing, for example,
> syntax highlighting. I assume the lexer would internally know the length
> in characters of a given token?

I'm not sure, but I don't think so. The lexer doesn't really keep track of 
code points; it operates on code units as much as possible. And pos doesn't 
really help, because any newline inside a token would make subtracting the 
start col from the end col completely bogus (tabs would mess that up pretty 
thoroughly as well, though as Christophe pointed out, the whole tabWidth 
thing may not actually have been a good idea anyway).
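
To illustrate, here's a minimal sketch, assuming the SourcePos definition 
above (the positions are made up for the example):

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8;
}

void main()
{
    // A token spanning a newline, e.g. a multi-line string literal
    // starting at column 10 of line 1 and ending at column 5 of line 2.
    auto start = SourcePos(1, 10);
    auto end = SourcePos(2, 5);

    // Subtracting columns gives a negative "length," which says nothing
    // about the token's size - and that's before tab expansion distorts
    // the columns further.
    auto width = cast(long) end.col - cast(long) start.col;
    assert(width == -5);
}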

It could certainly be added, but unless the lexer always knows it (and I'm 
pretty sure that it doesn't), keeping track of it entails extra overhead. But 
maybe it's worth that overhead. I'll have to look at what I have and see. 
Worst case, the caller can just use walkLength on str, but if it has to do 
that all the time, that's not exactly conducive to good performance.
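
For reference, a quick sketch of the code unit vs. code point difference 
(the string here is just an example):

import std.range : walkLength;

void main()
{
    // str as it would appear in a Token: a slice of UTF-8 code units.
    string str = "résumé";

    assert(str.length == 8);     // code units (bytes for string) - O(1)
    assert(str.walkLength == 6); // code points - requires decoding, O(n)
}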

- Jonathan M Davis

