Let's stop parser Hell

Philippe Sigaud philippe.sigaud at gmail.com
Tue Jul 31 23:20:04 PDT 2012


On Wed, Aug 1, 2012 at 8:11 AM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
>> Is it based on a prioritized list of regexes?
>
> I'm not using regexes at all. It's using string mixins to reduce code
> duplication, but it's effectively hand-written. If I do it right, it should be
> _very_ difficult to make it any faster than it's going to be. It even
> specifically avoids decoding Unicode characters and operates on ASCII
> characters as much as possible.

OK. It'll be more difficult to genericize, then, but that's not your
problem (it could be mine, though).
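
(For the archives, here is roughly how I picture the mixin approach --
purely a guess on my side, every name below is invented:)

import std.stdio;

enum TokenType { plus, minus, star, eof }

struct Token
{
    TokenType type;
    string str; // slice of the source, no copy
}

// One string mixin per single-character token folds away the
// repetitive case bodies:
template singleCharCase(char c, string type)
{
    enum singleCharCase =
        `case '` ~ c ~ `': return Token(TokenType.` ~ type
        ~ `, source[0 .. 1]);`;
}

Token lexOne(string source)
{
    // Dispatch on the raw char: no Unicode decoding is needed for
    // ASCII-only tokens (error handling elided):
    switch (source.length ? source[0] : '\0')
    {
        mixin(singleCharCase!('+', "plus"));
        mixin(singleCharCase!('-', "minus"));
        mixin(singleCharCase!('*', "star"));
        default: return Token(TokenType.eof, source[0 .. 0]);
    }
}

void main()
{
    writeln(lexOne("+x")); // the '+' token
}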


> Well, whatever is using the lexer is going to have to make sure that what it
> passes to the lexer continues to exist while it uses the lexer.

Yeah, I realized that just after posting. And anyway, the tokens are
normally meant to be consumed all at once.
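
(A minimal sketch of what I mean -- the Token type here is invented,
the point being that .str is a slice of the source, not a copy:)

struct Token
{
    string str; // slice into the original source, never a copy
}

void main()
{
    string source = "foo bar";
    auto tok = Token(source[0 .. 3]);
    // Same memory as the source -- so the source must outlive every
    // token taken from it:
    assert(tok.str.ptr is source.ptr);
}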

> I'll have to look at that to see whether using Algebraic is better. I'm not
> super-familiar with std.variant, so I may have picked the wrong type. However,
> VariantN already holds a specific set of types (unlike Variant), so that
> part isn't a problem.

OK, I forgot about VariantN (I'm not that used to std.variant either).
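
For reference, what I had in mind was along these lines (the set of
member types is made up):

import std.variant;

// Algebraic!(T...) is just a VariantN restricted to the listed types,
// sized to fit the largest of them:
alias Value = Algebraic!(string, long, real);

void main()
{
    Value v = 42L;
    assert(v.type == typeid(long));
    v = "hello";
    assert(v.get!string == "hello");
}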


> I'm not supporting any symbols which are known to be scheduled for deprecation
> (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is
> to-be-deprecated keywords (e.g. volatile and delete), since they still won't
> be legal to use as identifiers. And there's a function to query a token as to
> whether it's using a deprecated or unused keyword so that the program using
> the lexer can flag it if it wants to.

Good idea. A solid set of query functions would be a nice complement;
something like the sketch below:

- isDoc
- isComment
- isString
- isDeprecated
...
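
A rough sketch of what I mean -- all the names and TokenType members
below are my own invention:

enum TokenType { docComment, blockComment, lineComment,
                 stringLiteral, identifier /* ... */ }

struct Token { TokenType type; string str; }

bool isDoc(Token t)    { return t.type == TokenType.docComment; }
bool isString(Token t) { return t.type == TokenType.stringLiteral; }

bool isComment(Token t)
{
    return t.type == TokenType.docComment
        || t.type == TokenType.blockComment
        || t.type == TokenType.lineComment;
}

bool isDeprecated(Token t)
{
    // e.g. `volatile` and `delete`: still reserved, but on the way
    // out, so the tool on top may want to flag them
    return t.str == "volatile" || t.str == "delete";
}

void main()
{
    auto t = Token(TokenType.stringLiteral, `"hi"`);
    assert(t.isString && !t.isComment); // UFCS calls
}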


>> That seems reasonable enough, but can you really store a dstring
>> literal in a string field?
>
> Yeah. Why not? The string is the same in the source code regardless of the
> type of the literal. The only difference is the letter tacked onto the end.
> That will be turned into the appropriate string type by the semantic analyzer,
> but the lexer doesn't care.

I don't get it. Say I have a literal containing non-ASCII characters:
how will it be stored inside the .str field as a string?
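
To make my question concrete (a hypothetical snippet, with hand-counted
slice indices):

import std.stdio;

void main()
{
    // A line of D source containing a dstring literal:
    string sourceLine = `auto s = "日本語"d;`;
    // If I follow the above, token.str is just the raw slice of the
    // source, suffix included -- and since the source is UTF-8 to
    // begin with, non-ASCII code points fit in a string unconverted:
    string tokenStr = sourceLine[9 .. $ - 1];
    writeln(tokenStr); // prints: "日本語"d
}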


> Basically, the lexer that I'm writing needs to be 100% compliant with the D
> spec and dmd (which means updating the spec or fixing dmd in some cases), and
> it needs to be possible to build on top of it anything and everything that dmd
> does that would use a lexer (even if it's not the way that dmd currently does
> it) so that it's possible to build a fully compliant D parser and compiler on
> top of it as well as a ddoc generator and anything else that you'd need to do
> with a lexer for D. So, if you have any questions about what my lexer does (or
> is supposed to do) with regards to the spec, that should answer it. If my
> lexer doesn't match the spec or dmd when I'm done (aside from specific
> exceptions relating to stuff like deprecated symbols), then I screwed up.

That's a lofty goal, but it would indeed be quite good to have an
officially integrated lexer in Phobos that would (as Andrei said) "be
it". The token spec would be the lexer.

