Let's stop parser Hell

Philippe Sigaud philippe.sigaud at gmail.com
Tue Jul 31 15:01:05 PDT 2012


On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
<dmitry.olsh at gmail.com> wrote:

>> I guess creating a tree of symbol tables according to scope visibility
>> is then more the job of the parser, but I'm not sure.
>>
> The parser can use constant IDs for the nested tables; the IDs point into a
> string table. The string table is populated by the lexer.

The latter, I get. The former, not so much.
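
If I read the string-table half right, it's something like this rough
sketch (all names here are mine, not Dmitry's):

enum TokenKind { identifier /*, ... */ }

// The lexer interns each identifier once; tokens then carry only
// the resulting constant ID instead of the text itself.
struct StringTable
{
    string[] strings;    // ID -> string
    size_t[string] ids;  // string -> ID

    size_t intern(string s)
    {
        if (auto p = s in ids)
            return *p;
        strings ~= s;
        return ids[s] = strings.length - 1;
    }
}

struct Token
{
    TokenKind kind;
    size_t id; // index into the string table
}

What I still don't see is how the parser gets from those constant IDs
to a *tree* of nested symbol tables.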


>>> I've only glanced at your code. For most languages lexing is far more
>>> expensive than parsing
>>
>>
>> Is that so?
>
>
> It usually is, unless the parser is overly generic, as with GLL/GLR (not to
> claim that they're very slow, but GLL/GLR are slower for obvious reasons).

I'd have thought that, since 'the end result' of parsing is more complicated
(richer, if I may say so), parsing would be the more expensive step.
I'm reading up on GLR, and was wondering what the practical limits are.
Do people use GLR to parse thousands of pages of text?


> Have to agree here. If anything, a better DFA generator is needed; the
> current std.regex can't get as good in this field because of:
>
> a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in
> ctRegex via codegen)
> b) its commitment to full Unicode features (again, expect some improvement
> here in the near future)
> c) being designed to take multiple captures from the matched piece of text.
>
> I'm not sure when (or even if) std.regex will ever special-case all of the
> above.

Well,

- for a lexer, lookahead is sometimes useful (the Dragon book cites the
FORTRAN grammar, in which keywords are not reserved, so when you
encounter IF you don't know if (!) it's a function call or a 'real'
if); see the sketch after this list
- Unicode is needed to lex D correctly, no?
- multiple captures don't seem necessary *for a lexer*
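
To make the first point concrete: even lexing D takes a one-character
peek after '/'. A minimal sketch, with names of my own choosing:

enum TokenKind { slash, lineComment, blockComment, nestedComment }

// One character of lookahead is enough to decide what a '/' starts.
TokenKind classifySlash(const(char)[] input)
{
    assert(input.length != 0 && input[0] == '/');
    if (input.length > 1)
    {
        if (input[1] == '/') return TokenKind.lineComment;   // // ...
        if (input[1] == '*') return TokenKind.blockComment;  // /* ... */
        if (input[1] == '+') return TokenKind.nestedComment; // /+ ... +/
    }
    return TokenKind.slash; // plain division operator
}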


>> How does the lexer know when to stop producing a 'comment'
>> token?
>
>
> Call a special function, skipComment, when the lexer hits a comment_start
> pseudotoken.

Ah, and this special function must somehow maintain a counter (a stack over a
single symbol) to balance the comment_start and comment_end pseudotokens. So,
in a way, it quits the regular world to become temporarily more powerful.
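
Something like this is what I imagine for D's /+ +/ comments (a sketch
under my own assumptions; a real skipComment would also report an
unterminated comment instead of silently hitting end of input):

// A single counter suffices: there is only one kind of delimiter,
// so the 'stack' degenerates to a nesting depth.
size_t skipNestedComment(const(char)[] input)
{
    assert(input.length >= 2 && input[0 .. 2] == "/+");
    size_t depth = 1;
    size_t i = 2;
    while (depth > 0 && i + 1 < input.length)
    {
        if (input[i .. i + 2] == "/+")      { ++depth; i += 2; }
        else if (input[i .. i + 2] == "+/") { --depth; i += 2; }
        else
            ++i;
    }
    return i; // index just past the closing +/ (or end of input)
}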

> Typically lexers are regular, as that allows them to be fast.

Hmm, but then it has to treat a comment as one token. So no Ddoc love?
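
Unless the comment stays one token but keeps its slice of the source, so
a later Ddoc pass loses nothing? Just my guess, sketched:

enum TokenKind { comment /*, ... */ }

// The whole /** ... */ block is a single token, but its text
// survives for a later documentation pass.
struct Token
{
    TokenKind kind;
    const(char)[] text; // slice of the source buffer, zero copies
}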

