Let's stop parser Hell

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Jul 31 22:48:28 PDT 2012


On 01-Aug-12 02:01, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
> <dmitry.olsh at gmail.com> wrote:
>
>>> I guess creating a tree of symbol tables according to scope visibility
>>> is then more the job of the parser, but I'm not sure.
>>>
>> Parser can use constant IDs for nested tables, IDs point to string table.
>> String table is populated by lexer.
>
> The latter, I get. The former, not so much.
>
>
>>>> I've only glanced at your code. For most languages lexing is far more
>>>> expensive than parsing
>>>
>>>
>>> Is that so?
>>
>>
>> It usually is, unless the parser is overly generic, as with GLL/GLR (not to
>> claim that they're very slow, but GLL/GLR are slower for obvious reasons).
>
> I'd have thought that, seeing as 'the end result' is more complicated
> (richer, if I may say so) for parsing, then parsing is more expensive.
> I'm reading on GLR, and was wondering what the practical limits are.
> Do people use GLR to parse thousands of pages of text?
>
>
>> Have to agree here; if anything, a better DFA generator is needed. The
>> current std.regex can't get as good in this field because of:
>>
>> a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in
>> ctRegex via codegen)
>> b) its commitment to full Unicode features (again, expect some improvement
>> here in the near future)
>> c) being designed to take multiple captures from the matched piece of text.
>>
>> I'm not sure when (or even if) std.regex will ever special case all of the
>> above.
>
> Well,
>
> - for a lexer lookahead is sometimes useful (the Dragon book cites the
> FORTRAN grammar, for which keywords are not reserved, so when you
> encounter IF you don't know if (!) it's a function call or a 'real'
> if)

Well, while lookahead would help, there are simpler ways, e.g.:

regex("IF\ZYOUR_LOOKAHEAD");

\Z means that you capture text only up to \Z; the pattern is still 
regular. I think Boost.Regex supports this. We could have supported it 
cleanly, but chose to follow the ECMA-262 standard instead.
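The same "keyword plus context, without consuming the context" trick can be done with a plain capture group, which also stays regular. A minimal sketch in Python (names and the helper are illustrative, not std.regex API): match keyword and context as one ordinary pattern, report only the keyword group as the token, and resume lexing at the end of that group.

```python
import re

# FORTRAN-style: 'IF' only counts as a keyword when '(' follows it.
pattern = re.compile(r"(IF)\(")

def lex_if(text, pos=0):
    m = pattern.match(text, pos)
    if not m:
        return None
    # The token is group 1; resume scanning at its end, re-reading the '('.
    return m.group(1), m.end(1)

print(lex_if("IF(A) = 1"))  # ('IF', 2)
print(lex_if("IFX"))        # None
```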


The critical difference is that lookahead can be used like this:

regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");
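To make the difference concrete, here is a small sketch in Python's re (pattern strings are illustrative): a lookahead is zero-width, so the part of the pattern after it scans the *same* characters the lookahead already inspected, which is exactly what pushes the pattern outside plain regular matching.

```python
import re

# (?=look) consumes nothing, so 'lookahead' then re-reads 'look'.
with_look = re.compile(r"blah(?=look)lookahead")
print(bool(with_look.match("blahlookahead")))  # True: 'look' is read twice

# Plain concatenation would need the text to contain 'look' twice.
plain = re.compile(r"blahlooklookahead")
print(bool(plain.match("blahlookahead")))      # False
```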

> - Unicode is needed to lex D correctly, no?

Not that much of Unicode; e.g. the lexer could skip decoding altogether 
most of the time.
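A hedged sketch of that "skip decoding" idea (function names and the identifier test are illustrative, not from any real lexer): scan raw UTF-8 bytes, take an ASCII fast path for lead bytes below 0x80, and only pay for decoding on the rare non-ASCII code point.

```python
def scan_identifier(src: bytes, pos: int) -> int:
    """Advance past an identifier, decoding UTF-8 only when necessary."""
    while pos < len(src):
        b = src[pos]
        if b < 0x80:                       # ASCII fast path: no decoding
            if not (b == 0x5F or           # '_'
                    0x30 <= b <= 0x39 or   # 0-9
                    0x41 <= b <= 0x5A or   # A-Z
                    0x61 <= b <= 0x7A):    # a-z
                break
            pos += 1
        else:
            # Rare slow path: decode one code point and check it.
            # (A real lexer would consult proper identifier tables.)
            n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            ch = src[pos:pos + n].decode("utf-8")
            if not ch.isalpha():
                break
            pos += n
    return pos

print(scan_identifier(b"foo_1 = 2", 0))  # 5
```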

> - multiple captures doesn't seem necessary *for a lexer*
>
>
>>> How does the lexer know when to stop producing a 'comment'
>>> token?
>>
>>
>> Call a special function skipComment when the lexer hits a comment_start
>> pseudotoken.
>
> Ah, and this special function must somehow maintain a stack, to
> 'count' the comment_start and comment_end pseudotokens. So in a way,
> it quits the regular world to become temporarily more powerful.
>
>> Typically lexers are regular, as that allows them to be fast.
>
> Hmm, but then it has to treat comments as one token. So no Ddoc love?
>
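Right, and since there is only one kind of opening bracket, the "stack" you describe collapses to a depth counter. A minimal sketch of such a skipComment for D's nestable /+ +/ comments, in Python (name and signature are illustrative):

```python
def skip_nested_comment(src: str, pos: int) -> int:
    """pos points just past an opening '/+'; return the index past
    the matching '+/', tracking nesting with a simple counter."""
    depth = 1
    while pos < len(src) and depth > 0:
        if src.startswith("/+", pos):
            depth += 1
            pos += 2
        elif src.startswith("+/", pos):
            depth -= 1
            pos += 2
        else:
            pos += 1
    return pos

text = "/+ outer /+ inner +/ still outer +/ code"
print(skip_nested_comment(text, 2))  # 35, just past the final '+/'
```

The lexer itself stays regular; only this one escape hatch is context-free, and even then a counter suffices. Ddoc would be handled the same way, by keeping the comment's text around as the token's payload instead of discarding it.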


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list