Let's stop parser Hell
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Jul 31 14:38:26 PDT 2012
On 01-Aug-12 01:10, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias at pankrath.net> wrote:
>> On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:
>>
>>> I know I'm a little late coming into this conversation. This seems like a
>>> nice thread to toss myself into. I've started working on a generic lexer
>>> that is based off of a defined grammar.
>>
>>
>> Every helping hand is appreciated :-)
>
> Hi Kai, welcome here!
>
>>> As I read through the thread (I unfortunately don't have enough time to
>>> read every post, but I skimmed through as many as I could, and read the ones
>>> that seemed important), it seems like we need a parser in D that can lex D,
>>> and provide a Range of tokens that could be consumed.
>>
>>
>> Yes. To make this statement more precise: We need a lexer that provides
>> a range of tokens and we need a parser which makes it possible to build
>> an AST. Pegged combines both approaches but imposes an overhead if you
>> just need a token list. However I'm not sure if this is a problem.
>
> I think we need a tokenizer generator and a parser generator. They can
> be grouped or not, but we need both, in Phobos.
>
> We also need to define what's needed in a token. Kai's approach is OK,
> but what's the _type field for an operator or and identifier?
> Also, a lexer can fill a symbol table and produce identifier tokens
> that refer to a particular entry in the symbol table.
>
Yeah.
> I guess creating a tree of symbol tables according to scope visibility
> is then more the job of the parser, but I'm not sure.
>
Parser can use constant IDs for nested tables, IDs point to string table.
String table is populated by lexer.
>
>> I've only glimpsed at your code. For most languages lexing is far more
>> expensive then parsing
>
> Is that so?
It usually is. Unless parser is overly generic as GLL/GLR (not to claim
that it's very slow but GLL/GLR are slower for obvious reasons).
>
>> and thus the lexer has to be very fast and I wouldn't
>> pursue your approach and instead use something like ragel. It already has D
>> output but needs a separate build step.
Have to agree here if anything a better DFA generator is needed, current
std.regex can't get as good in this field because of:
a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in
ctRegex via codegen)
b) full unicode features commitment (again expect some improvement here
in near future)
c) designed to take multiple captures from matched piece of text.
I'm not sure when (or even if) std.regex will ever special case all of
the above.
>
> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.
>
> The only problem I see in Kai's approach (which is the standard one, a
> prioritized list of regexes) is how to recognize tokens that are not
> regular (I mean, that cannot be encoded as a regex), like nesting
> comments?
See below.
> How does the lexer know when to stop producing a 'comment'
> token?
Call special function skipComment when lexer hits comment_start pseudotoken.
Typically lexeres are regular as it allows them to be fast.
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list