[OT] Partial-reparsing thory?
Nick Sabalausky
a at a.a
Thu Apr 14 17:40:33 PDT 2011
"spir" <denis.spir at gmail.com> wrote in message
news:mailman.3527.1302824019.4748.digitalmars-d at puremagic.com...
> On 04/15/2011 12:51 AM, Nick Sabalausky wrote:
>> Is anyone aware of any (formal or informal)
>> theory/information/research/articles/etc. that's already been done on the
>> idea of reparsing *part* of an already-parsed source? Either lexing,
>> parsing
>> or both together. Preferably LR, but LL might be helpful, too. (I'd
>> imagine
>> it may be notably more difficult with LR. May need to keep track of all
>> the
>> parsing steps taken the first time around, and may need to continue
>> re-parsing to the end of the source.)
>>
>> I know code editors deal with that sort of thing a lot, but my
>> understanding
>> is they often accept a certain amount of inaccuracy (for the sake of
>> being
>> more line-oriented) and also tend to be more lexing-oriented and very
>> light
>> on parsing.
>
> This is a very interesting question.
>
> I guess it may have to play with "diffs", as a required tool to identify
> differences in-relation-to original source. The result may be a set of
> intervals (each a pair of indices) in both original and modified sources.
> Then, if ever nodes keep track of indices in original source, one may be
> able to identify the one node covering (all of) a diven piece of diff.
> This would be what needs be reparsed.
> If ever this succeds, then one can pass to next diff, with an
> adjustment/offset of indices which should normally be already given by the
> diff tool.
> Well, this superficial view may be totally wrong, and/or biased toward PEG
> parsing which I'm most familiar with. I guess things may well become more
> complicated with the common approach of 2-separated-pass lexing + parsing;
> at least, you'd need offset adjustment successively on character- and then
> on lexeme- indices.
>
> There may also be some similarity with the (apparently) simpler problem of
> error recovery: restart parsing "somewhere" after a match failure.
>
Yea. My thought was to keep note, for each node, of all the start/end
indicies into the source. When a chunk of text is changed, the parse tree
(or list of tokens if it's lex-only) gets notified about what got changed
and where (Maybe the parse tree would even *be* the primary internal
representation of the source). Then it can check through the tree and know
what needs to be re-lexed/re-parsed.
But I guess what I'm more interested in is the mechanics of actually doing
that re-parsing once you know what sub-tree (or sub-trees) you need to
re-parse. I guess it may seem simple, just run the lexer/parser on that
"dirty" subset of the code (using the root of the affected sub-tree as the
"starting nonterminal" instead of using the grammar's usual "starting
nonterminal") and replace the "dirty" subtrees with the fresh new ones (and
then update the start/end indicies of the rest of the tree, perhaps lazily.)
But I suspect there may be some notable gotchas, especially relating to
either LR parsing (since it's bottom-up, but you're tying to fit it into a
known "top" - then again, I guess that's normally true of LR anyway) or just
what to do with the nodes that come from after the affected part of the
source (it could probably cause some sort of cascade, or at least a
partial-cascade). And maybe certain sub-trees of the affected sub-trees
could be skipped. Etc.
I'd imagine lex-only would be a lot easier.
More information about the Digitalmars-d
mailing list