buffered input (examples?)
spir
denis.spir at gmail.com
Mon Feb 7 05:24:32 PST 2011
Hello,
Andrei Alexandrescu wrote:
"Generally laziness may be a tactics you could use to help with memory use. A
good example is split() vs. splitter(). The "er" version offers one element at
a time thus never forcing an allocation. The split() version must do all work
upfront and also allocate a structure for depositing the output."
-------
"Let's restart this. What do you want to do that you can't do with the proposed
API?"
First, I would love to read about some typical or sample uses of buffered
input (hopefully clearly explained), especially ones showing why and how
buffering is a real gain there (theoretically and concretely), and why
lower-level control over buffering is helpful, if not required.
Here is a real and current use case; I wonder whether it matches the intended
use of buffered input ranges: the interface between a lexer and a parser.
(This is the reason I have followed this thread closely, while not knowing
much about the domain.)
Currently, my lexing toolkit constructs lexers that generate the complete
lexeme stream in one go. How can this be made to work lazily, meaning
on demand from the parser? (*) Here is how I see the problem, and how to
solve it naively:
This is simple for the subset of language grammars for which --at every
syntactic level-- the first element (e.g. one lexeme) unambiguously
determines which syntactic construct may be expected, just as, at the
morphological level of most programming languages, the first character
unambiguously determines the lexeme type. (Hope I'm clear.) In this case we
only ever need to read one token/lexeme at a time: say, the buffer is a
single token.
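Here is a minimal sketch of that simple case in D, just to make the intent
concrete; the whitespace "lexer" built on std.algorithm.splitter is of course
a stand-in for a real tokenizer, nothing more:

    import std.algorithm;   // splitter
    import std.stdio;

    void main()
    {
        auto source = "let x = 1 + 2";
        // splitter() is lazy: each lexeme is produced only when the
        // consumer (here, the foreach loop standing in for a parser)
        // asks for the next one, so at any moment the "buffer" holds a
        // single token.
        foreach (lexeme; splitter(source, ' '))
            writeln(lexeme);
    }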
In the general case, things are somewhat more complicated. I won't go into
the details of how and why (see e.g. http://en.wikipedia.org/wiki/Lookahead &
http://en.wikipedia.org/wiki/Backtracking), but we need to be able to look
ahead and to backtrack when a choice pattern's sub-patterns are composite.
Not only that, but --if I do not fool myself-- the amount of lookahead
required cannot in some cases be bounded in advance (due to recursion), and
in the extreme case we may need to backtrack over everything matched so far
(if the top pattern is a choice). It seems this means an unlimited buffer.
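To illustrate (my own example, not something from this thread): in a C#-like
grammar with

    Primary -> '(' ExprList ')' '=>' Expr    (a lambda)
             | '(' ExprList ')'              (a parenthesized expression)

the two alternatives share the arbitrarily long prefix '(' ExprList ')', so
the parser only learns which branch it is in when (or if) it meets '=>' after
the matching ')'.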
Thus, here is how I see the lexer <--> parser interface:
* One or more match function(s), mainly checking the current lexeme's type,
possibly giving some more info or performing some useful actions.
* store() command: if this is the first call, start recording matched
lexemes into a buffer; push the current cursor position in the buffer onto a
stack [0, i, j, c...]
* unstore() command: backtrack, meaning discard buffer[c..$]; pop the last
cursor c
This is to be used the following way (a rough sketch in D follows these steps):
1. When starting to match a composite pattern which is a sub-pattern of a
choice, meaning we need lookahead and may need backtracking, call store().
2. When matching this pattern ends, either by success or failure, call unstore().
~ success: the matched tokens are validated and have produced one or more
AST node(s), so we do not need them in the buffer anymore.
~ failure: we need to backtrack to where matching this pattern started,
in order to try matching an alternative pattern (**)
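Here is a rough sketch, in D, of what I have in mind; it is only meant to
make the store()/unstore() idea concrete under my own assumptions (a Token
type and an underlying lazy lexer with the usual input range primitives), not
a proposal for the buffered-range API itself:

    struct Token { string type; string text; }

    struct TokenBuffer(R)   // R: any lazy range of Tokens (the lexer)
    {
        R lexer;            // underlying lazy lexer
        Token[] buf;        // lexemes recorded so far
        size_t pos;         // current cursor into buf
        size_t[] marks;     // stack of cursors, one per nested store()

        private void fill()
        {
            // Pull one more lexeme from the lexer only when needed.
            if (pos == buf.length)
            {
                buf ~= lexer.front;
                lexer.popFront();
            }
        }

        @property bool empty() { return pos == buf.length && lexer.empty; }
        @property Token front() { fill(); return buf[pos]; }
        void popFront() { fill(); ++pos; }

        // store(): remember where matching of a composite sub-pattern of a
        // choice starts.
        void store() { marks ~= pos; }

        // unstore(): on failure, backtrack to the last mark; on success,
        // just drop it.  (Releasing buf[0 .. pos] once no mark remains is
        // left out of this sketch.)
        void unstore(bool success)
        {
            if (!success)
                pos = marks[$ - 1];
            marks = marks[0 .. $ - 1];
        }
    }

The point is that memory use is driven by the parser: between a store() and
its matching unstore() the recorded lexemes must stay around, while with no
mark active each lexeme could in principle be released as soon as popFront()
has passed it.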
The possible need to store in the buffer more than the current composite
pattern's tokens, and to keep a stack of cursors, is (as you can guess) due
to pattern nesting and to self- or mutual recursion. When unstoring a
pattern's tokens, be it after validation or to backtrack, there may be a
higher-level pattern still under trial which also requires lookahead.
If the top pattern is a choice and some of its sub-patterns are composite,
then the whole stream of matched tokens needs to be kept in the buffer until
parsing of the whole source is validated or fails. I guess.
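For instance (again my own toy example, using the hypothetical TokenBuffer
above), matching the recursive rule List -> '(' (List | name) ')' pushes one
cursor per nesting level; and if List is also the top pattern, the whole
token stream stays buffered until the outermost call returns:

    // Hypothetical helper: consume one lexeme if its text matches.
    bool accept(TB)(ref TB toks, string text)
    {
        if (!toks.empty && toks.front.text == text)
        {
            toks.popFront();
            return true;
        }
        return false;
    }

    // List -> '(' (List | name) ')'
    bool matchList(TB)(ref TB toks)
    {
        toks.store();                      // one mark per nesting level
        bool ok = accept(toks, "(")
               && (matchList(toks) || accept(toks, "name"))
               && accept(toks, ")");
        toks.unstore(ok);                  // backtrack only on failure
        return ok;
    }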
Does this have anything to do with currently discussed buffered input ranges?
If yes, how does such a design, or any alternative, fit their proposed interface?
Denis
(*) I doubt there is any great advantage to be expected from avoiding lexing
the whole source (typically one module file) in one go. I am rather exploring
the topic for its own sake.
(**) There may be match memoisation (e.g. so-called "packrat parsing") to
avoid re-matching the same pattern type at the same position, but this is
another story, and it happens at a higher level than individual tokens as
such.
--
_________________
vita es estrany
spir.wikidot.com