buffered input (examples?)

spir denis.spir at gmail.com
Mon Feb 7 05:24:32 PST 2011


Hello,


Andrei Alexandrescu wrote:
"Generally laziness may be a tactics you could use to help with memory use. A 
good example is split() vs. splitter(). The "er" version offers one element at 
a time thus never forcing an allocation. The split() version must do all work 
upfront and also allocate a structure for depositing the output."
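
As an illustration of the above (this little example is mine, not Andrei's):
split() builds the whole result array up front, while splitter() yields one
slice at a time.

import std.stdio;
import std.array : split;         // eager: computes and allocates the whole result
import std.algorithm : splitter;  // lazy: delivers one slice on demand

void main()
{
    string text = "one,two,three";

    string[] all = text.split(",");     // all three slices allocated up front
    auto pieces  = text.splitter(",");  // no allocation; slices produced on demand

    foreach (piece; pieces)
        writeln(piece);                 // each slice computed only when asked for
}
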
-------
"Let's restart this. What do you want to do that you can't do with the proposed 
API?"


First, I would love to read about some typical or sample uses of buffered input 
(clearly explained, ideally), especially ones showing why and how buffering is a 
true gain there, both in theory and in practice, and why lower-level control 
over buffering is helpful, if not required.
Here is a real and current use case about which I wonder whether it matches the 
intended use of buffered input ranges: the interface between a lexer and a 
parser. (This is the reason why I have followed this thread closely, while not 
knowing much about the domain.)

Currently, my lexing toolkit constructs lexers that generate a complete lexeme 
stream in one go. How could it be made to work lazily, meaning on demand from 
the parser? (*) Here is the way I see the problem, and how to solve it naively:

This is simple for the subset of language grammars for which --at every 
syntactic level-- the first element (e.g. one lexeme) univocally determines 
which syntactic construct may be expected; just like, at the morphological 
level, for most programming languages the first character univocally determines 
the lexeme type. (Hope I'm clear.) In this case, we only need to read one 
token/lexeme at a time: say, the buffer is a single token.
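
Here is, as a rough sketch, how I picture such a single-token lexer, exposed as 
a plain input range (all names are mine, invented just for illustration):

import std.ascii : isDigit, isWhite;

struct Token { string type; string text; }

// Toy lexer exposed as an input range: only the current lexeme is buffered,
// which is enough when one lexeme of lookahead decides which construct follows.
struct Lexer
{
    string source;    // remaining input
    Token current;    // the single-token "buffer"
    bool exhausted;

    this(string source)
    {
        this.source = source;
        popFront();                      // prime the buffer with the first lexeme
    }

    @property bool empty() { return exhausted; }
    @property Token front() { return current; }

    void popFront()
    {
        while (source.length && isWhite(source[0]))
            source = source[1 .. $];     // skip whitespace
        if (!source.length) { exhausted = true; return; }

        // the first character univocally determines the lexeme type
        string type = isDigit(source[0]) ? "number" : "word";
        size_t end = 1;
        while (end < source.length && !isWhite(source[end]))
            ++end;
        current = Token(type, source[0 .. end]);
        source  = source[end .. $];
    }
}

The parser then just looks at front and calls popFront() when it consumes a 
lexeme; nothing is kept besides the current one.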

In the general case, things are somewhat more complicated. I won't go into the 
details (see e.g. http://en.wikipedia.org/wiki/Lookahead & 
http://en.wikipedia.org/wiki/Backtracking) of how and why, but we need to be 
able to look ahead and to backtrack when a choice pattern's sub-patterns are 
composite. Not only that, but --if I do not fool myself-- the amount of 
lookahead required cannot, in some cases, be bounded in advance (due to 
recursion): for instance, when two alternatives share a prefix that can nest 
recursively, that prefix can grow arbitrarily long, and the alternatives cannot 
be told apart before it has been read entirely. In some extreme cases we may 
even need to backtrack over everything matched so far (if the top pattern is a 
choice). This seems to mean an unlimited buffer.

Thus, here is how I see the lexer <--> parser interface (a sketch follows the 
list):
* One or more match function(s), mainly checking the current lexeme's type, 
possibly giving some more info or performing some useful actions.
* store() command: on the first call, start recording matched lexemes into a 
buffer; push the current cursor in the buffer onto a stack [0, i, j, c, ...].
* unstore() command: pop the last cursor c; either validate (the recorded 
lexemes are not needed anymore) or backtrack (rewind to c).
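
To make this concrete, here is a rough sketch of such a buffer as I imagine it, 
sitting between a lazy lexer (like the one sketched above) and the parser; all 
the names are invented, and this is only one way to realise the idea:

struct Token { string type; string text; }   // as in the lexer sketch above

// Parser-side token buffer with marking and backtracking.
// TokenSource stands for any lazy lexer delivering one lexeme on demand.
struct TokenStream(TokenSource)
{
    TokenSource source;   // the lazy lexer
    Token[] buffer;       // lexemes kept around for possible backtracking
    size_t pos;           // cursor: how many buffered lexemes the parser consumed
    size_t[] marks;       // stack of cursors [0, i, j, c, ...]

    // Check the current lexeme's type; advance the cursor on success.
    bool match(string type)
    {
        if (pos == buffer.length)        // need one more lexeme from the lexer
        {
            if (source.empty) return false;
            buffer ~= source.front;
            source.popFront();
        }
        if (buffer[pos].type != type) return false;
        ++pos;
        return true;
    }

    // store(): remember the current position, so that we can come back to it.
    void store() { marks ~= pos; }

    // unstore(): pop the last cursor c; on failure rewind to it (the lexemes
    // from c on stay buffered and will be re-delivered); on success, once no
    // outer mark remains, the consumed lexemes can be discarded.
    void unstore(bool success)
    {
        size_t c = marks[$ - 1];
        marks = marks[0 .. $ - 1];
        if (!success)
            pos = c;                     // backtrack
        else if (!marks.length)
        {
            buffer = buffer[pos .. $];   // nobody needs the consumed lexemes anymore
            pos = 0;
        }
    }
}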

This is to be used in the following way (a code sketch follows the list):
1. When starting to match a composite pattern which is a sub-pattern of a 
choice, meaning we need lookahead and may need backtracking, call store().
2. When matching this pattern ends, either by success or failure, call unstore().
     ~ success: the matched tokens are validated and have produced one or more
       AST node(s), thus we do not need them in the buffer anymore.
     ~ failure: we need to backtrack to where matching this pattern started,
       to try matching an alternative pattern. (**)
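
In code, using the TokenStream sketch above, matching a choice whose 
sub-patterns are composite would look roughly like this (the toy grammar and 
all names are again hypothetical):

// Toy choice where one lexeme of lookahead is not enough:
//   choice := A | B      A := word number      B := word word
bool parseA(TS)(ref TS ts) { return ts.match("word") && ts.match("number"); }
bool parseB(TS)(ref TS ts) { return ts.match("word") && ts.match("word"); }

bool parseChoice(TS)(ref TS ts)
{
    ts.store();            // 1. composite sub-pattern of a choice: we may backtrack
    bool ok = parseA(ts);
    ts.unstore(ok);        // 2. success: lexemes validated; failure: rewind to the mark
    if (ok) return true;

    ts.store();            // try the alternative from the very same position
    ok = parseB(ts);
    ts.unstore(ok);
    return ok;
}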

The possible need to keep in the buffer more than the current composite 
pattern's tokens, and to maintain a stack of cursors, is (as you can guess) due 
to pattern nesting and to self- or mutual recursion. When unstoring a pattern's 
tokens, be it after validation or in order to backtrack, there may be a 
higher-level pattern still under trial which also requires lookahead.
If the top pattern is a choice and some of its sub-patterns are composite, then 
the whole stream of matched tokens needs to be kept in the buffer until parsing 
of the whole source succeeds or fails. I guess.


Does this have anything to do with the buffered input ranges currently under 
discussion? If yes, how does such a design, or any alternative, fit their 
proposed interface?


Denis

(*) I doubt there is any great advantage to be expected from avoiding lexing 
the whole source (typically one module file) in one go. I am rather exploring 
the topic for its own sake.

(**) There may be match memoisation (e.g. so-called "packrat parsing") to avoid 
re-matching the same pattern type at the same position, but this is another 
story, and it rather happens at a higher level than individual tokens as such.
-- 
_________________
vita es estrany
spir.wikidot.com


