std.jgrandson

Sun Aug 3 13:40:46 PDT 2014

On Sunday, 3 August 2014 at 17:40:48 UTC, Andrei Alexandrescu 
wrote:
> On 8/3/14, 10:19 AM, Sean Kelly wrote:
>> I don't want to pay for anything I don't use. No allocations
>> should occur within the parser and it should simply slice up
>> the input.
>
> What to do about arrays and objects, which would naturally 
> allocate arrays and associative arrays respectively? What about 
> strings with backslash-encoded characters?

This is tricky with a range. With an event-based parser I'd have 
events for object and array begin/end, but with a range each 
element ends up being a token, which is pretty weird. For 
encoded characters (and you need to make sure your decoder 
handles surrogate pairs) I'd still provide some means of 
decoding on demand. If nothing else, decode lazily when the user 
asks for the string value. That way the user isn't paying to 
decode strings they aren't interested in.
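
A minimal sketch of the decode-on-demand idea, written in Python rather than D purely for brevity. The class name `LazyJsonString` and its methods are hypothetical and not part of std.jgrandson or any existing library. It assumes the parser has already validated the raw text between the quotes, so the decoder does no error checking of its own.

```python
class LazyJsonString:
    """Holds the raw text between the quotes; decodes only when asked.

    Hypothetical illustration: assumes `raw` was already validated.
    """

    _SIMPLE = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
               'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}

    def __init__(self, raw):
        self.raw = raw          # untouched input text
        self._decoded = None    # filled in on first request

    def value(self):
        # Decoding cost is paid once, and only if someone asks.
        if self._decoded is None:
            self._decoded = self._decode(self.raw)
        return self._decoded

    @classmethod
    def _decode(cls, raw):
        if '\\' not in raw:
            return raw          # fast path: nothing to decode
        out = []
        i, n = 0, len(raw)
        while i < n:
            ch = raw[i]
            if ch != '\\':
                out.append(ch)
                i += 1
                continue
            esc = raw[i + 1]
            if esc != 'u':
                out.append(cls._SIMPLE[esc])
                i += 2
                continue
            code = int(raw[i + 2:i + 6], 16)
            i += 6
            # A high surrogate followed by \uXXXX holding a low surrogate
            # combines into one code point above U+FFFF.
            if 0xD800 <= code <= 0xDBFF and raw[i:i + 2] == '\\u':
                low = int(raw[i + 2:i + 6], 16)
                if 0xDC00 <= low <= 0xDFFF:
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            out.append(chr(code))
        return ''.join(out)
```

A string that is never queried costs nothing beyond holding the reference, and a string without backslashes is returned as-is.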

> No allocation works for tokenization, but parsing is a whole 
> different matter.
>
>> So the lowest layer should allow me to iterate across symbols
>> in some way.
>
> Yah, that would be the tokenizer.

But that will halt on comma and colon and such, correct?  That's 
a tad lower than I'd want, though I guess it would be easy enough 
to build a parser on top of it.
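
To make the layering concrete, here is a rough sketch in Python (not D) of a tokenizer that does stop on commas and colons, with a thin layer on top that hides them. The names `tokens` and `events` are hypothetical. Offsets into the input stand in for D slices, since Python string slices copy. Only string termination is checked here; full validation is left out.

```python
PUNCT = '{}[],:'
SPACE = ' \t\r\n'

def tokens(text):
    """Yield (kind, start, end) offsets into text; nothing is decoded."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in SPACE:
            i += 1
        elif ch in PUNCT:
            yield (ch, i, i + 1)
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] != '"':
                # Skip the character after a backslash so \" does not end the string.
                j += 2 if text[j] == '\\' else 1
            if j >= n:
                raise ValueError('unterminated string at offset %d' % i)
            yield ('string', i + 1, j)   # offsets exclude the quotes
            i = j + 1
        else:
            # Numbers, true, false, null: run until whitespace or punctuation.
            j = i
            while j < n and text[j] not in SPACE and text[j] not in PUNCT:
                j += 1
            yield ('scalar', i, j)
            i = j

def events(text):
    """Layer on top of tokens(): drop separators, keep begin/end and values."""
    for kind, start, end in tokens(text):
        if kind not in (',', ':'):
            yield (kind, start, end)
```

For example, `[k for k, _, _ in events('{"a": [1, "x"]}')]` gives the begin/end markers and values without the separators.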

>> When I've done this in the past it was SAX-style (i.e. a
>> callback per type) but with the range interface that shouldn't
>> be necessary.
>>
>> The parser shouldn't decode or convert anything unless I ask
>> it to. Most of the time I only care about specific values, and
>> paying for conversions on everything is wasted processing time.
>
> That's tricky. Once you scan for 2 specific characters you may 
> as well scan for a couple more; the added cost is negligible. 
> In contrast, scanning once to find the terminator and then 
> again to decode will definitely be a lot more expensive.

I think I'm getting a bit confused. For the JSON parser I wrote, 
the parser performs full validation but leaves the content as-is, 
then provides a routine to decode values from their string 
representation if the user wishes to. I'm not sure where scanning 
figures in here.
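
As an illustration of "validate fully, leave the content as-is, decode on request", here is a small Python sketch for numbers. The `RawNumber` class is hypothetical; the pattern follows the JSON number grammar. Validation happens once at construction, and the conversion to `int` or `float` is a separate step that runs only if the caller wants it.

```python
import re

# JSON number grammar: optional minus, integer part without leading zeros,
# optional fraction, optional exponent.
_NUMBER = re.compile(r'-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?')

class RawNumber:
    """A validated but unconverted number (hypothetical illustration)."""

    def __init__(self, text, start, end):
        # Validate in place against the input; no substring is created.
        if not _NUMBER.fullmatch(text, start, end):
            raise ValueError('invalid number at offset %d' % start)
        self.text, self.start, self.end = text, start, end

    def raw(self):
        # Original spelling, exactly as it appeared in the input.
        return self.text[self.start:self.end]

    def decode(self):
        # Conversion is a separate step, paid only when requested.
        s = self.raw()
        return int(s) if s.lstrip('-').isdigit() else float(s)
```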
> Andrei