std.data.json formal review
via Digitalmars-d
digitalmars-d at puremagic.com
Tue Aug 18 02:05:32 PDT 2015
On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu
wrote:
> * stdx.data.json.generator: I think the API for converting
> in-memory JSON values to strings needs to be redone, as follows:
>
> - JSONValue should offer a byToken range, which offers the
> contents of the value one token at a time. For example, "[ 1,
> 2, 3 ]" offers the '[' token followed by three numeric tokens
> with the respective values followed by the ']' token.
For iterating tree-like structures, a callback-based approach
seems nicer, because it can naturally use the call stack for
storing its state. (I assume std.concurrency.Generator is too
heavyweight for this case.)
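A minimal sketch of the callback variant, assuming a JSONValue
whose Kind covers array, object and scalar values and a JSONToken
with a few factory helpers (all names illustrative):

    void byTokenVisit(in JSONValue v,
                      scope void delegate(JSONToken) sink)
    {
        final switch (v.kind)
        {
            case JSONValue.Kind.array:
                sink(JSONToken.arrayStart);
                foreach (ref e; v.array) byTokenVisit(e, sink);
                sink(JSONToken.arrayEnd);
                break;
            case JSONValue.Kind.object:
                sink(JSONToken.objectStart);
                foreach (key, ref e; v.object)
                {
                    sink(JSONToken.key(key));
                    byTokenVisit(e, sink);
                }
                sink(JSONToken.objectEnd);
                break;
            case JSONValue.Kind.scalar:
                sink(JSONToken.scalar(v));
                break;
        }
    }

A byToken input range has to keep the same nesting state in an
explicit stack, or run code like the above on a fiber via
std.concurrency.Generator.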
>
> - On top of byToken it's immediate to implement a method (say
> toJSON or toString) that accepts an output range of characters
> and formatting options.
If there really needs to be a range, `joiner` and `copy` should
do the job.
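Roughly like this (byToken and a per-token toString are the
assumed, not-yet-existing pieces; sink is any output range of
characters):

    import std.algorithm : copy, joiner, map;

    value.byToken
         .map!(t => t.toString)
         .joiner
         .copy(sink);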
>
> - On top of the method above with output range, implementing a
> toString overload that returns a string for convenience is a
> two-liner. However, it shouldn't return a "string"; Phobos APIs
> should avoid "hardcoding" the string type. Instead, it should
> return a user-chosen string type (including reference counting
> strings).
`to!string`, for compatibility with std.conv.
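For what it's worth, once the output-range/sink overload of
toString exists, std.conv and std.format pick it up automatically,
so the convenience overload costs next to nothing. A stand-in
example (not the actual JSONValue):

    import std.conv : to;

    struct Example
    {
        int[] values;

        // sink-based toString; std.format recognizes this signature
        void toString(scope void delegate(const(char)[]) sink) const
        {
            import std.format : formattedWrite;
            sink("[");
            foreach (i, v; values)
                sink.formattedWrite(i ? ", %s" : "%s", v);
            sink("]");
        }
    }

    unittest
    {
        assert(Example([1, 2, 3]).to!string == "[1, 2, 3]");
    }

A user-chosen string type would go through the same sink overload,
with the caller supplying their own appender.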
>
> - While at it make prettification a flag in the options, not
> its own part of the function name.
(That's already done.)
>
> * stdx.data.json.lexer:
>
> - I assume the idea was to accept ranges of integrals to mean
> "there's some raw input from a file". This seems to be a bit
> overdone, e.g. there's no need to accept signed integers or
> 64-bit integers. I suggest just going with the three character
> types.
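Something along these lines (lexJSON is a stand-in name):

    import std.range.primitives : ElementEncodingType, isInputRange;
    import std.traits : isSomeChar;

    // accept ranges of char/wchar/dchar only, not arbitrary integrals
    auto lexJSON(Input)(Input input)
        if (isInputRange!Input && isSomeChar!(ElementEncodingType!Input))
    {
        // ... tokenization proper, unchanged ...
    }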
>
> - I see tokenization accepts input ranges. This forces the
> tokenizer to store its own copy of things, which is no doubt
> the business of appenderFactory. Here the departure of the
> current approach from what I think should become canonical
> Phobos APIs deepens for multiple reasons. First,
> appenderFactory does allow customization of the append
> operation (nice) but that's not enough to allow the user to
> customize the lifetime of the created strings, which is usually
> reflected in the string type itself. So the lexing method
> should be parameterized by the string type used. (By default
> string (as is now) should be fine.) Therefore instead of
> customizing the append method just customize the string type
> used in the token.
>
> - The lexer should internally take optimization opportunities,
> e.g. if the string type is "string" and the lexed type is also
> "string", great, just use slices of the input instead of
> appending them to the tokens.
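A sketch of that decision, boiled down to a single helper (names
illustrative; the real lexer would slice input[begin .. end] rather
than the whole input):

    import std.array : appender;
    import std.range.primitives : ElementEncodingType;
    import std.traits : isSomeChar, isSomeString;

    String lexeme(String, R)(R input)
        if (isSomeString!R && isSomeChar!(ElementEncodingType!R))
    {
        static if (is(R == String))
            return input;              // zero-copy: token slices the input
        else
        {
            auto app = appender!String();
            foreach (dchar c; input)   // general path: copy/transcode
                app.put(c);
            return app.data;
        }
    }

    unittest
    {
        string s = `"hello"`;
        assert(lexeme!string(s) is s);          // same memory, no copy
        assert(lexeme!wstring(s) == `"hello"`w);
    }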
>
> - As a consequence the JSONToken type also needs to be
> parameterized by the type of its string that holds the payload.
> I understand this is a complication compared to the current
> approach, but I don't see an out. In the grand scheme of things
> it seems a necessary evil: tokens may or may not need a means
> to manage lifetime of their payload, and that's determined by
> the type of the payload. Hopefully simplifications in other
> areas of the API would offset this.
I've never seen JSON encoded in anything other than UTF-8. Is it
really necessary to complicate everything for such a niche case?
>
> - At token level there should be no number parsing. Just store
> the payload with the token and leave it for later. Very often
> numbers are converted without there being a need, and the
> process is costly. This also nicely sidesteps the entire matter
> of bigints, floating point etc. at this level.
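A sketch of what that could look like at the token level (names
illustrative):

    import std.conv : to;

    struct NumberToken
    {
        string text;   // raw spelling, e.g. "3.14" or "12345678901234567890"

        // convert only on request, into whatever type the caller wants
        T value(T)() const { return text.to!T; }
    }

    unittest
    {
        import std.bigint : BigInt;
        auto t = NumberToken("42");
        assert(t.value!long() == 42);
        assert(t.value!double() == 42.0);
        assert(t.value!BigInt() == 42);
    }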
>
> - Also, at token level strings should be stored with escapes
> unresolved. If the user wants a string with the escapes
> resolved, a lazy range does it.
This was already suggested, and it looks like a good idea, though
there was an objection because of possible performance costs. The
other objection, that it requires an allocation, is no longer
valid if sliceable input is used.
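A sketch of such a lazy range over the raw (still escaped) payload;
only the single-character escapes are shown, \uXXXX is omitted and
well-formed input is assumed:

    import std.range.primitives;

    struct Unescaped(R)
    {
        R src;

        bool empty() { return src.empty; }

        dchar front()
        {
            if (src.front != '\\')
                return src.front;
            auto next = src.save;
            next.popFront();
            switch (next.front)
            {
                case 'n': return '\n';
                case 't': return '\t';
                case 'r': return '\r';
                case 'b': return '\b';
                case 'f': return '\f';
                default:  return next.front;   // \" \\ \/ pass through
            }
        }

        void popFront()
        {
            if (src.front == '\\')
                src.popFront();                // skip the backslash
            src.popFront();
        }
    }

    auto unescaped(R)(R raw) { return Unescaped!R(raw); }

    unittest
    {
        import std.algorithm : equal;
        assert(`line\nbreak`.unescaped.equal("line\nbreak"));
    }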
>
> - Validating UTF is tricky; I've seen some discussion in this
> thread about it. On the face of it JSON only accepts valid UTF
> characters. As such, a modularity-based argument is to pipe UTF
> validation before tokenization. (We need a lazy UTF validator
> and sanitizer stat!) An efficiency-based argument is to do
> validation during tokenization. I'm inclining in favor of
> modularization, which allows us to focus on one thing at a time
> and do it well, instead of duplicating validation everywhere.
> Note that it's easy to write routines that do JSON tokenization
> and leave UTF validation for later, so there's a lot of
> flexibility in composing validation with JSONization.
Well, in an ideal world, there should be no difference in
performance between manually combined tokenization/validation and
composed ranges. We should practice what we preach here.
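For reference, the composed pipeline would read something like
this (validateUTF is the hypothetical lazy validator, lexJSON
stands in for the tokenizer, which then does no UTF checking of
its own):

    import std.algorithm : joiner;
    import std.stdio : File;

    auto tokens = File("data.json")
        .byChunk(4096)
        .joiner          // lazy range of ubyte
        .validateUTF     // hypothetical lazy UTF validator/sanitizer
        .lexJSON;        // tokenization only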
> * stdx.data.json.parser:
>
> - FWIW I think the whole thing with accommodating BigInt etc.
> is an exaggeration. Just stick with long and double.
Or, as above, leave it to the end user and provide a `to(T)`
method that can support built-in types and `BigInt` alike.
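Usage sketch of that suggestion (parseJSONValue, the index
operator and the to method are all hypothetical here):

    import std.bigint : BigInt;

    auto v = parseJSONValue(`{"n": 1, "huge": 12345678901234567890123}`);
    long   a = v["n"].to!long;
    double b = v["n"].to!double;
    BigInt c = v["huge"].to!BigInt;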