std.data.json formal review
via Digitalmars-d
digitalmars-d at puremagic.com
Tue Aug 18 02:05:32 PDT 2015
On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu
wrote:
> * stdx.data.json.generator: I think the API for converting
> in-memory JSON values to strings needs to be redone, as follows:
>
> - JSONValue should offer a byToken range, which offers the
> contents of the value one token at a time. For example, "[ 1,
> 2, 3 ]" offers the '[' token followed by three numeric tokens
> with the respective values followed by the ']' token.
For iterating tree-like structures, a callback-based approach
seems nicer, because it can naturally use the call stack for
storing its state. (I assume std.concurrency.Generator is too
heavyweight for this case.)
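A minimal sketch of the callback variant, assuming a JSONValue
whose Kind covers array, object and scalar values and a JSONToken
with a few factory helpers (all names illustrative):

    void byTokenVisit(in JSONValue v,
                      scope void delegate(JSONToken) sink)
    {
        final switch (v.kind)
        {
            case JSONValue.Kind.array:
                sink(JSONToken.arrayStart);
                foreach (ref e; v.array) byTokenVisit(e, sink);
                sink(JSONToken.arrayEnd);
                break;
            case JSONValue.Kind.object:
                sink(JSONToken.objectStart);
                foreach (key, ref e; v.object)
                {
                    sink(JSONToken.key(key));
                    byTokenVisit(e, sink);
                }
                sink(JSONToken.objectEnd);
                break;
            case JSONValue.Kind.scalar:
                sink(JSONToken.scalar(v));
                break;
        }
    }

A byToken input range has to keep the same nesting state in an
explicit stack, or run code like the above on a fiber via
std.concurrency.Generator.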
>
> - On top of byToken it's immediate to implement a method (say
> toJSON or toString) that accepts an output range of characters
> and formatting options.
If there really needs to be a range, `joiner` and `copy` should
do the job.
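Roughly like this (byToken and a per-token toString are the
assumed, not-yet-existing pieces; sink is any output range of
characters):

    import std.algorithm : copy, joiner, map;

    value.byToken
         .map!(t => t.toString)
         .joiner
         .copy(sink);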
>
> - On top of the method above with output range, implementing a
> toString overload that returns a string for convenience is a
> two-liner. However, it shouldn't return a "string"; Phobos APIs
> should avoid "hardcoding" the string type. Instead, it should
> return a user-chosen string type (including reference counting
> strings).
`to!string`, for compatibility with std.conv.
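For what it's worth, once the output-range/sink overload of
toString exists, std.conv and std.format pick it up automatically,
so the convenience overload costs next to nothing. A stand-in
example (not the actual JSONValue):

    import std.conv : to;

    struct Example
    {
        int[] values;

        // sink-based toString; std.format recognizes this signature
        void toString(scope void delegate(const(char)[]) sink) const
        {
            import std.format : formattedWrite;
            sink("[");
            foreach (i, v; values)
                sink.formattedWrite(i ? ", %s" : "%s", v);
            sink("]");
        }
    }

    unittest
    {
        assert(Example([1, 2, 3]).to!string == "[1, 2, 3]");
    }

A user-chosen string type would go through the same sink overload,
with the caller supplying their own appender.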
>
> - While at it make prettification a flag in the options, not
> its own part of the function name.
(That's already done.)
>
> * stdx.data.json.lexer:
>
> - I assume the idea was to accept ranges of integrals to mean
> "there's some raw input from a file". This seems to be a bit
> overdone, e.g. there's no need to accept signed integers or
> 64-bit integers. I suggest just going with the three character
> types.
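Something along these lines (lexJSON is a stand-in name):

    import std.range.primitives : ElementEncodingType, isInputRange;
    import std.traits : isSomeChar;

    // accept ranges of char/wchar/dchar only, not arbitrary integrals
    auto lexJSON(Input)(Input input)
        if (isInputRange!Input && isSomeChar!(ElementEncodingType!Input))
    {
        // ... tokenization proper, unchanged ...
    }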
>
> - I see tokenization accepts input ranges. This forces the
> tokenizer to store its own copy of things, which is no doubt
> the business of appenderFactory. Here the departure of the
> current approach from what I think should become canonical
> Phobos APIs deepens for multiple reasons. First,
> appenderFactory does allow customization of the append
> operation (nice) but that's not enough to allow the user to
> customize the lifetime of the created strings, which is usually
> reflected in the string type itself. So the lexing method
> should be parameterized by the string type used. (By default
> string (as is now) should be fine.) Therefore instead of
> customizing the append method just customize the string type
> used in the token.
>
> - The lexer should internally take optimization opportunities,
> e.g. if the string type is "string" and the lexed type is also
> "string", great, just use slices of the input instead of
> appending them to the tokens.
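A sketch of that decision, boiled down to a single helper (names
illustrative; the real lexer would slice input[begin .. end] rather
than the whole input):

    import std.array : appender;
    import std.range.primitives : ElementEncodingType;
    import std.traits : isSomeChar, isSomeString;

    String lexeme(String, R)(R input)
        if (isSomeString!R && isSomeChar!(ElementEncodingType!R))
    {
        static if (is(R == String))
            return input;              // zero-copy: token slices the input
        else
        {
            auto app = appender!String();
            foreach (dchar c; input)   // general path: copy/transcode
                app.put(c);
            return app.data;
        }
    }

    unittest
    {
        string s = `"hello"`;
        assert(lexeme!string(s) is s);          // same memory, no copy
        assert(lexeme!wstring(s) == `"hello"`w);
    }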
>
> - As a consequence the JSONToken type also needs to be
> parameterized by the type of its string that holds the payload.
> I understand this is a complication compared to the current
> approach, but I don't see an out. In the grand scheme of things
> it seems a necessary evil: tokens may or may not need a means
> to manage lifetime of their payload, and that's determined by
> the type of the payload. Hopefully simplifications in other
> areas of the API would offset this.
I've never seen JSON encoded in anything other than UTF-8. Is it
really necessary to complicate everything for such a niche case?
>
> - At token level there should be no number parsing. Just store
> the payload with the token and leave it for later. Very often
> numbers are converted without there being a need, and the
> process is costly. This also nicely sidesteps the entire matter
> of bigints, floating point etc. at this level.
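A sketch of what that could look like at the token level (names
illustrative):

    import std.conv : to;

    struct NumberToken
    {
        string text;   // raw spelling, e.g. "3.14" or "12345678901234567890"

        // convert only on request, into whatever type the caller wants
        T value(T)() const { return text.to!T; }
    }

    unittest
    {
        import std.bigint : BigInt;
        auto t = NumberToken("42");
        assert(t.value!long() == 42);
        assert(t.value!double() == 42.0);
        assert(t.value!BigInt() == 42);
    }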
>
> - Also, at token level strings should be stored with escapes
> unresolved. If the user wants a string with the escapes
> resolved, a lazy range does it.
This was already suggested, and it looks like a good idea, though
there was an objection because of possible performance costs. The
other objection, that it requires an allocation, is no longer
valid if sliceable input is used.
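A sketch of such a lazy range over the raw (still escaped) payload;
only the single-character escapes are shown, \uXXXX is omitted and
well-formed input is assumed:

    import std.range.primitives;

    struct Unescaped(R)
    {
        R src;

        bool empty() { return src.empty; }

        dchar front()
        {
            if (src.front != '\\')
                return src.front;
            auto next = src.save;
            next.popFront();
            switch (next.front)
            {
                case 'n': return '\n';
                case 't': return '\t';
                case 'r': return '\r';
                case 'b': return '\b';
                case 'f': return '\f';
                default:  return next.front;   // \" \\ \/ pass through
            }
        }

        void popFront()
        {
            if (src.front == '\\')
                src.popFront();                // skip the backslash
            src.popFront();
        }
    }

    auto unescaped(R)(R raw) { return Unescaped!R(raw); }

    unittest
    {
        import std.algorithm : equal;
        assert(`line\nbreak`.unescaped.equal("line\nbreak"));
    }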
>
> - Validating UTF is tricky; I've seen some discussion in this
> thread about it. On the face of it JSON only accepts valid UTF
> characters. As such, a modularity-based argument is to pipe UTF
> validation before tokenization. (We need a lazy UTF validator
> and sanitizer stat!) An efficiency-based argument is to do
> validation during tokenization. I'm inclining in favor of
> modularization, which allows us to focus on one thing at a time
> and do it well, instead of duplicating validation everywhere.
> Note that it's easy to write routines that do JSON tokenization
> and leave UTF validation for later, so there's a lot of
> flexibility in composing validation with JSONization.
Well, in an ideal world, there should be no difference in
performance between manually combined tokenization/validation and
composed ranges. We should practice what we preach here.
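For reference, the composed pipeline would read something like
this (validateUTF is the hypothetical lazy validator, lexJSON
stands in for the tokenizer, which then does no UTF checking of
its own):

    import std.algorithm : joiner;
    import std.stdio : File;

    auto tokens = File("data.json")
        .byChunk(4096)
        .joiner          // lazy range of ubyte
        .validateUTF     // hypothetical lazy UTF validator/sanitizer
        .lexJSON;        // tokenization only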
> * stdx.data.json.parser:
>
> - FWIW I think the whole thing with accommodating BigInt etc.
> is an exaggeration. Just stick with long and double.
Or, as above, leave it to the end user and provide a `to(T)`
method that can support built-in types and `BigInt` alike.
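Usage sketch of that suggestion (parseJSONValue, the index
operator and the to method are all hypothetical here):

    import std.bigint : BigInt;

    auto v = parseJSONValue(`{"n": 1, "huge": 12345678901234567890123}`);
    long   a = v["n"].to!long;
    double b = v["n"].to!double;
    BigInt c = v["huge"].to!BigInt;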