std.data.json formal review

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Mon Aug 17 15:21:57 PDT 2015


On 7/28/15 10:07 AM, Atila Neves wrote:
> Start of the two week process, folks.
>
> Code: https://github.com/s-ludwig/std_data_json
> Docs: http://s-ludwig.github.io/std_data_json/
>
> Atila

I'll preface my review with a general comment. This API comes at an 
interesting juncture; we're striving as much as possible for interfaces 
that abstract away lifetime management, so they can be used comfortably 
with the GC or, at high performance (and hopefully with no or only 
marginal loss of comfort), with client-chosen lifetime management 
policies.

The JSON API is a great test bed for our emerging recommended "push 
lifetime up" idioms; it's not too complicated yet it's not trivial 
either, and has great usefulness.

With this, here are some points:

* All new stuff should go in std.experimental. I assume "stdx" would 
change to that, should this work be merged.

* On the face of it, dedicating 6 modules to such a small specification 
as JSON seems excessive. I'm thinking one module here. (As a simple 
point: who would ever want to import only foundation, which in turn has 
one exception type and one location type in it?) I think it shouldn't be 
up for debate that we must aim for simple and clean APIs.

* stdx.data.json.generator: I think the API for converting in-memory 
JSON values to strings needs to be redone, as follows:

- JSONValue should offer a byToken range, which offers the contents of 
the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '[' 
token followed by three numeric tokens with the respective values 
followed by the ']' token.

- On top of byToken it's straightforward to implement a method (say 
toJSON or toString) that accepts an output range of characters and 
formatting options.

- On top of the method above with output range, implementing a toString 
overload that returns a string for convenience is a two-liner. However, 
it shouldn't return a "string"; Phobos APIs should avoid "hardcoding" 
the string type. Instead, it should return a user-chosen string type 
(including reference-counted strings).

- While at it, make prettification a flag in the options, not its own 
part of the function name.
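
To make the proposed shape concrete, here is a minimal sketch of the 
layering (byToken is elided; GeneratorOptions, toJSON, and the stand-in 
Value type are all hypothetical, not the current API):

    import std.array : appender;
    import std.format : formattedWrite;
    import std.range.primitives : isOutputRange, put;

    // Hypothetical options bag: prettification is a flag, not a function name.
    struct GeneratorOptions
    {
        bool pretty = false;
        // indentation width, escaping policy, etc. would also live here
    }

    // Stand-in value type, just to show the layering; the real JSONValue
    // would expose byToken, and toJSON would be written against that.
    struct Value
    {
        long[] numbers;

        // Core primitive: write into any output range of characters.
        void toJSON(Out)(ref Out sink, GeneratorOptions opt = GeneratorOptions.init) const
            if (isOutputRange!(Out, char))
        {
            put(sink, '[');
            foreach (i, n; numbers)
            {
                if (i) put(sink, opt.pretty ? ", " : ",");
                sink.formattedWrite("%s", n);
            }
            put(sink, ']');
        }

        // Convenience layer: a two-liner parameterized on the result
        // string type instead of hardcoding string.
        S toString(S = string)(GeneratorOptions opt = GeneratorOptions.init) const
        {
            auto app = appender!S();
            toJSON(app, opt);
            return app.data;
        }
    }

The allocation-free path is value.toJSON(sink, opt); value.toString() 
remains as a convenience on top.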

* stdx.data.json.lexer:

- I assume the idea was to accept ranges of integrals to mean "there's 
some raw input from a file". This seems to be a bit overdone, e.g. 
there's no need to accept signed integers or 64-bit integers. I suggest 
just going with the three character types.

- I see tokenization accepts input ranges. This forces the tokenizer to 
store its own copy of things, which is no doubt the business of 
appenderFactory. Here the departure of the current approach from what I 
think should become canonical Phobos APIs deepens, for multiple reasons. 
First, appenderFactory does allow customization of the append operation 
(nice), but that's not enough to let the user customize the lifetime of 
the created strings, which is usually reflected in the string type 
itself. So the lexing function should be parameterized by the string 
type used. (The default of string, as now, should be fine.) In other 
words, instead of customizing the append method, customize the string 
type used in the token.

- The lexer should internally take optimization opportunities, e.g. if 
the input type is "string" and the token string type is also "string", 
just use slices of the input as payloads instead of appending copies to 
the tokens.
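
For concreteness, a hypothetical signature along these lines would 
capture both points above as well as the "three character types" 
restriction (lexJSON, JSONToken, and the fast path are assumptions, not 
the current API):

    import std.range.primitives : ElementType, isInputRange;
    import std.traits : isSomeChar, isSomeString;

    // The token's string type is a template parameter; input is restricted
    // to ranges of characters (no signed or 64-bit integral elements).
    auto lexJSON(String = string, Input)(Input input)
        if (isSomeString!Input ||
            (isInputRange!Input && isSomeChar!(ElementType!Input)))
    {
        static if (isSomeString!Input && is(Input == String))
        {
            // Fast path: token payloads are slices of the input - no copying,
            // no appending, no allocation.
        }
        else
        {
            // General path: accumulate payloads into String-typed buffers.
        }
        // ... return a lazy range of JSONToken!String ...
    }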

- As a consequence the JSONToken type also needs to be parameterized by 
the type of its string that holds the payload. I understand this is a 
complication compared to the current approach, but I don't see an out. 
In the grand scheme of things it seems a necessary evil: tokens may or 
may not need a means to manage lifetime of their payload, and that's 
determined by the type of the payload. Hopefully simplifications in 
other areas of the API would offset this.

- At token level there should be no number parsing. Just store the 
payload with the token and leave it for later. Very often numbers are 
converted without there being a need, and the process is costly. This 
also nicely sidesteps the entire matter of bigints, floating point etc. 
at this level.

- Also, at token level strings should be stored with escapes unresolved. 
If the user wants a string with the escapes resolved, a lazy range does it.
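
Putting the last three points together, a token could look roughly like 
this (names hypothetical; the important parts are that String is a 
parameter, numbers stay unparsed, and string payloads keep their 
escapes):

    import std.conv : to;

    struct JSONToken(String = string)
    {
        enum Kind
        {
            objectStart, objectEnd, arrayStart, arrayEnd, colon, comma,
            string_, number, true_, false_, null_, error
        }

        Kind kind;
        String rawPayload;   // exact input text: digits for numbers,
                             // still-escaped contents for strings

        // Deferred conversion: a number is parsed only if and when asked
        // for (to!double is illustrative; the exact conversion policy is a
        // separate question).
        @property double number() { return rawPayload.to!double; }

        // A lazy unescaping range would be layered on top of rawPayload
        // for clients that want the resolved string.
    }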

- Validating UTF is tricky; I've seen some discussion in this thread 
about it. On the face of it JSON only accepts valid UTF characters. As 
such, a modularity-based argument is to pipe UTF validation before 
tokenization. (We need a lazy UTF validator and sanitizer stat!) An 
efficiency-based argument is to do validation during tokenization. I'm 
leaning in favor of modularization, which allows us to focus on one 
thing at a time and do it well, instead of duplicating validation
everywhere. Note that it's easy to write routines that do JSON 
tokenization and leave UTF validation for later, so there's a lot of 
flexibility in composing validation with JSONization.
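
As a strawman for the lazy validator piece, here's a deliberately 
simplified sketch - it checks UTF-8 sequence structure only (no overlong 
or surrogate rejection), and all names are made up:

    import std.range.primitives : ElementType, isInputRange;
    import std.utf : UTFException;

    struct ValidateUTF8(R)
        if (isInputRange!R &&
            (is(ElementType!R : char) || is(ElementType!R : ubyte)))
    {
        private R input;
        private uint pending;   // continuation bytes still expected

        @property bool empty()
        {
            if (!input.empty) return false;
            check(pending == 0);   // catch a truncated trailing sequence
            return true;
        }

        @property char front() { return cast(char) input.front; }

        void popFront()
        {
            // Each byte is checked as it is consumed.
            immutable ubyte c = cast(ubyte) input.front;
            if (pending) { check((c & 0xC0) == 0x80); --pending; }
            else if (c < 0x80) { }                      // ASCII
            else if ((c & 0xE0) == 0xC0) pending = 1;   // 2-byte sequence
            else if ((c & 0xF0) == 0xE0) pending = 2;   // 3-byte sequence
            else if ((c & 0xF8) == 0xF0) pending = 3;   // 4-byte sequence
            else check(false);                          // invalid lead byte
            input.popFront();
        }

        private static void check(bool ok)
        {
            if (!ok) throw new UTFException("invalid UTF-8 sequence");
        }
    }

    auto validateUTF8(R)(R input) { return ValidateUTF8!R(input); }

Composition then reads rawInput.validateUTF8.lexJSON!string() rather 
than baking validation into the lexer.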

- Litmus test: if the input type is a forward range AND if the string 
type chosen for tokens is the same as input type, successful 
tokenization should allocate exactly zero memory. I think this is a 
simple way to make sure that the tokenization API works well.
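
One way to enforce that litmus test mechanically is to let the compiler 
prove it: if the slicing path of the (hypothetical) lexJSON above is 
@nogc, a test like this compiles only when tokenization truly performs 
zero allocations:

    @nogc unittest
    {
        // string in, token payloads are string slices out: no GC activity.
        foreach (tok; lexJSON!string(`{"a": [1, 2.5, "b\n"]}`))
        {
            // inspect tok.kind / tok.rawPayload only; no conversions here
        }
    }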

- If noThrow is a runtime option, some functions can't be nothrow (and 
consequently @nogc). Not sure how important this is - probably quite a 
bit, given the current GC implications of exceptions. IMHO: at lexing
level a sound design might just emit error tokens (with the culprit as 
payload) and never throw. Clients may always throw when they see an 
error token.
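
With the hypothetical JSONToken sketched earlier, that's just an extra 
Kind.error member carrying the offending slice, and throwing becomes the 
client's choice:

    foreach (tok; lexJSON!string(input))   // hypothetical lexer from above
    {
        if (tok.kind == JSONToken!string.Kind.error)
            throw new Exception("lex error near: " ~ tok.rawPayload);
        // ... handle regular tokens ...
    }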

* stdx.data.json.parser:

- Similar considerations regarding string type used apply here as well: 
everything should be parameterized with it - the use case to keep in 
mind is someone who wants everything with refcounted strings.

- The JSON value does its own internal allocation (for e.g. arrays and 
hashtables), which should be fine as long as it's encapsulated and we 
can tweak it later (e.g. make it use reference counting inside).

- parseJSONStream should parameterize on string type, not on 
appenderFactory.

- Why both parseJSONStream and parseJSONValue? I'm thinking 
parseJSONValue would be enough because then you trivially parse a stream 
with repeated calls to parseJSONValue.
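
For instance, the stream flavor can be a thin lazy wrapper over repeated 
parseJSONValue calls; a rough sketch, assuming parseJSONValue consumes 
exactly one value and advances its input (names hypothetical):

    struct JSONValueRange(Input)
    {
        private Input input;
        private JSONValue current;
        private bool done;

        this(Input input) { this.input = input; advance(); }

        @property bool empty() const { return done; }
        @property JSONValue front() { return current; }
        void popFront() { advance(); }

        private void advance()
        {
            // skipping inter-value whitespace elided
            if (input.empty) { done = true; return; }
            current = parseJSONValue(input);   // consumes one complete value
        }
    }

    auto parseJSONStream(Input)(Input input)
    {
        return JSONValueRange!Input(input);
    }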

- FWIW I think the whole thing with accommodating BigInt etc. is an 
exaggeration. Just stick with long and double.

- readArray suddenly introduces a distinct kind of interaction - 
callbacks. Why? It should be a lazy range, a lazy range, a lazy range. 
An adapter using callbacks is then a two-liner.
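
Assuming a range-returning readArray (hypothetical), that adapter really 
is a two-liner:

    // Callback flavor rebuilt on top of a lazy element range.
    void readArrayWithCallback(Parser)(ref Parser parser,
                                       scope void delegate(JSONValue) dg)
    {
        foreach (element; parser.readArray()) dg(element);
    }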

- Why is readBool even needed? Just readJSONValue and then enforce it as 
a bool. Same reasoning applies to readDouble and readString.
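
That is, assuming the value type exposes get!T that throws on a type 
mismatch (as std.variant's Algebraic does), each of these is a one-liner:

    bool readBool(Parser)(ref Parser parser)
    {
        return parser.readJSONValue().get!bool;   // throws on a non-boolean
    }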

- readObject is with callbacks again - it would be nice if it were a 
lazy range.

- skipXxx are nice to have and useful.

* stdx.data.json.value:

- The etymology of "opt" is unclear - no word starting with "opt" or 
obviously abbreviating to it is in the documentation. "opt2" is awkward. 
How about "path" and "dyn", respectively?

- I think Algebraic should be used throughout instead of 
TaggedAlgebraic, or motivation be given for the latter.

- JSONValue should be more opaque and not expose representation as much 
as it does now. In particular, offering a built-in hashtable is bound to 
be problematic because those are expensive to construct, create garbage, 
and are not customizable. Instead, the necessary lookup and set APIs 
should be provided by JSONValue whilst keeping the implementation 
hidden. The same goes for arrays - the raw JSONValue[] should not be 
exposed; instead, indexed access primitives should be exposed. Separate 
types 
might be offered (e.g. JSONArray, JSONDictionary) if deemed necessary. 
The string type should be a type parameter of JSONValue.
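
A sketch of the surface I'd aim for (hypothetical; representation and 
bodies deliberately elided behind stubs):

    struct JSONValue(String = string)
    {
        // Hidden representation: tagged union over null/bool/number/String,
        // plus private array and map storage that callers never see.

        // Object-style access
        JSONValue* opBinaryRight(string op : "in")(String key) { assert(0, "sketch"); }
        ref JSONValue opIndex(String key) { assert(0, "sketch"); }
        void opIndexAssign(JSONValue value, String key) { assert(0, "sketch"); }

        // Array-style access
        ref JSONValue opIndex(size_t index) { assert(0, "sketch"); }
        @property size_t length() const { assert(0, "sketch"); }
    }

Lookup then reads if (auto p = "key" in value) ..., element access reads 
value[0], and assignment reads value["key"] = ..., without ever handing 
out the hashtable or the array.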

==============================

So, here we are. I realize a good chunk of this is surprising ("you mean 
I shouldn't create strings in my APIs?"). My point here is, again, we're 
at a juncture. We're trying to factor garbage (heh) out of API design in 
ways that defer the lifetime management to the user of the API.

We could pull json into std.experimental and defer the policy decisions 
for later, but I think it's a great driver for them. (Thanks Sönke for 
doing all the work, this is a great baseline.) I think we should use the 
JSON API as a guinea pig for the new era of D API design in which we 
have a solid set of principles, tools, and guidelines to defer lifetime 
management. Please advise.



Andrei


