std.data.json formal review

Sönke Ludwig via Digitalmars-d <digitalmars-d at puremagic.com>
Sat Aug 22 06:41:47 PDT 2015


On 21.08.2015 at 19:30, Andrei Alexandrescu wrote:
> On 8/18/15 12:54 PM, Sönke Ludwig wrote:
>>> On 18.08.2015 at 00:21, Andrei Alexandrescu wrote:
>>> * On the face of it, dedicating 6 modules to such a small specification
>>> as JSON seems excessive. I'm thinking one module here. (As a simple
>>> point: who would ever want to import only foundation, which in turn has
>>> one exception type and one location type in it?) I think it shouldn't be
>>> up for debate that we must aim for simple and clean APIs.
>>
>> That would mean a single module that is >5k lines long. Spreading out
>> certain things, such as JSONValue, into its own module also makes sense
>> to avoid unnecessarily large imports where other parts of the
>> functionality aren't needed. Maybe we could move some private things to
>> "std.internal" or similar and merge some of the modules?
>
> That would help. My point is it's good design to make the response
> proportional to the problem. 5K lines is not a lot, but reducing those
> 5K in the first place would be a noble pursuit. And btw saving parsing
> time is so C++ :o).

Most lines are needed for tests and documentation. Dropping some 
functionality would of course make the module smaller, too, but there is 
not a lot to take away without making severe compromises in terms of 
actual functionality or usability.

>> But I also think that grouping symbols by topic is a good thing and
>> makes figuring out the API easier. There is also always package.d if you
>> really want to import everything.
>
> Figuring out the API easily is a good goal. The best way to achieve that
> is making the API no larger than necessary.

So, what's your suggestion - remove all read*/skip* functions, for 
example? Make them member functions of JSONParserRange instead of UFCS 
functions? We could of course also just use the pseudo-modules that 
std.algorithm had, where we'd create a table in the documentation for 
each category of functions.

>> Another thing I'd like to add is an output range that takes parser nodes
>> and writes to a string output range. This would be the kind of interface
>> that would be most useful for a serialization framework.
>
> Couldn't that be achieved trivially by e.g. using map!(t => t.toString)
> or similar?
>
> This is the nice thing about rangifying everything - suddenly you have a
> host of tools at your disposal.

No, the idea is to have an output range like so:

     Appender!string dst;
     auto r = JSONNodeOutputRange(&dst);
     r.put(beginArray);
     r.put(1);
     r.put(2);
     r.put(endArray);

This would provide a forward interface for code that has to iterate over 
its input directly, which is the case for a serializer - it can't 
provide an input range interface in a sane way. The alternative would be 
to either let the serializer re-implement all of JSON, or to provide 
just some primitives (a writeJSON() that takes bool, number or string) 
and let the serializer implement the rest of JSON (arrays/objects), 
which includes certain options, such as pretty-printing.
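
For illustration, here is a toy version of that push-style flow. Every 
name in it (NodeSink, Token, Key, serializeStruct) is invented for this 
sketch and not part of the proposal:

    import std.conv : to;

    enum Token { beginObject, endObject }
    struct Key { string name; }

    // Toy sink standing in for the proposed node output range.
    struct NodeSink
    {
        string[] nodes;
        void put(Token t) { nodes ~= (t == Token.beginObject ? "{" : "}"); }
        void put(Key k)   { nodes ~= '"' ~ k.name ~ `":`; }
        void put(T)(T v)  { nodes ~= v.to!string; }
    }

    // A serializer must push nodes as it walks its input - this is the
    // forward/output interface described above.
    void serializeStruct(T)(ref NodeSink dst, T value)
        if (is(T == struct))
    {
        dst.put(Token.beginObject);
        foreach (i, member; value.tupleof)
        {
            dst.put(Key(__traits(identifier, value.tupleof[i])));
            dst.put(member);
        }
        dst.put(Token.endObject);
    }

    unittest
    {
        struct Point { int x, y; }
        NodeSink sink;
        serializeStruct(sink, Point(1, 2));
        assert(sink.nodes == ["{", `"x":`, "1", `"y":`, "2", "}"]);
    }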

>>> - Also, at token level strings should be stored with escapes unresolved.
>>> If the user wants a string with the escapes resolved, a lazy range
>>> does it.
>>
>> To make things efficient, it currently stores escaped strings if slices
>> of the input are used, but stores unescaped strings if allocations are
>> necessary anyway.
>
> That seems a good balance, and probably could be applied to numbers as
> well.

With the difference that numbers stored as numbers never need to 
allocate, so for non-sliceable inputs the compromise is not the same.

What about just offering three (CT-selectable) modes (a rough sketch 
follows the list):
- Always parse as double (parse lazily if slicing can be used) (default)
- Parse double or long (again, lazily if slicing can be used)
- Always store the string representation
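
All names in the following sketch are invented just to make the three 
modes concrete; the lazy, slice-based variant is omitted:

    import std.algorithm.searching : canFind;
    import std.conv : to;
    import std.variant : Algebraic;

    enum NumberMode { alwaysDouble, doubleOrLong, keepString }

    alias Number = Algebraic!(long, double);

    // Hypothetical helper showing the three conversion policies.
    auto convertNumber(NumberMode mode)(string repr)
    {
        static if (mode == NumberMode.keepString)
            return repr;
        else static if (mode == NumberMode.alwaysDouble)
            return repr.to!double;
        else
            return repr.canFind('.') || repr.canFind('e') || repr.canFind('E')
                ? Number(repr.to!double)
                : Number(repr.to!long);
    }

    unittest
    {
        assert(convertNumber!(NumberMode.alwaysDouble)("2") == 2.0);
        assert(convertNumber!(NumberMode.doubleOrLong)("2") == Number(2L));
        assert(convertNumber!(NumberMode.keepString)("2") == "2");
    }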

The question that remains is how to handle this in JSONValue - support 
just double there? Or something like a JSONNumber type that abstracts 
away the differences, but makes writing generic code against JSONValue 
difficult? Or parameterize JSONValue over what it can store, too?

>>> - Validating UTF is tricky; I've seen some discussion in this thread
>>> about it. On the face of it JSON only accepts valid UTF characters. As
>>> such, a modularity-based argument is to pipe UTF validation before
>>> tokenization. (We need a lazy UTF validator and sanitizer stat!) An
>>> efficiency-based argument is to do validation during tokenization. I'm
>>> inclined in favor of modularization, which allows us to focus on one
>>> thing at a time and do it well, instead of duplicating validation
>>> everywhere. Note that it's easy to write routines that do JSON
>>> tokenization and leave UTF validation for later, so there's a lot of
>>> flexibility in composing validation with JSONization.
>>
>> It's unfortunate to see this change of mind in the face of the work that
>> already went into the implementation. I also still think that this is a
>> good optimization opportunity that doesn't really affect the
>> implementation complexity. Validation isn't duplicated, but reused from
>> std.utf.
>
> Well if the validation is reused from std.utf, it can't have been very
> much work. I maintain that separating concerns seems like a good
> strategy here.

There is more to it than the actual call to validate(): writing tests, 
making sure the surroundings work, adjusting the interface, and writing 
documentation. It's not *that* much work, but it's wasted work 
nonetheless.

I also still think that this hasn't been a bad idea at all, because it 
speeds up the most important use case: parsing JSON from a non-memory 
source that hasn't been validated yet. I also very much like the idea of 
making it a programming error to have invalid UTF stored in a string, 
i.e. forcing the validation to happen before the cast from bytes to chars.
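
That part can already be expressed with std.utf today; assumeValidated 
is a made-up name, but validate() is the real Phobos function:

    import std.utf : validate;

    // Confine the bytes-to-chars cast to one audited spot: the data is
    // only handed out as a string after it has passed validation.
    string assumeValidated(immutable(ubyte)[] raw)
    {
        auto text = cast(string) raw;
        validate(text); // throws std.utf.UTFException on invalid UTF-8
        return text;
    }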

>>> - Litmus test: if the input type is a forward range AND if the string
>>> type chosen for tokens is the same as input type, successful
>>> tokenization should allocate exactly zero memory. I think this is a
>>> simple way to make sure that the tokenization API works well.
>>
>> Supporting arbitrary forward ranges doesn't seem to be enough, it would
>> at least have to be combined with something like take(), but then the
>> type doesn't equal the string type anymore. I'd suggest keeping it to
>> "if is sliceable and input type equals string type", at least for the
>> initial version.
>
> I had "take" in mind. Don't forget that "take" automatically uses slices
> wherever applicable. So if you just use typeof(take(...)), you get the
> best of all worlds.
>
> The more restrictive version seems reasonable for the first release.

Okay.
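
For reference, a small demonstration of the point about take() reusing 
slices (as far as I can see, this is how std.range behaves):

    import std.algorithm.iteration : filter;
    import std.range : take;

    void main()
    {
        int[] arr = [1, 2, 3, 4];

        // On a sliceable range, take() returns a plain slice...
        static assert(is(typeof(arr.take(2)) == int[]));

        // ...on a non-sliceable one, it wraps the range instead.
        auto f = arr.filter!(x => x > 1).take(2);
        static assert(!is(typeof(f) == int[]));
    }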

>>> - The JSON value does its own internal allocation (for e.g. arrays and
>>> hashtables), which should be fine as long as it's encapsulated and we
>>> can tweak it later (e.g. make it use reference counting inside).
>>
>> Since it's based on (Tagged)Algebraic, the internal types are part of
>> the interface. Changing them later is bound to break some code. So
>> AFAICS this would require either making the used types parameterized
>> (string, array and AA types), or abstracting them away completely,
>> i.e. only forwarding operations but denying direct access to the type.
>>
>> ... thinking about it, TaggedAlgebraic could do that, while Algebraic
>> can't.
>
> Well if you figure the general Algebraic type is better replaced by a
> type specialized for JSON, fine.
>
> What we shouldn't endorse is two nearly identical library types
> (Algebraic and TaggedAlgebraic) that are only different in subtle
> matters related to performance in certain use patterns.
>
> If integral tags are better for closed type universes, specialize
> Algebraic to use integral tags where applicable.

TaggedAlgebraic would not be a type specialized for JSON! It's useful 
for all kinds of applications and just happens to have some advantages 
here, too.

An (imperfect) idea for merging this with the existing Algebraic name:

template Algebraic(T)
    if (is(T == struct) || is(T == union))
{
    // ... implementation of TaggedAlgebraic ...
}

To avoid the ambiguity with a single type Algebraic, a UDA could be 
required for T to get the actual TaggedAlgebraic behavior.

Everything else would be problematic, because TaggedAlgebraic needs to 
be supplied with names for the different types, so the Algebraic(T...) 
way of specifying the allowed types doesn't really work. And, more 
importantly, exploiting static type information in the generated 
interface means breaking code that is currently built around a Variant 
return value.
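
From memory, this is roughly how those names are supplied today (based 
on the "taggedalgebraic" DUB package; details may differ):

    import taggedalgebraic;

    // The union's *field names* provide the tags - information the
    // Algebraic!(T...) type-list interface has no place for.
    union Payload
    {
        long integer;
        string text;
    }

    alias Value = TaggedAlgebraic!Payload;

    unittest
    {
        auto v = Value("hi");
        assert(v.kind == Value.Kind.text); // tag named after the field
    }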

>>> - Why both parseJSONStream and parseJSONValue? I'm thinking
>>> parseJSONValue would be enough because then you trivially parse a stream
>>> with repeated calls to parseJSONValue.
>>
>> parseJSONStream is the pull parser (StAX style) interface. It returns
>> the contents of a JSON document as individual nodes instead of storing
>> them in a DOM. This part is vital for high-performance parsing,
>> especially of large documents.
>
> So perhaps this is just a naming issue. The names don't suggest
> everything you said. What I see is "parse a JSON stream" and "parse a
> JSON value". So I naturally assumed we're looking at consuming a full
> stream vs. consuming only one value off a stream and stopping. How about
> better names?

parseToJSONValue/parseToJSONStream? parseAsX?

>>> - readArray suddenly introduces a distinct kind of interaction -
>>> callbacks. Why? Should be a lazy range lazy range lazy range. An adapter
>>> using callbacks is then a two-liner.
>>
>> It just has a more complicated implementation, but is already on the
>> TODO list.
>
> Great. Let me say again that with ranges you get to instantly tap into a
> wealth of tools. I say get rid of the callbacks and let a "tee" take
> care of it for whomever needs it.

The callbacks will surely be dropped once ranges become available; 
foreach() should usually be all that is needed.

>>> - Why is readBool even needed? Just readJSONValue and then enforce it as
>>> a bool. Same reasoning applies to readDouble and readString.
>>
>> This is for lower level access; using parseJSONValue would certainly
>> be possible, but it would have quite a bit of unneeded overhead and
>> would also be non-@nogc.
>
> Meh, fine. But all of this is adding weight to the API in the wrong places.

Frankly, I don't think that this is even the wrong place. The pull 
parser interface is the single most important part of the API when we 
talk about allocation-free, high-performance operation. It also carries 
little weight: it's just a small function that joins the other read* 
functions quite naturally and doesn't create any additional cognitive 
load.

>
>>> - readObject is with callbacks again - it would be nice if it were a
>>> lazy range.
>>
>> Okay, is also already on the list.
>
> Awes!

It could return a Tuple!(string, JSONNodeRange). But there should 
probably also be an opApply on the object field range, so that foreach 
(key, value; ...) becomes possible; a rough shape follows below.
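
A toy stand-in (FieldRange is invented, and JSONNodeRange is replaced by 
plain strings) just to show the opApply that enables the two-variable 
foreach:

    import std.typecons : Tuple, tuple;

    struct FieldRange
    {
        Tuple!(string, string)[] fields;

        int opApply(scope int delegate(string key, string value) dg)
        {
            foreach (f; fields)
                if (auto r = dg(f[0], f[1]))
                    return r;
            return 0;
        }
    }

    unittest
    {
        auto fields = FieldRange([tuple("a", "1"), tuple("b", "2")]);
        foreach (key, value; fields)
            assert(key.length && value.length);
    }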

>> But apart from that, algebraic is unfortunately currently quite unsuited
>> for this kind of abstraction, even if that can be solved in theory (with
>> a lot of work). It requires to write things like
>> obj.get!(JSONValue[string])["foo"].get!JSONValue instead of just
>> obj["foo"], because it simply returns Variant from all of its forwarded
>> operators.
>
> Algebraic does not expose opIndex. We could add it to Algebraic such
> that obj["foo"] returns the same type a "this".

https://github.com/D-Programming-Language/phobos/blob/6df5d551fd8a21feef061483c226e7d9b26d6cd4/std/variant.d#L1088
https://github.com/D-Programming-Language/phobos/blob/6df5d551fd8a21feef061483c226e7d9b26d6cd4/std/variant.d#L1348

> It's easy for anyone to say that what's there is unfit for a particular
> purpose. It's also easy for many to define an ever-so-slightly-different
> new artifact that fits a particular purpose. Where you come in as a
> talented hacker is to operate with the understanding of the importance
> of making things work, and make it work.

The problem is that making Algebraic exploit static type information 
means nothing short of a complete reimplementation, which is what 
TaggedAlgebraic is. It also means breaking existing code if, for 
example, alg[0] suddenly returns a string instead of just a Variant with 
a string stored inside.

>>> - JSONValue should be more opaque and not expose representation as much
>>> as it does now. In particular, offering a built-in hashtable is bound to
>>> be problematic because those are expensive to construct, create garbage,
>>> and are not customizable. Instead, the necessary lookup and set APIs
>>> should be provided by JSONValue whilst keeping the implementation
>>> hidden. The same goes for arrays - a JSONValue[] shall not be exposed;
>>> instead, indexed access primitives should be exposed. Separate types
>>> might be offered (e.g. JSONArray, JSONDictionary) if deemed necessary.
>>> The string type should be a type parameter of JSONValue.
>>
>> This would unfortunately at the same time destroy almost all benefits
>> that using (Tagged)Algebraic has, namely that it opens up the
>> possibility of interoperability between different data formats (for
>> example, passing a JSONValue to a BSON generator without letting the
>> BSON generator know about JSON). This is unfortunately an area that I've
>> also not yet properly explored, but I think it's important as we go
>> forward with other data formats.
>
> I think we need to do it. Otherwise we're stuck with "D's JSON API
> cannot be used without the GC". We want to escape that gravitational
> pull. I know it's hard. But it's worth it.

I can't shake the feeling that what Phobos currently has in terms of 
allocators, containers and reference counting is simply not mature 
enough to make a good decision here. Restricting JSONValue as much as 
possible would at least keep open the possibility of extending it later, 
but I think we can and should do better in the long term.

