std.data.json formal review

Sun Aug 16 05:34:50 PDT 2015

Am 16.08.2015 um 02:50 schrieb Walter Bright:
> On 8/15/2015 3:18 AM, Sönke Ludwig wrote:
>>> I don't know what 'isStringInputRange' is. Whatever it is, it should be
>>> a 'range of char'.
>>
>> I'll rename it to isCharInputRange. We don't have something like that
>> in Phobos,
>> right?
>
> That's right, there isn't one. But I use:
>
>      if (isInputRange!R && is(Unqual!(ElementEncodingType!R) == char))
>
> I'm not a fan of more names for trivia, the deluge of names has its own
> costs.

Good, I'll use `if (isInputRange!R && 
(isSomeChar!(ElementEncodingType!R) || 
isIntegral!(ElementEncodingType!R))`. It's just used in number of places 
and quite a bit more verbose (twice as long) and I guess a large number 
of algorithms in Phobos accept char ranges, so that may actually warrant 
a name in this case.

>>> There is no reason to validate UTF-8 input. The only place where
>>> non-ASCII code units can even legally appear is inside strings, and
>>> there they can just be copied verbatim while looking for the end of the
>>> string.
>> The idea is to assume that any char based input is already valid UTF
>> (as D
>> defines it), while integer based input comes from an unverified
>> source, so that
>> it still has to be validated before being cast/copied into a 'string'.
>> I think
>> this is a sensible approach, both semantically and performance-wise.
>
> The json parser will work fine without doing any validation at all. I've
> been implementing string handling code in Phobos with the idea of doing
> validation only if the algorithm requires it, and only for those parts
> that require it.

Yes, and it won't do that if a char range is passed in. If the integral 
range path gets removed there are basically two possibilities left, 
perform the validation up-front (slower), or risk UTF exceptions in 
unrelated parts of the code base. I don't see why we shouldn't take the 
opportunity for a full and fast validation here. But I'll relay this to 
Andrei, it was his idea originally.

> There are many validation algorithms in Phobos one can tack on - having
> two implementations of every algorithm, one with an embedded reinvented
> validation and one without - is too much.

There is nothing reinvented here. It simply implicitly validates all 
non-string parts of a JSON document and uses validate() for parts of 
JSON strings that can contain unicode characters.

> The general idea with algorithms is that they do not combine things, but
> they enable composition.

It's just that there is no way to achieve the same performance using 
composition in this case.

>>> Why do both? Always return an input range. If the user wants a string,
>>> he can pipe the input range to a string generator, such as .array
>> Convenience for one.
>
> Back to the previous point, that means that every algorithm in Phobos
> should have two versions, one that returns a range and the other a
> string? All these variations will result in a combinatorical explosion.

This may be a factor of two, but not a combinatorial explosion.

> The other problem, of course, is that returning a string means the
> algorithm has to decide how to allocate that string. As much as
> possible, algorithms should not be making allocation decisions.

Granted, the fact that format() and to!() support input ranges (I didn't 
notice that until now) makes the issue less important. But without 
those, it would basically mean that almost all places that generate JSON 
strings would have to import std.array and append .array. Nothing 
particularly bad if viewed in isolation, but makes the language appear a 
lot less clean/more verbose if it occurs often. It's also a stepping 
stone for language newcomers.

>> The lack of number to input range conversion functions is
>> another concern. I'm not really keen to implement an input range style
>> floating-point to string conversion routine just for this module.
>
> Not sure what you mean. Phobos needs such routines anyway, and you still
> have to do something about floating point.

There are output range and allocation based float->string conversions 
available, but no input range based one. But well, using an internal 
buffer together with formattedWrite would probably be a viable workaround...

>> Finally, I'm a little worried about performance. The output range
>> based approach
>> can keep a lot of state implicitly using the program counter register.
>> But an
>> input range would explicitly have to keep track of the current JSON
>> element, as
>> well as the current character/state within that element (and possibly
>> one level
>> deeper, for example for escape sequences). This means that it will
>> require
>> either multiple branches or indirection for each popFront().
>
> Often this is made up for by not needing to allocate storage. Also, that
> state is in the cached "hot zone" on top of the stack, which is much
> faster to access than a cold uninitialized array.

Just branch misprediction will most probably be problematic. But I think 
this can be made fast enough anyway by making the input range partially 
eager and serving chunks of strings at a time. That way, the additional 
branching only has to happen once per chunk. I'll have a look.

> I share your concern with performance, and I had very good results with
> Warp by keeping all the state on the stack in this manner.
>