RFC: std.json successor

Sönke Ludwig via Digitalmars-d digitalmars-d at puremagic.com
Mon Oct 13 00:39:40 PDT 2014


On 12.10.2014 20:17, Andrei Alexandrescu wrote:
> Here's my destruction of std.data.json.
>
> * lexer.d:
>
> ** Beautifully done. From what I understand, if the input is string or
> immutable(ubyte)[] then the strings are carved out as slices of the
> input, as opposed to newly allocated. Awesome.
>
> ** The string after lexing is correctly scanned and stored in raw format
> (escapes are not rewritten) and decoded on demand. Problem with decoding
> is that it may allocate memory, and it would be great (and not
> difficult) to make the lexer 100% lazy/non-allocating. To achieve that,
> lexer.d should define TWO "Kind"s of strings at the lexer level: regular
> string and undecoded string. The former is lexer.d's way of saying "I
> got lucky" in the sense that it didn't detect any '\\' so the raw and
> decoded strings are identical. No need for anyone to do any further
> processing in the majority of cases => win. The latter means the lexer
> lexed the string, saw at least one '\\', and leaves it to the caller to
> do the actual decoding.

This is actually more or less done in unescapeStringLiteral(): if it 
doesn't find any '\\', it simply returns the original string. JSONString 
also allows access to its .rawValue without doing any decoding or 
allocations.

https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421

Unfortunately .rawValue can't be @nogc, because the "raw" value might 
have to be constructed first when the input is not a "string" (in that 
case unescaping is done on-the-fly for efficiency reasons).


>
> ** After moving the decoding business out of lexer.d, a way to take this
> further would be to qualify lexer methods as @nogc if the input is
> string/immutable(ubyte)[]. I wonder how to implement a conditional
> attribute. We'll probably need a language enhancement for that.

Isn't @nogc inferred? Everything is templated, so that should be 
possible. Or does attribute inference only work for template functions 
and not for methods of templated types? Should it?
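
One way to check is a snippet like the following; if inference does apply to methods of templated aggregates, it compiles (all names here are illustrative, not from std.data.json):

```d
// A method of a templated struct with a non-allocating body, called
// from an explicitly @nogc context. This only compiles if the method's
// @nogc/nothrow attributes were inferred.
struct LexerSketch(Input)
{
    Input input;
    bool empty() { return input.length == 0; } // no allocation in the body
}

@nogc nothrow bool isEmpty(ref LexerSketch!string lx)
{
    return lx.empty(); // fails to compile unless empty() was inferred @nogc
}
```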

>
> ** The implementation uses manually-defined tagged unions for work.
> Could we use Algebraic instead - dogfooding and all that? I recall there
> was a comment in Sönke's original work that Algebraic has a specific
> issue (was it false pointers?) - so the question arises, should we fix
> Algebraic and use it thus helping other uses as well?

I had started on an implementation of a type- and ID-safe TaggedAlgebraic 
that uses Algebraic for its internal storage. If we can get that in 
first, it should be no problem to use it instead (with no or minimal API 
breakage). However, it uses a struct instead of an enum to define the 
"Kind" (which is the only nice way I could conceive of to safely couple 
enum value and type at compile time), so it's not as nice in the 
generated documentation.
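
For comparison, a plain Algebraic-based value type, using std.variant's This placeholder for the recursive cases, could look like this; a sketch only, without the "Kind" coupling discussed above:

```d
import std.variant : Algebraic, This;

// Plain Algebraic sketch: type-safe, but dispatch happens by type
// rather than by an explicit "Kind" enum/struct.
alias JSONValueSketch = Algebraic!(
    typeof(null), bool, double, string,
    This[],       // JSON array
    This[string]  // JSON object
);
```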

>
> ** I see the "boolean" kind, should we instead have the "true_" and
> "false_" kinds?

I always found it cumbersome and awkward to work like that. What would 
be the reason to go that route?

>
> ** Long story short I couldn't find any major issue with this module,
> and I looked! I do think the decoding logic should be moved outside of
> lexer.d or at least the JSONLexerRange.
>
> * generator.d: looking good, no special comments. Like the consistent
> use of structs filled with options as template parameters.
>
> * foundation.d:
>
> ** At four words per token, Location seems pretty bulky. How about
> reducing line and column to uint?

Single-line JSON files >64k (or line counts >64k) are no exception, so 
that would only work in a limited way. My thinking was that it is quite 
unusual to actually store the tokens for most purposes (especially when 
serializing directly to a native D type), so Location should have 
minimal impact on performance or memory consumption.
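
For scale, the difference between the two layouts looks like this (field names are illustrative, not the actual foundation.d definition):

```d
// Location with machine-word fields vs. packed 32-bit fields:
struct LocationWide   { size_t line, column; } // 16 bytes on 64-bit
struct LocationPacked { uint line, column; }   // 8 bytes everywhere

static assert(LocationWide.sizeof == 2 * size_t.sizeof);
static assert(LocationPacked.sizeof == 8);
```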

>
> ** Could JSONException create the message string in toString (i.e.
> when/if used) as opposed to in the constructor?

That could of course be done, but then you wouldn't get the full error 
message using ex.msg, only with ex.toString(), which usually prints a 
call trace instead. Alternatively, it's also possible to completely 
avoid using exceptions with LexOptions.noThrow.
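
The trade-off can be sketched like this: building the detailed message lazily keeps the constructor cheap, but ex.msg then only carries the base message (class and field names are hypothetical):

```d
class LazyJSONException : Exception
{
    private string m_detail;

    this(string msg, string detail,
         string file = __FILE__, size_t line = __LINE__) pure nothrow @nogc @safe
    {
        super(msg, file, line); // no string building in the constructor
        m_detail = detail;
    }

    override string toString()
    {
        // the concatenation (and thus the allocation) happens only here,
        // when/if the full message is actually requested
        return msg ~ ": " ~ m_detail;
    }
}
```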

>
> * parser.d:
>
> ** How about using .init instead of .defaults for options?

I'd slightly prefer the more explicit "defaults", especially because 
"init" could mean either "defaults" or "none" (currently it means 
"none"). But another idea would be to invert the option values so that 
defaults == none... any objections?
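
The inversion idea in a nutshell: if every flag turns a default-on feature *off*, then .init and the defaults coincide (member names are made up for illustration):

```d
enum OptionsSketch
{
    none            = 0,      // equals OptionsSketch.init, i.e. the defaults
    noTrackLocation = 1 << 0, // location tracking is on unless this is set
    noThrow         = 1 << 1  // throwing is on unless this is set
}

static assert(OptionsSketch.init == OptionsSketch.none);
```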

>
> ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End
> markers shouldn't appear as nodes. There should be an "object" node
> only. I guess that's needed for laziness.

While you could infer the end of an object in the parser range by 
looking for the first entry that doesn't start with a "key" node, the 
same would not be possible for arrays, so in general the end marker *is* 
required. Note that the parser range is a StAX-style parser, which is 
still very close to the lexical structure of the document.
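
As an illustration, the node sequence for {"a": [1, 2]} would look roughly like this (the names are schematic, not the actual JSONParserNode API):

```d
// Without the arrayEnd marker, nothing in the flat sequence indicates
// where the array stops; for objects the next non-"key" node would do,
// but only in principle.
immutable string[] nodes = [
    "objectStart",
    `key("a")`,
    "arrayStart",
    "literal(1)",
    "literal(2)",
    "arrayEnd",
    "objectEnd"
];
```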

I was also wondering if there might be a better name than 
"JSONParserNode". It's not really embedded in a tree or graph structure, 
as the name tends to suggest.

>
> ** It's unclear where memory is being allocated in the parser. @nogc
> annotations wherever appropriate would be great.

The problem is that the parser accesses the lexer, which in turn 
accesses the underlying input range, which in turn could allocate. 
Depending on the options passed to the lexer, it could also throw, and 
thus allocate, an exception. In the end only JSONParserRange.empty could 
generally be made @nogc.

However, attribute inference should be possible here in theory (the 
noThrow option is compile-time).

>
> * value.d:
>
> ** Looks like this is/may be the only place where memory is being
> managed, at least if the input is string/immutable(ubyte)[]. Right?

Yes, at least when setting aside optional exceptions and lazy allocations.

>
> ** Algebraic ftw.
>
> ============================
>
> Overall: This is very close to everything I hoped! A bit more care to
> @nogc would be awesome, especially with the upcoming focus on memory
> management going forward.

I've tried to use @nogc (as well as nothrow) in more places, but it 
hasn't really been possible, mostly because there is no way to know 
whether the underlying input range allocates. Even at lower levels 
(private functions), almost any Phobos function that gets called is 
currently not @nogc, for reasons that are not always obvious, so I gave 
up on that for now.

>
> After one more pass it would be great to move forward for review.

There is also one pending change that I haven't finished yet: the 
optional UTF input validation (never validate "string" inputs, but do 
validate "ubyte[]" inputs).

Oh, and there is the open issue of how to allocate in the case of 
non-array inputs. Initially I wanted to wait on this until we have an 
allocators module, but Walter would like to have a way to do manual 
memory management in the initial version. However, the ideal design is 
still unclear to me: it would either simply resemble a general allocator 
interface, or could use something like a callback that returns an output 
range, which would probably be quite cumbersome to work with. Any ideas 
in this direction would be welcome.
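
To make the output-range variant concrete, here is a rough sketch (the function name and shape are hypothetical, not a proposed API, and escape handling is simplified to dropping the backslash):

```d
import std.array : appender;

// The caller supplies an output range; the library writes decoded data
// into it and never allocates on its own.
void unescapeInto(Sink)(string raw, ref Sink sink)
{
    for (size_t i = 0; i < raw.length; i++)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
            i++; // skip the backslash, emit the escaped character verbatim
        sink.put(raw[i]);
    }
}
```

With an Appender as the sink this works, but every call site has to set up and manage its own buffer, which is the cumbersome part mentioned above.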

Sönke

