std.data.json formal review

Tue Aug 25 00:03:13 PDT 2015

On 25.08.2015 at 07:55, Martin Nowak wrote:
> On Saturday, 22 August 2015 at 13:41:49 UTC, Sönke Ludwig wrote:
>> There is more than the actual call to validate(), such as writing
>> tests and making sure the surroundings work, adjusting the interface
>> and writing documentation. It's not *that* much work, but nonetheless
>> wasted work.
>>
>> I also still think that this hasn't been a bad idea at all. Because it
>> speeds up the most important use case, parsing JSON from a non-memory
>> source that has not yet been validated. I also very much like the idea
>> of making it a programming error to have invalid UTF stored in a
>> string, i.e. forcing the validation to happen before the cast from
>> bytes to chars.
>
> Also see "utf/unicode should only be validated once"
> https://issues.dlang.org/show_bug.cgi?id=14919
>
> If combining lexing and validation is faster (why?) then a ubyte
> consuming interface should be available, though why couldn't it be done
> by adding a lazy ubyte->char validator range to std.utf.
> In any case during lexing we should avoid autodecoding of narrow strings
> for redundant validation.

The performance benefit comes from the fact that almost all of the JSON 
grammar is restricted to ASCII, so lexing the input implicitly validates 
it as correct UTF. The only place where actual multi-byte UTF sequences 
can occur is inside string literals, outside of escape sequences. 
Depending on the type of document, that can result in far fewer 
conditionals than a full validation of the input.

Autodecoding is avoided during lexing; everything happens at the code 
unit level.
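
As a rough illustration of the idea (a minimal sketch, not the actual
std.data.json code; `lexStringLiteral` is a hypothetical name), a string
literal can be scanned on raw code units so that the full UTF check only
runs when a non-ASCII code unit actually shows up. It assumes the rest of
the lexer rejects any non-ASCII byte outside of string literals:

```d
import std.utf : validate;

/// Hypothetical helper: lexes one JSON string literal from raw bytes.
/// `input` must start at the opening quote; it is advanced past the
/// closing quote. Returns the (still escaped) contents of the literal.
const(char)[] lexStringLiteral(ref const(ubyte)[] input)
{
    assert(input.length && input[0] == '"');

    size_t i = 1;
    bool sawNonAscii = false;
    while (i < input.length && input[i] != '"')
    {
        // Step onto the escaped code unit so that \" does not end the literal.
        if (input[i] == '\\' && i + 1 < input.length)
            i++;
        // One cheap comparison per code unit instead of a full decode.
        if (input[i] >= 0x80)
            sawNonAscii = true;
        i++;
    }
    if (i >= input.length)
        throw new Exception("unterminated string literal");

    auto s = cast(const(char)[]) input[1 .. i];
    if (sawNonAscii)
        validate(s); // throws UTFException on invalid UTF-8

    input = input[i + 1 .. $];
    return s;
}

unittest
{
    import std.exception : assertThrown;
    import std.string : representation;
    import std.utf : UTFException;

    // Valid two-byte sequence (U+00E9) inside the literal.
    const(ubyte)[] input = "\"h\xC3\xA9llo\", 1".representation;
    assert(lexStringLiteral(input) == "h\xC3\xA9llo");
    assert(input == ", 1".representation);

    // 0xFF can never occur in valid UTF-8.
    const(ubyte)[] bad = [0x22, 0xFF, 0x22];
    assertThrown!UTFException(lexStringLiteral(bad));
}
```

For a document that consists mostly of numbers, keys and punctuation,
`validate` is never called at all, which is where the savings compared
to a separate validation pass would come from.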