RFC: std.json successor

Sönke Ludwig via Digitalmars-d digitalmars-d at puremagic.com
Tue Aug 26 00:51:06 PDT 2014


On 25.08.2014 23:53, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang at gmail.com> wrote:
> On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
>> But why should UTF validation be the job of the lexer in the first place?
>
> Because you want to save time; it is faster to integrate validation? The
> most likely use scenario is to receive REST data over HTTP that needs
> validation.
>
> Well, so then I agree with Andrei… array of bytes it is. ;-)
>
>> added as a separate proxy range. But if we end up going for validating
>> in the lexer, it would indeed be enough to validate inside strings,
>> because the rest of the grammar assumes a subset of ASCII.
>
> Not assumes, but defines! :-)

I guess it depends on whether you look at the grammar as productions or 
as comprehensions (right term?) ;)

>
> If you have to validate UTF before lexing then you will end up
> needlessly scanning lots of ascii if the file contains lots of
> non-strings or is from an encoder that only sends pure ascii.

That's true. So the ideal solution would be to *assume* valid UTF-8 when 
the input is char based and to *validate* when the input is "numeric" 
(i.e. a raw byte array).
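To make the "validate only inside strings" idea concrete, here is a minimal C sketch (my own illustration, not part of the proposal): ASCII bytes take a fast path, and only lead bytes >= 0x80 — which can occur solely inside string literals, since the rest of the grammar is an ASCII subset — trigger a sequence check. For brevity it omits the overlong-encoding and surrogate checks a real validator would need.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// Validate one multi-byte UTF-8 sequence starting at p[*i] and advance
// *i past it. Only ever called for lead bytes >= 0x80, i.e. only inside
// JSON strings -- everything else in the grammar is a subset of ASCII.
static bool validate_utf8_sequence(const uint8_t *p, size_t len, size_t *i)
{
    uint8_t b = p[*i];
    size_t n;                        // number of continuation bytes
    if      ((b & 0xE0) == 0xC0) n = 1;
    else if ((b & 0xF0) == 0xE0) n = 2;
    else if ((b & 0xF8) == 0xF0) n = 3;
    else return false;               // invalid lead byte (incl. bare continuation)
    if (*i + n >= len) return false; // truncated sequence
    for (size_t k = 1; k <= n; k++)
        if ((p[*i + k] & 0xC0) != 0x80) return false;
    *i += n + 1;
    return true;
}

// Validate the *contents* of a JSON string (between the quotes):
// ASCII passes through untouched, multi-byte sequences are checked.
bool validate_string_payload(const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; ) {
        if (p[i] < 0x80) { i++; continue; }   // ASCII fast path
        if (!validate_utf8_sequence(p, len, &i)) return false;
    }
    return true;
}
```

With char input the lexer would skip this pass entirely and rely on the type system's guarantee; with ubyte input it would run it per string payload.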

>
> If you want to have "plugin" validation of strings then you also need to
> differentiate strings so that the user can select which data should be
> just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing
> double validation (you have to bypass >7F followed by string-end anyway).
>
> The advantage of integrated validation is that you can use 16 bytes SIMD
> registers on the buffer.
>
> I presume you can load 16 bytes and do BITWISE-AND on the MSB, then
> match against string-end and carefully use this to boost performance of
> simultaneous UTF validation, escape-scanning, and string-end scanning. A bit
> tricky, of course.

Well, that's definitely out of the scope of this proposal. An 
interesting direction to pursue, though.
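For the curious, the 16-byte idea could look roughly like this with SSE2 intrinsics (function name and structure are mine; a sketch, not a tuned implementation): `_mm_movemask_epi8` collects the MSBs directly (non-ASCII bytes), two byte-compares pick up `"` and `\`, and the combined mask tells the scalar code where to take over.

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>

// Scan a string payload in 16-byte chunks. Returns the index of the
// first "special" byte -- non-ASCII (MSB set), '"' (string end) or
// '\\' (escape start) -- or len if the chunk is all plain ASCII.
size_t scan_string_chunk(const uint8_t *p, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        // One bit per byte whose MSB is set, i.e. non-ASCII.
        int nonascii = _mm_movemask_epi8(v);
        int quote  = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_set1_epi8('"')));
        int escape = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_set1_epi8('\\')));
        int special = nonascii | quote | escape;
        if (special)
            return i + (size_t)__builtin_ctz((unsigned)special);
    }
    // Scalar tail for the last < 16 bytes.
    for (; i < len; i++)
        if (p[i] >= 0x80 || p[i] == '"' || p[i] == '\\')
            return i;
    return len;
}
```

The scalar fallback then handles the actual UTF validation, escape decoding, or string termination at the reported index, so the SIMD loop only has to be fast on the common all-ASCII case.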

>> At least no UTF validation is needed. Since all non-ASCII characters
>> will always be composed of bytes >0x7F, a sequence \uXXXX can be
>> assumed to be valid wherever in the string it occurs, and all other
>> bytes that don't belong to an escape sequence are just passed through
>> as-is.
>
> You cannot assume \u… to be valid if you convert it.

I meant "X" to stand for a hex digit. The point was just that you don't 
have to worry about bad interactions with UTF sequences when you 
encounter "\uXXXX".
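As an illustration of why that is safe (my own sketch): all six bytes of a `\uXXXX` escape are ASCII, so a decoder can process it without ever looking at, or colliding with, a multi-byte UTF-8 sequence in the raw input. Surrogate-pair handling (`\uD800`-`\uDFFF`, which needs a second escape) is deliberately left out.

```c
// Decode the four hex digits of a \uXXXX escape into a code unit.
// All six escape bytes are ASCII (< 0x80), so the escape can never be
// confused with part of a multi-byte UTF-8 sequence.
// Returns -1 on a non-hex digit.
int decode_u_escape(const char hex[4])
{
    int v = 0;
    for (int k = 0; k < 4; k++) {
        char c = hex[k];
        int d;
        if      (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        v = (v << 4) | d;
    }
    return v;
}
```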

