RFC: std.json successor

via Digitalmars-d digitalmars-d at puremagic.com
Mon Aug 25 14:53:48 PDT 2014


On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
> But why should UTF validation be the job of the lexer in the 
> first place?

Because it saves time: integrated validation is faster. The most 
likely usage scenario is receiving REST data over HTTP that needs 
validation anyway.

Well, so then I agree with Andrei… array of bytes it is. ;-)

> added as a separate proxy range. But if we end up going for 
> validating in the lexer, it would indeed be enough to validate 
> inside strings, because the rest of the grammar assumes a 
> subset of ASCII.

Not assumes, but defines! :-)

If you have to validate UTF before lexing, then you will end up 
needlessly scanning lots of ASCII if the file contains mostly 
non-string data or comes from an encoder that only emits pure ASCII.
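The point about validating only inside strings can be sketched as 
follows (a minimal illustration in C, not the proposed std.json API; 
the validator is deliberately simplified and would also need to reject 
overlong encodings and surrogates in production):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return the length of a structurally valid UTF-8 sequence at p
   (at most n bytes), or 0 if invalid. Simplified: does not reject
   overlong encodings or encoded surrogates. */
static size_t utf8_len(const uint8_t *p, size_t n) {
    if (p[0] < 0x80) return 1;
    size_t len = p[0] >= 0xF0 ? 4 : p[0] >= 0xE0 ? 3 : p[0] >= 0xC2 ? 2 : 0;
    if (len == 0 || len > n) return 0;
    for (size_t i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;  /* continuation byte? */
    return len;
}

/* Scan a JSON buffer, applying UTF-8 validation only inside string
   literals; outside strings the grammar is a subset of ASCII, so any
   byte >0x7F there is simply an error -- no multi-byte decoding. */
static bool scan(const uint8_t *buf, size_t n) {
    bool in_string = false;
    for (size_t i = 0; i < n; ) {
        uint8_t c = buf[i];
        if (!in_string) {
            if (c > 0x7F) return false;       /* non-ASCII outside string */
            if (c == '"') in_string = true;
            i++;
        } else if (c == '\\') {
            i += 2;                           /* skip escape pair */
        } else if (c == '"') {
            in_string = false;
            i++;
        } else if (c < 0x80) {
            i++;                              /* plain ASCII, no work */
        } else {
            size_t len = utf8_len(buf + i, n - i);
            if (len == 0) return false;       /* malformed UTF-8 */
            i += len;
        }
    }
    return !in_string;
}
```

Note how an all-ASCII document never enters the multi-byte path at 
all, which is exactly the saving being argued for.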

If you want "plugin" validation of strings, then you also need to 
differentiate between strings so that the user can select which data 
should be plain ASCII, UTF-8, numbers, IDs, etc. Otherwise the user 
will end up doing double validation (you have to scan past bytes >0x7F 
up to the string terminator anyway).

The advantage of integrated validation is that you can use 16-byte 
SIMD registers on the buffer.

I presume you can load 16 bytes, do a bitwise AND on the MSBs, then 
match against the string terminator, and carefully use this to boost 
the performance of simultaneous UTF validation, escape scanning, and 
string-end scanning. A bit tricky, of course.
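A rough sketch of that idea using SSE2 intrinsics (x86 only; a 
hypothetical illustration, not code from the proposal): one 16-byte 
load yields three bitmasks at once, so UTF validation, escape scanning, 
and string-end scanning can share a single pass over the buffer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

typedef struct {
    uint16_t non_ascii;  /* bit i set: byte i has its MSB set (>0x7F) */
    uint16_t quotes;     /* bit i set: byte i == '"' */
    uint16_t escapes;    /* bit i set: byte i == '\\' */
} chunk_masks;

/* Classify 16 bytes in a handful of instructions. p must have at
   least 16 readable bytes. */
static chunk_masks classify16(const uint8_t *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    chunk_masks m;
    /* movemask gathers the MSB of every byte: exactly the >0x7F test */
    m.non_ascii = (uint16_t)_mm_movemask_epi8(v);
    m.quotes    = (uint16_t)_mm_movemask_epi8(
                      _mm_cmpeq_epi8(v, _mm_set1_epi8('"')));
    m.escapes   = (uint16_t)_mm_movemask_epi8(
                      _mm_cmpeq_epi8(v, _mm_set1_epi8('\\')));
    return m;
}
```

If `non_ascii` is zero for a chunk, those 16 bytes need no UTF 
validation at all; the quote and escape masks then drive the string-end 
scan via bit tricks on the same registers.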

> At least no UTF validation is needed. Since all non-ASCII 
> characters will always be composed of bytes >0x7F, a sequence 
> \uXXXX can be assumed to be valid wherever in the string it 
> occurs, and all other bytes that don't belong to an escape 
> sequence are just passed through as-is.

You cannot assume a \uXXXX sequence to be valid if you convert it.
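Concretely: a \uXXXX escape is always well-formed ASCII, but once you 
convert it to a code point you must reject lone UTF-16 surrogates, 
since e.g. \uD800 without a following \uDC00–\uDFFF has no valid UTF-8 
encoding. A small C sketch of the checks involved (helper names are 
illustrative, not from the thread):

```c
#include <stdbool.h>
#include <stdint.h>

static int hexval(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse the XXXX part of \uXXXX; returns the UTF-16 code unit,
   or -1 on a bad hex digit. */
static int32_t parse_u16(const char *p) {
    int32_t v = 0;
    for (int i = 0; i < 4; i++) {
        int d = hexval(p[i]);
        if (d < 0) return -1;
        v = v * 16 + d;
    }
    return v;
}

static bool is_high_surrogate(int32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool is_low_surrogate(int32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into a code point. A converter must
   call this only when a high surrogate is followed by a low one;
   any other arrangement is an error, not passthrough. */
static int32_t combine(int32_t hi, int32_t lo) {
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

So a converting lexer has strictly more to check than a pass-through 
one, which is the asymmetry being pointed out.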
