RFC: std.json successor
via Digitalmars-d
digitalmars-d at puremagic.com
Mon Aug 25 14:53:48 PDT 2014
On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
> But why should UTF validation be the job of the lexer in the
> first place?
Because it saves time: validation is faster when it is integrated.
The most likely usage scenario is receiving REST data over HTTP
that needs validation.
Well, so then I agree with Andrei… array of bytes it is. ;-)
> added as a separate proxy range. But if we end up going for
> validating in the lexer, it would indeed be enough to validate
> inside strings, because the rest of the grammar assumes a
> subset of ASCII.
Not assumes, but defines! :-)
If you have to validate UTF before lexing, then you will end up
needlessly scanning lots of ASCII if the file contains lots of
non-strings or comes from an encoder that only emits pure ASCII.
If you want "plugin" validation of strings, then you also need to
differentiate strings so that the user can select which data should
be plain ASCII, UTF-8, numbers, IDs etc. Otherwise the user will end
up doing double validation (you have to skip past bytes >0x7F up to
the string-end anyway).
The advantage of integrated validation is that you can use 16-byte
SIMD registers on the buffer.
I presume you can load 16 bytes and do a BITWISE-AND on the MSBs,
then match against the string-end byte, and carefully use this to
boost performance of simultaneous UTF validation, escape scanning,
and string-end scanning. A bit tricky, of course.
> At least no UTF validation is needed. Since all non-ASCII
> characters will always be composed of bytes >0x7F, a sequence
> \uXXXX can be assumed to be valid wherever in the string it
> occurs, and all other bytes that don't belong to an escape
> sequence are just passed through as-is.
You cannot assume a \uXXXX escape to be valid if you convert it.