RFC: std.json successor

Sönke Ludwig via Digitalmars-d digitalmars-d at puremagic.com
Tue Aug 26 02:05:07 PDT 2014


On 26.08.2014 10:24, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang at gmail.com> wrote:
> On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
>> That's true. So the ideal solution would be to *assume* UTF-8 when the
>> input is char based and to *validate* if the input is "numeric".
>
> I think you should validate JSON-strings to be UTF-8 encoded even if you
> allow illegal unicode values. Basically ensuring that >0x7f has the
> right number of bytes after it, so you don't get >0x7f as the last byte
> in a string etc.

I think this is a misunderstanding. What I mean is that if the input 
range passed to the lexer is char/wchar/dchar based, the lexer should 
assume that the input is well-formed UTF. After all, this is how D 
strings are defined.

When on the other hand a ubyte/ushort/uint range is used, the lexer 
should validate all string literals.
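This element-type dispatch can be sketched with a compile-time check. The helper name `assumeValidUTF` below is purely illustrative, not part of the proposed API; it just shows how the lexer could pick its validation policy from the range's element type:

```d
import std.range.primitives : ElementType;
import std.traits : isSomeChar;

// Hypothetical helper: a lexer may trust UTF validity only when the
// input's element type is one of D's character types (char/wchar/dchar).
enum assumeValidUTF(R) = isSomeChar!(ElementType!R);

static assert(assumeValidUTF!(string));   // char[] input: assume valid UTF-8
static assert(assumeValidUTF!(wstring));  // wchar[] input: assume valid UTF-16
static assert(!assumeValidUTF!(ubyte[])); // raw bytes: must validate strings
```

Inside the lexer, a `static if (assumeValidUTF!R)` branch would then skip or enable validation of string literals.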

>
>> Well, that's something that's definitely out of the scope of this
>> proposal. Definitely an interesting direction to pursue, though.
>
> Maybe the interface/code structure is or could be designed so that the
> implementation could later be version()'ed to SIMD where possible.

I guess that shouldn't be an issue. From the outside it's just a generic 
range that is passed in, and internally it's always possible to add 
special cases for array inputs. If someone else wants to play around 
with this idea, we could of course also integrate it right away; it's 
just that I personally don't have the time to go to that extreme here.
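As a rough sketch of what such internal special-casing could look like (the function is illustrative, not the proposal's actual code), a `static if` on the range type lets an array fast path coexist with the generic range path, and a SIMD variant could later be `version()`'ed into the array branch:

```d
import std.traits : isDynamicArray;

// Sketch: skip JSON whitespace, with a dedicated branch for array
// inputs that a SIMD implementation could later replace.
void skipWhitespace(R)(ref R input)
{
    static if (isDynamicArray!R)
    {
        // Array fast path: plain index loop over a slice.
        size_t i = 0;
        while (i < input.length &&
               (input[i] == ' ' || input[i] == '\t' ||
                input[i] == '\r' || input[i] == '\n'))
            i++;
        input = input[i .. $];
    }
    else
    {
        // Generic input range path.
        import std.range.primitives : empty, front, popFront;
        while (!input.empty &&
               (input.front == ' ' || input.front == '\t' ||
                input.front == '\r' || input.front == '\n'))
            input.popFront();
    }
}
```

Callers see one generic function either way; only the implementation strategy differs per input type.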

>>> You cannot assume \u… to be valid if you convert it.
>>
>> I meant "X" to stand for a hex digit. The point was just that you
>> don't have to worry about interacting in a bad way with UTF sequences
>> when you find "\uXXXX".
>
> When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a
> legal code point? I guess it is not necessary.

What is validated is that the \uXXXX escapes form valid UTF-16 
surrogate pairs, and those pairs are converted to a single dchar (where 
applicable). This is necessary because otherwise the lexer would 
produce invalid UTF-8 for valid inputs. Apart from that, the value is 
used verbatim as a dchar.

>
> Btw, I believe rapidJSON achieves high speed by converting strings in
> situ, so that if the prefix is escape free it just converts in place
> when it hits the first escape. Thus avoiding some moving.

The same is true for this lexer, at least for array inputs. It 
currently just stores a slice of the string literal in all cases and 
lazily decodes it on first access. While doing that, it first skips any 
escape-sequence-free prefix and returns a plain slice if the whole 
string turns out to be free of escape sequences.
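A simplified sketch of that decoding strategy (illustrative code, not the actual lexer; it handles only \n, \" and \\ where the real lexer covers all JSON escapes):

```d
import std.array : appender;

// Return the stored slice unchanged when it contains no escapes;
// otherwise copy the escape-free prefix in one go and decode the rest.
string unescapeString(string raw)
{
    // Scan for the first backslash.
    size_t i = 0;
    while (i < raw.length && raw[i] != '\\') i++;
    if (i == raw.length) return raw; // escape free: no allocation at all

    auto result = appender!string;
    result.put(raw[0 .. i]); // bulk-copy the escape-free prefix
    while (i < raw.length)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            i++;
            switch (raw[i])
            {
                case 'n':  result.put('\n'); break;
                case '"':  result.put('"');  break;
                case '\\': result.put('\\'); break;
                default:   result.put(raw[i]); break; // (real lexer: error/decode)
            }
        }
        else
            result.put(raw[i]);
        i++;
    }
    return result.data;
}
```

The common case of an escape-free string thus costs a single scan and returns the original slice, which is where the rapidJSON-style speedup comes from.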


More information about the Digitalmars-d mailing list