Can std.json handle Unicode?

Jacob Carlborg doob at me.com
Sun Mar 24 02:55:40 PDT 2013


On 2013-03-23 21:08, Jonathan M Davis wrote:

> Curious. According to this page
> ( http://www.aivosto.com/vbtips/control-characters.html ), both space and
> delete are ASCII control characters (though neither std.ascii nor C's iscntrl
> deems space to be a control character), but neither of them is a control
> character according to recent Unicode standards.
> This section on DEL
>
> http://www.aivosto.com/vbtips/control-characters.html#DEL
>
> seems to say that DEL should basically be ignored. It seems to think that NUL
> should be treated the same way (and basically complains that languages like C
> ever treated it as a terminator).
>
> If I look at the RFC for json ( http://www.rfc-editor.org/rfc/rfc4627.txt ),
> it specifically lists control characters as being U+0000 through U+001F, which
> does _not_ include DEL or _any_ Unicode-specific control character. So, using
> either std.ascii or std.uni's isControl would be wrong. It specifically needs
> to check whether a character is < 32 when checking for control characters.
>
> And the grammar rule for string is
>
>           string = quotation-mark *char quotation-mark
>
>           char = unescaped /
>                  escape (
>                      %x22 /          ; "    quotation mark  U+0022
>                      %x5C /          ; \    reverse solidus U+005C
>                      %x2F /          ; /    solidus         U+002F
>                      %x62 /          ; b    backspace       U+0008
>                      %x66 /          ; f    form feed       U+000C
>                      %x6E /          ; n    line feed       U+000A
>                      %x72 /          ; r    carriage return U+000D
>                      %x74 /          ; t    tab             U+0009
>                      %x75 4HEXDIG )  ; uXXXX                U+XXXX
>
>           escape = %x5C              ; \
>
>           quotation-mark = %x22      ; "
>
>           unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
>
> So, it looks like the only characters that should be considered valid inside
> the double-quotes of a string are \ (which indicates the beginning of an
> escape sequence) and the characters listed in unescaped. So, in decimal, the
> unescaped ones would be 32 and 33, 35 - 91, and everything 93 and greater
> (up to hex 10FFFF). DEL is 127, so it should be considered valid.
>
> So, if std.json is using isControl, my guess is that whoever wrote that was
> not careful enough with the grammar (though it's easy enough to assume that
> everyone means the same thing by control characters), and I'd be concerned
> that std.json is not handling this set of grammar rules correctly with more
> characters than just DEL.

I see. Yes, one could think that "control character" would mean the same 
thing in every situation for a given encoding.
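
To make the difference concrete, here is a minimal sketch of the check the quoted `unescaped` rule implies, written in C only because the thread already mentions C's iscntrl. Both function names are hypothetical and are not taken from std.json or any library. The first follows the grammar literally (so it also accepts the surrogate range, which the rule does not exclude); the second shows what an iscntrl-based check would decide for a single ASCII byte, assuming the default "C" locale.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when code point cp may appear unescaped inside a
   JSON string, following RFC 4627's rule
       unescaped = %x20-21 / %x23-5B / %x5D-10FFFF                          */
bool json_is_unescaped(uint32_t cp)
{
    if (cp < 0x20)  return false;   /* U+0000..U+001F: JSON's control chars */
    if (cp == 0x22) return false;   /* quotation mark must be escaped       */
    if (cp == 0x5C) return false;   /* reverse solidus starts an escape     */
    return cp <= 0x10FFFF;          /* everything else up to the Unicode max */
}

/* For comparison only: the decision an iscntrl-based check would make for an
   ASCII byte. In the "C" locale iscntrl also reports DEL (127) as a control
   character, so this check rejects a character the grammar allows.         */
bool iscntrl_based_check(unsigned char c)
{
    return !iscntrl(c) && c != '"' && c != '\\';
}
```

For DEL the two checks disagree: json_is_unescaped(0x7F) accepts it, while iscntrl_based_check(0x7F) rejects it. That is the mismatch described above.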

-- 
/Jacob Carlborg


More information about the Digitalmars-d-learn mailing list