Can std.json handle Unicode?
Jacob Carlborg
doob at me.com
Sun Mar 24 02:55:40 PDT 2013
On 2013-03-23 21:08, Jonathan M Davis wrote:
> Curious. According to this page ( http://www.aivosto.com/vbtips/control-characters.html ) both space and delete are ASCII control characters (though
> neither std.ascii nor C's iscntrl deem space to be a control character), but
> neither of them are control characters according to recent Unicode standards.
> This section on DEL
>
> http://www.aivosto.com/vbtips/control-characters.html#DEL
>
> seems to say that DEL should basically be ignored. It seems to think that NUL
> should be treated the same way (and basically complains that languages like C
> ever treated it as a terminator).
>
> If I look at the RFC for json ( http://www.rfc-editor.org/rfc/rfc4627.txt ),
> it specifically lists control characters as being U+0000 through U+001F, which
> does _not_ include DEL or _any_ Unicode-specific control character. So, using
> either std.ascii or std.uni's isControl would be wrong. It specifically needs
> to check whether a character is < 32 when checking for control characters.
>
> And the grammar rule for string is
>
> string = quotation-mark *char quotation-mark
>
> char = unescaped /
>        escape (
>            %x22 /          ; "    quotation mark  U+0022
>            %x5C /          ; \    reverse solidus U+005C
>            %x2F /          ; /    solidus         U+002F
>            %x62 /          ; b    backspace       U+0008
>            %x66 /          ; f    form feed       U+000C
>            %x6E /          ; n    line feed       U+000A
>            %x72 /          ; r    carriage return U+000D
>            %x74 /          ; t    tab             U+0009
>            %x75 4HEXDIG )  ; uXXXX                U+XXXX
>
> escape = %x5C              ; \
>
> quotation-mark = %x22      ; "
>
> unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
>
> So, it looks like the only characters that should be considered valid inside
> the double-quotes of a string which aren't escaped are \ (which indicates the
> beginning of an escape sequence), and the characters listed in unescaped. So,
> in decimal, that would be 32 and 33, 35 - 91, and everything 93 and greater
> (up to hex 10FFFF, i.e. 1114111). DEL is 127, so it should be considered valid.
>
> So, if std.json is using isControl, my guess is that whoever wrote that was
> not careful enough with the grammar (though it's easy enough to assume that
> everyone means the same thing by control characters), and I'd be concerned
> that std.json is not handling this set of grammar rules correctly with more
> characters than just DEL.
I see. Yes, one could think that "control character" would mean the same
thing in every situation for a given encoding.
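
As a rough illustration of the rule quoted above, here is a small sketch
written in Python rather than D. It is not how std.json is implemented, and
the function names are made up for the example. It only compares code points
against the ranges given in RFC 4627.

```python
def is_json_control(ch: str) -> bool:
    """Control character in the RFC 4627 sense: U+0000 through U+001F only."""
    return ord(ch) < 0x20


def is_unescaped(ch: str) -> bool:
    """True if ch matches: unescaped = %x20-21 / %x23-5B / %x5D-10FFFF."""
    cp = ord(ch)
    return (0x20 <= cp <= 0x21) or (0x23 <= cp <= 0x5B) or (0x5D <= cp <= 0x10FFFF)


# DEL (U+007F) is not a control character by the JSON definition,
# so it may appear unescaped inside a string.
print(is_json_control("\x7f"), is_unescaped("\x7f"))
```

With this definition only the quotation mark, the backslash and code points
below 32 are rejected as unescaped characters.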
--
/Jacob Carlborg