Can std.json handle Unicode?

Jonathan M Davis jmdavisProg at gmx.com
Sat Mar 23 13:08:09 PDT 2013


On Saturday, March 23, 2013 13:22:42 Jacob Carlborg wrote:
> I'm wondering because I see that std.json uses isControl, isDigit and
> isHexDigit from std.ascii and not std.uni. This also causes a problem
> with a pull request I recently made for std.net.isemail. In one of its
> unit tests the DEL character (127) is used. According to
> std.ascii.isControl this is a control character, but not according to
> std.uni.isControl. This will cause the test suite for the pull request
> not to be run since std.json chokes on the DEL character.
> 
> https://github.com/D-Programming-Language/phobos/pull/1217

Curious. According to this page
( http://www.aivosto.com/vbtips/control-characters.html ), both space and
delete are ASCII control characters (though neither std.ascii nor C's iscntrl
deems space to be a control character), but neither of them is a control
character according to recent Unicode standards. This section on DEL

http://www.aivosto.com/vbtips/control-characters.html#DEL

seems to say that DEL should basically be ignored. It seems to think that NUL 
should be treated the same way (and basically complains that languages like C 
ever treated it as a terminator).

If I look at the RFC for json ( http://www.rfc-editor.org/rfc/rfc4627.txt ), 
it specifically lists control characters as being U+0000 through U+001F, which 
does _not_ include DEL or _any_ Unicode-specific control character. So, using 
either std.ascii's or std.uni's isControl would be wrong. When checking for 
control characters, std.json specifically needs to check whether a character 
is < 32 (i.e. < U+0020).
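
As a rough illustration of that check (isJsonControl is a name I made up for 
this sketch; it is not something std.json actually provides), the test could 
be as small as this, with std.ascii.isControl shown alongside for contrast:

```d
import std.ascii : asciiIsControl = isControl;

/// Hypothetical helper: true only for the characters that RFC 4627
/// calls control characters, i.e. U+0000 through U+001F.
bool isJsonControl(dchar c) pure nothrow @safe
{
    return c < 0x20;
}

void main()
{
    assert(isJsonControl('\0'));
    assert(isJsonControl('\t'));
    assert(isJsonControl('\x1F'));
    assert(!isJsonControl(' '));    // space is not a control character here
    assert(!isJsonControl('\x7F')); // DEL is outside U+0000..U+001F

    // std.ascii.isControl also counts DEL, which is where the mismatch
    // with the RFC comes from.
    assert(asciiIsControl('\x7F'));
}
```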

And the grammar rule for string is

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

So, it looks like the only things that should be considered valid inside the 
double-quotes of a string are escape sequences, which begin with \ (the 
reverse solidus), and the characters listed in unescaped. In decimal, 
unescaped covers 32 and 33, 35 - 91, and everything 93 and greater (up to 
0x10FFFF). DEL is 127, so it should be considered valid without being escaped.
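
To make those ranges concrete, here is a minimal sketch of the unescaped rule 
as a D predicate. The name isUnescapedJsonChar is hypothetical and this is 
not how std.json is written; it just transcribes the three ranges from the 
grammar above:

```d
/// Hypothetical helper mirroring the RFC 4627 rule
///     unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
bool isUnescapedJsonChar(dchar c) pure nothrow @safe
{
    return (c >= 0x20 && c <= 0x21)
        || (c >= 0x23 && c <= 0x5B)
        || (c >= 0x5D && c <= 0x10FFFF);
}

void main()
{
    assert(isUnescapedJsonChar(' '));
    assert(isUnescapedJsonChar('/'));          // solidus may appear unescaped
    assert(isUnescapedJsonChar('\x7F'));       // DEL is allowed as-is
    assert(isUnescapedJsonChar('\U0010FFFF')); // top of the range

    assert(!isUnescapedJsonChar('"'));   // U+0022 ends the string
    assert(!isUnescapedJsonChar('\\'));  // U+005C starts an escape
    assert(!isUnescapedJsonChar('\n'));  // below U+0020, must be escaped
}
```

A string parser built on something like this would accept any character for 
which the predicate is true, treat \ as the start of an escape, treat " as 
the end of the string, and reject everything else.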

So, if std.json is using isControl, my guess is that whoever wrote that was 
not careful enough with the grammar (though it's easy enough to assume that 
everyone means the same thing by control characters), and I'd be concerned 
that std.json is not handling this set of grammar rules correctly with more 
characters than just DEL.

- Jonathan M Davis


More information about the Digitalmars-d-learn mailing list