[Issue 1357] Cannot use FFFF and FFFE in Unicode escape sequences.

Mon Oct 1 04:24:58 PDT 2007

http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #8 from smjg at iname.com  2007-10-01 06:24 -------
(In reply to comment #7)
> You are right.  The Phobos Unicode functions are designed not to 
> fire an exception when, for example, decoding a UTF-8 sequence 
> resulting in the codepoint U+FFFF or U+FFFE.  But the problem is 
> that the average guy/gal doesn't have the slightest clue about the 
> technicalities of Unicode, and so would assume that it's perfectly 
> fine to use those functions for normal, non-internal purposes.  So 
> in effect programs would accept illegal input and also produce 
> output with illegal UTF-8 and UTF-16 sequences as well as UTF-32 
> strings.

I think the best place to deal with this is in documentation.  The 
need to check for U+FFFF and U+FFFE exists only when processing 
input.  It would be inefficient to keep checking for these codepoints 
every time an internal conversion is performed.  It is therefore 
sensible to keep validation separate from encoding/decoding, and 
inform the library user that such validation is necessary.
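
A minimal sketch of that separation, written in Python rather than D 
so that it does not presuppose any Phobos API.  The check runs once, 
where data enters the program, and the conversions used internally do 
no checking at all.  The names reject_noncharacters and read_input 
are hypothetical, made up for this illustration.

```python
# Code points this thread is concerned with: U+FFFE and U+FFFF are
# noncharacters that should not appear in interchanged text.
FORBIDDEN = frozenset({0xFFFE, 0xFFFF})


def reject_noncharacters(text):
    """Raise ValueError if text contains U+FFFE or U+FFFF (hypothetical helper)."""
    for index, ch in enumerate(text):
        if ord(ch) in FORBIDDEN:
            raise ValueError("U+%04X at index %d" % (ord(ch), index))
    return text


def read_input(raw_bytes):
    """Decode external UTF-8 data, validating once at the boundary."""
    return reject_noncharacters(raw_bytes.decode("utf-8"))


# Internal conversions skip the check, so a sentinel survives a round trip.
internal = "abc\uffff"
assert internal.encode("utf-8").decode("utf-8") == internal
```

Only read_input pays for the scan; everything downstream of it can 
convert freely.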

> Yes, but the encoding and decoding functions in Phobos use 
> isValidDchar() to verify if the character to be encoded or the 
> character that was decoded is a valid dchar.  I'm not sure what the 
> solution could be though.  Two separate modules maybe, one that is 
> safe for data interchange and the other one for internal data 
> processing.

Or an 'internal' parameter on the translation functions.  This raises 
the question: Should this parameter be optional, and if so, what 
should the default be?
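
To make the question concrete, here is a rough sketch of what such a 
parameter could look like, in Python rather than D.  The function 
encode_utf8 and its 'internal' flag are hypothetical, and the default 
picked here (strict unless told otherwise) is only one possible 
answer, not a settled one.

```python
def encode_utf8(code_point, internal=False):
    """Encode one code point as UTF-8 bytes.

    `internal` is a hypothetical flag: when False, U+FFFE and U+FFFF are
    rejected; when True, they are encoded like any other code point.
    The default chosen here is an assumption for illustration only.
    """
    if not internal and code_point in (0xFFFE, 0xFFFF):
        raise ValueError("U+%04X is not allowed in interchange" % code_point)
    return chr(code_point).encode("utf-8")


# Ordinary characters are unaffected by the flag.
assert encode_utf8(0x41) == b"A"
# Internal callers can still encode the sentinel values.
assert encode_utf8(0xFFFF, internal=True) == b"\xef\xbf\xbf"
```

With a strict default, code that forgets the flag fails loudly on 
U+FFFE and U+FFFF; with a permissive default it would pass them 
through silently.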

> Or perhaps add a function isEncodable() which is like isValidDchar 
> but excludes U+FFFE and U+FFFF.  This new function should be used 
> to completely disallow U+FFFE and U+FFFF to be encoded as UTF-8 or 
> UTF-16.

Uh, if we're going to use those names, ISTM the definitions should be 
the other way round.  But maybe an 'internal' parameter is the best 
solution here as well.

>> It certainly ought to be possible to include U+FFFE and U+FFFF in 
>> string literals by some means or another, as such things are 
>> necessarily being put there for internal use by the application 
>> being developed.
> 
> Maybe we should not allow the programmer to use the escape 
> sequences \uFFFE \uFFFF, \U0000FFFF etc.  Instead one could do the 
> following as Thomas suggested:
> 
> char[] str = "\xFF\xFFasdf";
> char[] str = x"FFFF""asdf"; // Adjacent strings are concatenated implicitly.

But U+FFFF isn't "\xFF\xFF".  Its UTF-8 encoding is "\xEF\xBF\xBF"; 
"\xFF\xFF" isn't valid UTF-8 at all.
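
The arithmetic can be checked by hand.  The sketch below (Python, 
with a made-up helper name utf8_three_byte) packs a code point in the 
range U+0800..U+FFFF into the three-byte UTF-8 pattern 
1110xxxx 10xxxxxx 10xxxxxx.  It deliberately does not check for 
surrogates or anything else.

```python
def utf8_three_byte(code_point):
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not in the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


assert utf8_three_byte(0xFFFF) == b"\xef\xbf\xbf"
assert utf8_three_byte(0xFFFE) == b"\xef\xbf\xbe"

# The byte 0xFF never occurs in UTF-8, so "\xFF\xFF" cannot be decoded.
try:
    b"\xff\xff".decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass
```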

I guess we should have whole new escapes specifically for these codepoints.
