[Issue 1357] Cannot use FFFF and FFFE in Unicode escape sequences.

Mon Oct 1 02:53:43 PDT 2007

http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #7 from aziz.kerim at gmail.com  2007-10-01 04:53 -------
(In reply to comment #6)
> When that talks of "decoders", is it talking about:
> (a) decoding Unicode text from files?
> (b) translating data used internally by an application?
I'm not sure, but I guess it's (a).
> 
> If (a), then obviously it should reject U+FFFE and U+FFFF.  If (b), then it
> should allow them.  The std.utf.toUTF* functions accept these codepoints for
> this reason, and the behaviour of isValidDchar is by the same design:
You are right. The Phobos Unicode functions are designed not to throw an
exception when, for example, decoding a UTF-8 sequence yields the codepoint
U+FFFF or U+FFFE. But the problem is that the average guy/gal doesn't have the
slightest clue about the technicalities of Unicode, and so would assume that
it's perfectly fine to use those functions for normal, non-internal purposes.
In effect, programs would accept illegal input and also produce illegal UTF-8,
UTF-16 and UTF-32 output.
> 
> /*******************************
>  * Test if c is a valid UTF-32 character.
>  *
>  * \uFFFE and \uFFFF are considered valid by this function,
>  * as they are permitted for internal use by an application,
>  * but they are not allowed for interchange by the Unicode standard.
>  *
>  * Returns: true if it is, false if not.
>  */
> 
> So it's not a bug in Phobos.  Just an omission - of a function to check that a
> Unicode string is valid for data interchange (i.e. contains no U+FFFE or U+FFFF
> codepoints as well as being otherwise valid).
Yes, but the encoding and decoding functions in Phobos use isValidDchar() to
check whether the character to be encoded, or the character that was decoded,
is a valid dchar. I'm not sure what the solution could be, though. Maybe two
separate modules: one that is safe for data interchange and another for
internal data processing. Or perhaps add a function isEncodable(), which is
like isValidDchar() but excludes U+FFFE and U+FFFF. This new function should be
used to prevent U+FFFE and U+FFFF from ever being encoded as UTF-8 or UTF-16.
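
A minimal sketch of what such an isEncodable() could look like. The function is
hypothetical (it is not part of Phobos), and it assumes std.utf.isValidDchar()
keeps accepting U+FFFE and U+FFFF, as its documentation quoted in this thread
says:

```d
import std.utf : isValidDchar;

/// Hypothetical helper, not part of Phobos: like isValidDchar(), but it
/// also rejects U+FFFE and U+FFFF, which are not allowed for interchange.
bool isEncodable(dchar c)
{
    return isValidDchar(c) && c != 0xFFFE && c != 0xFFFF;
}

unittest
{
    // cast() is used here because the \uFFFE and \uFFFF escapes are
    // exactly what this issue is about.
    assert(isEncodable('a'));
    assert(!isEncodable(cast(dchar) 0xFFFE));
    assert(!isEncodable(cast(dchar) 0xFFFF));
}
```

The encoding functions could call this check instead of isValidDchar() when
producing data for interchange, leaving isValidDchar() for internal use.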
> 
> It certainly ought to be possible to include U+FFFE and U+FFFF in string
> literals by some means or another, as such things are necessarily being put
> there for internal use by the application being developed.
> 

Maybe we should not allow the programmer to use the escape sequences \uFFFE,
\uFFFF, \U0000FFFF etc. Instead, one could do the following, as Thomas suggested:

char[] str = "\xEF\xBF\xBFasdf"; // EF BF BF is the UTF-8 encoding of U+FFFF.
char[] str2 = x"EFBFBF" "asdf";  // Adjacent strings are concatenated implicitly.

--