[Issue 1357] Cannot use FFFF and FFFE in Unicode escape sequences.

Tue Oct 2 05:47:52 PDT 2007

http://d.puremagic.com/issues/show_bug.cgi?id=1357

------- Comment #9 from aziz.kerim at gmail.com  2007-10-02 07:47 -------
Ok, the best thing we can do to solve this problem is actually read the Unicode
5.0 standard and determine what it actually has to say about this. I did read
the relevant parts of the standard and here is what I found out:

First of all, U+FFFE and U+FFFF are not the only code points that are intended
for internal use only.

Quoting from ch02.pdf page 27:

> Noncharacters. Sixty-six code points are not used to encode characters. Noncharacters
> consist of U+FDD0..U+FDEF and any code point ending in the value FFFE₁₆ or FFFF₁₆—
> that is, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. (See
> Section 16.7, Noncharacters.)

A function testing for a noncharacter could look like this:
bool isNoncharacter(dchar d)
{
  return (0xFDD0 <= d && d <= 0xFDEF) || // U+FDD0..U+FDEF: 32 code points
         (d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE); // last two code points of each of the 17 planes: 34 code points
}
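
As a quick sanity check of the function above, one can count how many code
points it accepts; the standard says there should be exactly sixty-six. This is
only a sketch: countNoncharacters is a helper name made up for this check, the
function is repeated so the snippet compiles on its own, and it assumes a D
compiler that supports foreach over a range.

```d
/// Same test as above, repeated so that this snippet is self-contained.
bool isNoncharacter(dchar d)
{
    return (0xFDD0 <= d && d <= 0xFDEF) ||
           (d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE);
}

/// Hypothetical helper: counts the noncharacters in U+0000..U+10FFFF.
size_t countNoncharacters()
{
    size_t count = 0;
    foreach (uint c; 0 .. 0x110000)
        if (isNoncharacter(cast(dchar) c))
            ++count;
    return count;
}

unittest
{
    // 32 code points in U+FDD0..U+FDEF plus 2 in each of the 17 planes.
    assert(countNoncharacters() == 66);
}
```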

Let us read a bit further. Quoting from ch02.pdf page 28:

> • Noncharacter code points are reserved for internal use, such as for sentinel
>   values. They should never be interchanged. They do, however, have well-formed
>   representations in Unicode encoding forms and survive conversions between
>   encoding forms. This allows sentinel values to be preserved internally across
>   Unicode encoding forms, even though they are not designed to be used in open
>   interchange.

So the standard says that noncharacters have well-formed representations in
UTF-8 and UTF-16. This is good news, because it tells us that every escape
sequence whose value is not higher than U+10FFFF and is not a surrogate code
point (U+D800 - U+DFFF) can be encoded as UTF-8 or UTF-16. Therefore I think we
should allow programmers to write such escape sequences, even if they denote
noncharacters.
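
To make the rule concrete, here is a rough sketch of the check an escape
sequence value would have to pass, together with a hand-written UTF-8 encoder
showing that U+FFFF gets an ordinary three-byte encoding. The names
isEncodableScalar and toUTF8Bytes are made up for this example; this is not
how the compiler or Phobos does it.

```d
/// Hypothetical check: true if the value of a \u or \U escape can be
/// encoded, i.e. it is at most U+10FFFF and not a surrogate code point.
bool isEncodableScalar(uint c)
{
    return c <= 0x10FFFF && !(0xD800 <= c && c <= 0xDFFF);
}

/// Hypothetical encoder: returns the UTF-8 bytes for `c`.
/// Noncharacters get no special treatment here.
ubyte[] toUTF8Bytes(uint c)
{
    assert(isEncodableScalar(c));
    if (c < 0x80)
        return [cast(ubyte) c];
    if (c < 0x800)
        return [cast(ubyte)(0xC0 | (c >> 6)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    if (c < 0x10000)
        return [cast(ubyte)(0xE0 | (c >> 12)),
                cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    return [cast(ubyte)(0xF0 | (c >> 18)),
            cast(ubyte)(0x80 | ((c >> 12) & 0x3F)),
            cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
            cast(ubyte)(0x80 | (c & 0x3F))];
}

unittest
{
    assert(isEncodableScalar(0xFFFE) && isEncodableScalar(0xFFFF));
    assert(!isEncodableScalar(0xD800) && !isEncodableScalar(0x110000));

    ubyte[] expected = [0xEF, 0xBF, 0xBF]; // UTF-8 for U+FFFF
    assert(toUTF8Bytes(0xFFFF) == expected);
}
```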

The next problem we need to think about is, what to do with noncharacters if
they appear as encoded characters in UTF-8 or UTF-16 source text or as code
points in UTF-32 source text.

The Unicode standard says in ch16.pdf on page 549:

> Applications are free to use any of these noncharacter code points internally but should
> never attempt to exchange them. If a noncharacter is received in open interchange, an
> application is not required to interpret it in any way. It is good practice, however, to
> recognize it as a noncharacter and to take appropriate action, such as removing it from
> the text. Note that Unicode conformance freely allows the removal of these characters.
> (See conformance clause C7 in Section 3.2, Conformance Requirements.)

I guess Walter has to decide what a D lexer should do when it encounters a
noncharacter in the source text. My suggestion would be to ignore noncharacters,
i.e. not to check for them at all, in favour of a faster lexer (although
probably not many people are going to stuff their source text with unialpha
identifiers and comments/strings with Unicode characters).
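
For comparison, the alternative that the standard calls good practice, removing
noncharacters from the text, would look roughly like the sketch below. It costs
one extra test per decoded code point, which is the overhead a lexer avoids by
ignoring them. stripNoncharacters is a made-up name, and the input is assumed to
be already decoded into code points; a real lexer would work on the raw source
buffer instead.

```d
/// Noncharacter test, repeated so that this snippet is self-contained.
bool isNoncharacter(dchar d)
{
    return (0xFDD0 <= d && d <= 0xFDEF) ||
           (d <= 0x10FFFF && (d & 0xFFFF) >= 0xFFFE);
}

/// Hypothetical filter: returns a copy of `text` without noncharacters.
dchar[] stripNoncharacters(const(dchar)[] text)
{
    dchar[] result;
    foreach (dchar c; text)
        if (!isNoncharacter(c))
            result ~= c;
    return result;
}

unittest
{
    dchar[] input = ['a', cast(dchar) 0xFFFF, 'b', cast(dchar) 0xFDD0];
    dchar[] expected = ['a', 'b'];
    assert(stripNoncharacters(input) == expected);
}
```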
