Code points & code units [was Re: customized "new" and pointer alignment]
Chris Paulson-Ellis
chris at edesix.com
Tue Jan 30 14:47:17 PST 2007
Jarrett Billingsley wrote:
> The size of a char variable is always 8 bits, because it's a UTF-8
> something-or-other. It's not a codepoint, it's a ...? But it's always 8
> bits.
The Unicode term is "code unit".
For the benefit of the Unicode uninitiated, the D spec could be clearer
on this point. Despite its name, a char variable does not hold a
character, but rather a single unit of the UTF-8 character encoding.
For example, the UTF-8 code unit sequence 0xE2 0x82 0xAC decodes to
U+20AC, the Unicode code point for the Euro sign, €.
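That round trip can be checked in any language with explicit UTF-8 codecs; a quick sketch in Python (the byte values are the same ones a D char[] would hold):

```python
# Three UTF-8 code units decode to the single code point U+20AC.
euro_bytes = bytes([0xE2, 0x82, 0xAC])
euro = euro_bytes.decode("utf-8")
assert euro == "\u20ac"        # the Euro sign
assert len(euro_bytes) == 3    # three code units, but only one character
```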
Similarly, the wchar type is defined to be a UTF-16 code unit, which is
usually identical to the corresponding code point; code points above
U+FFFF, however, are encoded using two code units (called a surrogate
pair).
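A surrogate pair is easy to inspect directly; here is a Python illustration (U+1D11E, MUSICAL SYMBOL G CLEF, is an arbitrary choice of code point above U+FFFF):

```python
# A code point above U+FFFF needs two UTF-16 code units.
clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF
units = clef.encode("utf-16-be")
assert len(units) == 4                     # 2 code units x 2 bytes each
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
assert 0xD800 <= high <= 0xDBFF            # lead (high) surrogate
assert 0xDC00 <= low <= 0xDFFF             # trail (low) surrogate
```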
The dchar type is a UTF-32 code unit. These are identical to the code
points, except for values above U+10FFFF, which are beyond the range of
Unicode. You are free to use such out-of-range values to mean something
within your application, but they will never represent Unicode characters.
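In UTF-32, by contrast, every code point occupies exactly one code unit, and values past U+10FFFF are not Unicode at all; again a Python sketch for illustration:

```python
# One UTF-32 code unit per code point, regardless of the value.
clef = "\U0001D11E"
units = clef.encode("utf-32-be")
assert len(units) == 4                        # a single 32-bit code unit
assert int.from_bytes(units, "big") == 0x1D11E
# Values beyond U+10FFFF lie outside the Unicode range entirely.
try:
    chr(0x110000)
except ValueError:
    pass                                      # not a valid code point
```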
Another complication arises from the fact that the UTF encodings can
encode "non-character" code points (any code point ending in FFFE or
FFFF, such as U+FFFE or U+3FFFF). Similarly, the "surrogates" (the code
points with the same values as the code units that UTF-16 uses to encode
code points above U+FFFF) are not characters, even though byte sequences
for them can be formed in UTF-8 or UTF-32. So even a char or wchar
sequence that decodes without error, or a single dchar, may not be a
"character". Again, you can use these code points within your
application, but in the words of the Unicode code charts for U+FFF[EF],
they are "not valid for interchange".
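The distinction shows up in practice: a noncharacter like U+FFFE round-trips through UTF-8 without complaint, while a strict codec (such as Python's) refuses to encode a lone surrogate at all. A Python illustration:

```python
# U+FFFE is a noncharacter: well-formed in UTF-8, but reserved for
# internal use ("not valid for interchange").
nc = "\uFFFE"
assert nc.encode("utf-8") == b"\xef\xbf\xbe"
assert nc.encode("utf-8").decode("utf-8") == nc
# Surrogate code points are not characters; a strict UTF-8 codec
# rejects them outright.
try:
    "\uD800".encode("utf-8")
except UnicodeEncodeError:
    pass                         # lone surrogates cannot be encoded
```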
Nothing is ever crystal clear in Unicode land.
Chris.
More information about the Digitalmars-d
mailing list