Code points & code units [was Re: customized "new" and pointer alignment]

Chris Paulson-Ellis chris at edesix.com
Tue Jan 30 14:47:17 PST 2007


Jarrett Billingsley wrote:
> The size of a char variable is always 8 bits, because it's a UTF-8 
> something-or-other.  It's not a codepoint, it's a ...?  But it's always 8 
> bits.

The Unicode term is "code unit".

For the benefit of the Unicode uninitiated, the D spec could be clearer 
on this point. Despite its name, a char variable does not hold a 
character, but rather a single code unit of the UTF-8 encoding.

For example, the UTF-8 code unit sequence 0xE2 0x82 0xAC decodes into 
U+20AC, the Unicode code point for the Euro currency symbol character, €.
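To make that concrete, here is a small sketch (in Python rather than D, purely for illustration; the byte sequence is the same in any language):

```python
# Three UTF-8 code units decode to one code point, U+20AC (the Euro sign).
euro = bytes([0xE2, 0x82, 0xAC]).decode("utf-8")
print(euro)            # €
print(hex(ord(euro)))  # 0x20ac -- one code point, three code units
```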

Similarly, the wchar type is defined to be a UTF-16 code unit. A code 
unit is usually the same as the corresponding code point, but code 
points > U+FFFF are encoded using 2 code units (called a surrogate pair).
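A quick illustration, again in Python for convenience; U+1D11E (MUSICAL SYMBOL G CLEF) is just an arbitrary code point above U+FFFF:

```python
# A code point above U+FFFF needs two 16-bit code units in UTF-16:
# a high surrogate followed by a low surrogate.
clef = "\U0001D11E"
units = clef.encode("utf-16-be")
print(units.hex())      # d834dd1e -- high surrogate D834, low surrogate DD1E
print(len(units) // 2)  # 2 code units for 1 code point
```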

The dchar type is a UTF-32 code unit. These are the same as the code 
points, except for values > U+10FFFF, which are beyond the range of 
Unicode. You are free to use out-of-range values to mean something 
within your application, but they will never represent Unicode characters.
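For example, a conforming Unicode implementation enforces that ceiling; Python (used here only for illustration) refuses to construct a character beyond it:

```python
# U+10FFFF is the last code point Unicode defines.
print(hex(ord(chr(0x10FFFF))))   # 0x10ffff -- still in range
try:
    chr(0x110000)                # one past the end of the Unicode range
except ValueError:
    print("0x110000 is beyond the Unicode range")
```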

Another complication arises from the fact that the UTF encodings can 
encode "non-character" code points (anything ending in FFFE or FFFF, 
such as U+FFFE or U+3FFFF, plus the range U+FDD0 to U+FDEF). Similarly, 
the "surrogates" (the code points with the same values as the code units 
used by UTF-16 to encode code points > U+FFFF) are not characters, even 
though they can be represented in UTF-8 or UTF-32. So even a char or 
wchar sequence that decodes okay, or a single dchar, may not be a 
"character". Again, you can use these code points within your 
application, but in the words of the code chart for U+FFF[EF], they are 
"not valid for interchange".
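A small Python sketch of both cases (Python happens to let noncharacters round-trip but rejects lone surrogates on encode):

```python
# The noncharacter U+FFFE decodes from UTF-8 without complaint, even
# though it is "not valid for interchange".
print(b"\xef\xbf\xbe".decode("utf-8") == "\ufffe")  # True -- decodes okay

# A lone surrogate such as U+D800 is a valid UTF-16 code unit value,
# but not a character, so encoding it as UTF-8 fails.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate: not encodable as UTF-8")
```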

Nothing is ever crystal clear in Unicode land.

Chris.
