Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

Steven Schveighoffer schveiguy at gmail.com
Fri Dec 2 22:33:39 UTC 2022


On 12/2/22 4:18 PM, thebluepandabear wrote:
> Hello (noob question),
> 
> I am reading a book about D by Ali, and he talks about the different 
> char types: char, wchar, and dchar. He says that char stores a UTF-8 
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 
> code unit, this makes sense.
> 
> He then goes on to say that:
> 
> "Contrary to some other programming languages, characters in D may 
> consist of
> different numbers of bytes. For example, because 'Ğ' must be represented 
> by at
> least 2 bytes in Unicode, it doesn't fit in a variable of type char. On 
> the other
> hand, because dchar consists of 4 bytes, it can hold any Unicode 
> character."
> 
> It's his explanation as to why this code doesn't compile even though Ğ 
> is a UTF-8 code unit:
> 
> ```D
> char utf8 = 'Ğ';
> ```
> 
> But I don't really understand this? What does it mean that it 'must be 
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so 
> I am confused why it doesn't fit, I don't think it was explained well in 
> the book.
> 
> Any help would be appreciated.
> 


A *code point* is a value out of the Unicode standard. [Code 
points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, 
combining marks, or other things (not sure of the full list) that reside 
in the standard. When you want to figure out, "hmm... what value does 
the emoji 👍 have?", that value is a *code point*. For Unicode, this is 
a number from 0 to 0x10FFFF. (BTW, for 👍 it's 0x1F44D.)
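To make that concrete, here's a minimal sketch (the cast and format string are just one way to print the value) showing that a `dchar` holds a whole code point directly:

```d
import std.stdio;

void main()
{
    // A dchar is wide enough for any Unicode code point.
    dchar thumbsUp = '👍';
    writefln("U+%05X", cast(uint) thumbsUp); // prints: U+1F44D
}
```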

UTF-X are various *encodings* of Unicode. UTF-8 is an encoding of Unicode 
where 1 to 4 bytes (called *code units*) encode a single Unicode *code 
point*.

There are various encodings, and all can be decoded to the same list of 
*code points*. The most direct form is UTF-32, where each *code point* 
is also a *code unit*.

`char` is a UTF-8 code unit. `wchar` is a UTF-16 code unit, and `dchar` 
is a UTF-32 code unit.
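As a quick sketch, you can see all three at once by storing the same one-character string in each encoding (array `.length` counts code units, not characters):

```d
void main()
{
    // The same single code point Ğ (0x11E), in the three encodings:
    string  s = "Ğ";   // array of char  (UTF-8 code units, 1 byte each)
    wstring w = "Ğ"w;  // array of wchar (UTF-16 code units, 2 bytes each)
    dstring d = "Ğ"d;  // array of dchar (UTF-32 code units, 4 bytes each)

    assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);
    assert(s.length == 2); // Ğ needs two UTF-8 code units
    assert(w.length == 1); // but only one UTF-16 code unit
    assert(d.length == 1); // one UTF-32 code unit == one code point
}
```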

The reason you can't encode a Ğ into a single `char` is that its 
code point is 0x11E, which does not fit into a single `char`. Therefore, 
an encoding scheme is used to put it into two `char`s.
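Here's a short sketch of what that two-`char` encoding actually looks like (the byte values 0xC4 0x9E are the standard UTF-8 encoding of U+011E):

```d
import std.stdio;

void main()
{
    // char utf8 = 'Ğ'; // error: code point 0x11E doesn't fit in one char
    dchar cp = 'Ğ';     // fine: a dchar holds any code point
    assert(cp == 0x11E);

    // As UTF-8, code point 0x11E is encoded into two code units:
    string s = "Ğ";
    assert(s.length == 2);
    writefln("%(%02X %)", cast(const(ubyte)[]) s); // prints: C4 9E
}
```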

Hope this helps.

-Steve


More information about the Digitalmars-d-learn mailing list