Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
H. S. Teoh
hsteoh at qfbox.info
Fri Dec 2 22:28:19 UTC 2022
On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via Digitalmars-d-learn wrote:
> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
>
> He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must
> be represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
> so I am confused why it doesn't fit, I don't think it was explained
> well in the book.
That's wrong: char.sizeof is exactly 1 byte, no more, no less.
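You can check this directly; a quick sketch:

```D
// The sizes of D's character types are fixed by the language.
static assert(char.sizeof  == 1);  // holds one UTF-8 code unit
static assert(wchar.sizeof == 2);  // holds one UTF-16 code unit
static assert(dchar.sizeof == 4);  // holds one UTF-32 code unit
```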
First, before we talk about Unicode, we need to get the terminology
straight:
Code unit = unit of storage in a particular representation (encoding) of
Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units,
a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT
confuse this with "code point", or worse, "character".
Code point = the abstract Unicode entity that occupies a single slot in
the Unicode tables. Usually written as U+xxx where xxx is some
hexadecimal number.
IMPORTANT NOTE: do NOT confuse a code point with what a normal
human being thinks of as a "character". Even though in many
cases a code point happens to represent a single "character",
this isn't always true. It's safer to understand a code point
as a single slot in one of the Unicode tables.
NOTE: a code point may be represented by multiple code units,
depending on the encoding. For example, in UTF-8, some code
points require multiple code units (multiple bytes) to
represent. This varies depending on the character; the code
point `A` needs only a single code unit (1 byte), but the code point
`Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In
UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in
UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units
(4 bytes).
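If you want to see these code unit counts in D itself, something along
these lines should do it (a small sketch; plain string literals are
UTF-8, `w`-suffixed literals are UTF-16):

```D
void main()
{
    // .length counts code units, NOT characters or code points
    assert("A".length  == 1);   // 1 UTF-8 code unit (1 byte)
    assert("Ш".length  == 2);   // 2 UTF-8 code units
    assert("😀".length == 4);   // 4 UTF-8 code units

    assert("Ш"w.length  == 1);  // 1 UTF-16 code unit (2 bytes)
    assert("😀"w.length == 2);  // 2 UTF-16 code units (surrogate pair)
}
```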
Note that neither code unit nor code point correspond directly with what
we normally think of as a "character". The Unicode terminology for that
is:
Grapheme = one or more code points that combine together to produce a
single visual representation. For example, the 2-code-point sequence
U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point
sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each
code point in these sequences may require multiple code units, depending
on which encoding you're using. This email is encoded in UTF-8, so the
first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes
for the second), and the second sequence occupies 6 bytes (2 bytes per
code point).
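Phobos can count all three levels for you. Roughly (a sketch using
std.uni.byGrapheme; `\u030A` is the combining ring above):

```D
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    string s = "m\u030A";   // 'm' + COMBINING RING ABOVE, i.e. the grapheme m̊

    assert(s.length == 3);                 // 3 UTF-8 code units (bytes)
    assert(s.walkLength == 2);             // 2 code points (auto-decoded dchars)
    assert(s.byGrapheme.walkLength == 1);  // but only 1 grapheme
}
```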
//
OK, now let's talk about D. In D, we have 3 "character" types (I'm
putting "character" in quotes because they are actually code units, do
NOT confuse them with visual characters): char, wchar, dchar, which are
1, 2, and 4 bytes, respectively.
To find out whether something fits into a char, first you have to find
out how many code points it occupies, and second, how many code units
are required to represent those code points. For example, the character
`À` can be represented by the single code point U+00C0. However, it
requires *two* UTF-8 code units to represent (this is a consequence of
how UTF-8 represents code points), in spite of being a value that's less
than 256. So U+00C0 would not fit into a single char; you need (at
least) 2 chars to hold it.
If we were to use UTF-16 instead, U+00C0 would easily fit into a single
code unit. Each code unit in UTF-16, however, is 2 bytes, so for some
code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.
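A small illustration of both points (assuming an `À` literal in the
source; D stores it as UTF-8 in a `string` and UTF-16 in a `wstring`):

```D
void main()
{
    string  s = "À";    // UTF-8:  0xC3 0x80
    wstring w = "À"w;   // UTF-16: 0x00C0

    assert(s.length == 2);  // needs 2 chars (UTF-8 code units)
    assert(w.length == 1);  // but only 1 wchar (UTF-16 code unit)

    dchar d = 'À';          // a single code point always fits in a dchar
}
```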
Any Unicode code point fits in a dchar, because code points only go up
to U+10FFFF (a value that fits in 3 bytes, well within dchar's 4).
HOWEVER, using dchar does NOT
guarantee that it will hold a complete visual character, because Unicode
graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above
requires at least 3 code points to represent, which means it requires at
least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however,
it occupies only 6 bytes (still the same 3 code points, just encoded
differently).
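In code, roughly (same sketch style as above):

```D
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    string  s = "\u03C0\u0306\u032F";   // the grapheme π̯̆ as UTF-8
    dstring d = "\u03C0\u0306\u032F"d;  // the same grapheme as UTF-32

    assert(s.length == 6);                 // 6 bytes in UTF-8
    assert(d.length == 3);                 // 3 dchars == 12 bytes
    assert(s.byGrapheme.walkLength == 1);  // still just 1 grapheme
}
```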
//
I hope this is clear (as mud :P -- Unicode is a complex beast). Or at
least clear*er*, anyway.
T
--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG