Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

H. S. Teoh hsteoh at qfbox.info
Fri Dec 2 22:28:19 UTC 2022


On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via Digitalmars-d-learn wrote:
> Hello (noob question),
> 
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
> 
> He then goes on to say that:
> 
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must
> be represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
> 
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
> 
> ```D
> char utf8 = 'Ğ';
> ```
> 
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
> so I am confused why it doesn't fit, I don't think it was explained
> well in the book.

That's wrong: char.sizeof is exactly 1 byte, no more, no less.
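
You can check this for yourself; here's a minimal sketch using nothing
but the built-in .sizeof property:

```D
void main()
{
    // The code-unit sizes are fixed by the language:
    static assert(char.sizeof  == 1);  // UTF-8 code unit: 1 byte
    static assert(wchar.sizeof == 2);  // UTF-16 code unit: 2 bytes
    static assert(dchar.sizeof == 4);  // UTF-32 code unit: 4 bytes
}
```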

First, before we talk about Unicode, we need to get the terminology
straight:

Code unit = unit of storage in a particular representation (encoding) of
Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units,
a UTF-16 string consists of a stream of 2-byte code units, etc.  Do NOT
confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in
the Unicode tables.  Usually written as U+xxx where xxx is some
hexadecimal number.

	IMPORTANT NOTE: do NOT confuse a code point with what a normal
	human being thinks of as a "character".  Even though in many
	cases a code point happens to represent a single "character",
	this isn't always true.  It's safer to understand a code point
	as a single slot in one of the Unicode tables.

	NOTE: a code point may be represented by multiple code units,
	depending on the encoding. For example, in UTF-8, some code
	points require multiple code units (multiple bytes) to
	represent. This varies depending on the code point; the code
	point `A` needs only a single code unit (1 byte), the code
	point `Ш` needs 2 bytes, and the code point `😀` requires 4
	bytes. In
	UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in
	UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units
	(4 bytes).
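
To make this concrete in D, here's a quick sketch (the counts below
just restate the byte counts above):

```D
void main()
{
    string  s8  = "AШ😀";    // stored as UTF-8 code units (char)
    wstring s16 = "AШ😀"w;   // stored as UTF-16 code units (wchar)
    dstring s32 = "AШ😀"d;   // stored as UTF-32 code units (dchar)

    // .length counts code units, not "characters":
    assert(s8.length  == 1 + 2 + 4);  // A: 1 byte, Ш: 2 bytes, 😀: 4 bytes
    assert(s16.length == 1 + 1 + 2);  // 😀 needs 2 UTF-16 code units (a surrogate pair)
    assert(s32.length == 3);          // one dchar per code point
}
```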

Note that neither code unit nor code point correspond directly with what
we normally think of as a "character".  The Unicode terminology for that
is:

Grapheme = one or more code points that combine together to produce a
single visual representation.  For example, the 2-code-point sequence
U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point
sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`.  Note that each
code point in these sequences may require multiple code units, depending
on which encoding you're using.  This email is encoded in UTF-8, so the
first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes
for the second), and the second sequence occupies 6 bytes (2 bytes per
code point).
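
Here's a rough (untested) sketch showing all three levels for the `m̊`
example, using std.utf.count and std.uni.byGrapheme from Phobos:

```D
import std.range : walkLength;
import std.uni   : byGrapheme;
import std.utf   : count;

void main()
{
    // U+006D U+030A: 'm' followed by a combining ring above.
    string s = "m\u030A";

    assert(s.length == 3);                // code units (UTF-8 bytes)
    assert(s.count == 2);                 // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes: one visual "character"
}
```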

//

OK, now let's talk about D.  In D, we have 3 "character" types (I'm
putting "character" in quotes because they are actually code units, do
NOT confuse them with visual characters): char, wchar, dchar, which are
1, 2, and 4 bytes, respectively.
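
This is, incidentally, exactly where your original example trips up.  A
quick sketch (the commented-out line is the one that doesn't compile):

```D
void main()
{
    //char c = 'Ğ';  // error: 'Ğ' (U+011E) doesn't fit in one UTF-8 code unit
    wchar w = 'Ğ';   // OK: fits in a single UTF-16 code unit
    dchar d = 'Ğ';   // OK: a dchar holds any single code point
    assert(w == 0x011E && d == 0x011E);
}
```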

To find out whether something fits into a char, first you have to find
out how many code points it occupies, and second, how many code units
are required to represent those code points.  For example, the character
`À` can be represented by the single code point U+00C0. However, it
requires *two* UTF-8 code units to represent (this is a consequence of
how UTF-8 represents code points), in spite of being a value that's less
than 256.  So U+00C0 does not fit into a single char; you need 2 chars
(the two UTF-8 code units 0xC3 0x80) to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single
code unit.  Each code unit in UTF-16, however, is 2 bytes, so for some
code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.
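
In code, roughly (a sketch comparing the encodings of U+00C0 and
U+0061):

```D
void main()
{
    // U+00C0 ('À') in each encoding:
    assert("À".length  == 2);  // UTF-8: two 1-byte code units (0xC3 0x80)
    assert("À"w.length == 1);  // UTF-16: one 2-byte code unit
    assert("À"d.length == 1);  // UTF-32: one 4-byte code unit

    // U+0061 ('a'): UTF-8 is the most compact here:
    assert("a".length  == 1);  // one 1-byte code unit
    assert("a"w.length == 1);  // also one code unit, but it's 2 bytes
}
```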

A dchar always fits any Unicode code point, because code points only go
up to 0x10FFFF, a value that fits in 3 bytes and therefore comfortably
within dchar's 4 bytes.  HOWEVER, using dchar does NOT
guarantee that it will hold a complete visual character, because Unicode
graphemes can be arbitrarily long.  For example, the `π̯̆` grapheme above
consists of 3 code points, which means it needs 3 dchars (== 12 bytes)
to represent. In UTF-8 encoding, however,
it occupies only 6 bytes (still the same 3 code points, just encoded
differently).
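
Or, as a sketch (using the same U+03C0 U+0306 U+032F sequence):

```D
import std.range : walkLength;
import std.uni   : byGrapheme;

void main()
{
    string  s = "\u03C0\u0306\u032F";   // π̯̆ as UTF-8
    dstring d = "\u03C0\u0306\u032F"d;  // π̯̆ as UTF-32

    assert(d.length == 3);                // 3 dchars == 12 bytes
    assert(s.length == 6);                // 6 UTF-8 code units == 6 bytes
    assert(s.byGrapheme.walkLength == 1); // still one single grapheme
}
```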

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at
least clear*er*, anyway.


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

