Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
Ali Çehreli
acehreli at yahoo.com
Fri Dec 2 22:49:49 UTC 2022
On 12/2/22 13:18, thebluepandabear wrote:
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'?
The Unicode code point of Ğ is 286 (U+011E in hexadecimal).
https://unicodeplus.com/U+011E
Since 'char' is 8 bits, it can hold only the values 0 to 255, so it cannot store 286.
At first, that sounds like a hopeless situation, making one think that Ğ
cannot be represented in a string. Encoding comes to the rescue: in
UTF-8, Ğ is encoded as 2 chars:
    import std.stdio;

    void main() {
        // foreach over a string visits each char, i.e. each UTF-8 code unit
        foreach (c; "Ğ") {
            writefln!"%b"(c);
        }
    }
That program prints
11000100
10011110
Articles like the following explain well how that second byte is a
continuation byte:
https://en.wikipedia.org/wiki/UTF-8#Encoding
(It's a continuation byte because it starts with the bits 10).
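To connect those two bytes back to 286, here is a minimal sketch that decodes a 2-byte sequence by hand. The helper name decodeTwoBytes is made up for this illustration and is not part of Phobos; it only handles the 2-byte case (lead byte of the form 110xxxxx followed by one continuation byte of the form 10xxxxxx).

```d
import std.stdio;

// Hypothetical helper for illustration only: decodes exactly one
// 2-byte UTF-8 sequence. Real code should use std.utf instead.
uint decodeTwoBytes(char lead, char cont) {
    assert((lead & 0b1110_0000) == 0b1100_0000, "not a 2-byte lead byte");
    assert((cont & 0b1100_0000) == 0b1000_0000, "not a continuation byte");
    // 5 payload bits from the lead byte, 6 from the continuation byte
    return ((lead & 0b0001_1111) << 6) | (cont & 0b0011_1111);
}

void main() {
    string s = "Ğ";
    writeln(s.length);                    // prints 2 (two code units)
    writeln(decodeTwoBytes(s[0], s[1]));  // prints 286
}
```

The lead byte 11000100 contributes the bits 00100 and the continuation byte 10011110 contributes 011110; together 00100011110 is 286.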
> I don't think it was explained well in
> the book.
Coincidentally, according to other feedback I received recently, Unicode
and UTF are introduced way too early for such a book. I agree. I didn't
understand a single thing the first time smart people tried to explain
Unicode and UTF encodings at the company where I worked, even though I
had years of programming experience back then. (Although, I now think
the instructors were not really good; and the company was pretty bad as
well. :) )
> Any help would be appreciated.
I recommend the Wikipedia page I linked above. It is enlightening to
understand how about 150K Unicode characters can be encoded with units
of 8 bits.
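As a rough illustration of that variable-length scheme, the sketch below asks Phobos how many chars each of a few sample characters needs in UTF-8. The sample characters are my own choice; dchar appears here only so that foreach decodes whole code points instead of individual code units.

```d
import std.stdio;
import std.utf : codeLength;

void main() {
    // Iterating with dchar makes foreach decode one code point at a time
    foreach (dchar c; "aĞ€😀") {
        writefln!"U+%04X needs %s char(s)"(cast(uint) c, codeLength!char(c));
    }
    // Prints:
    // U+0061 needs 1 char(s)
    // U+011E needs 2 char(s)
    // U+20AC needs 3 char(s)
    // U+1F600 needs 4 char(s)
}
```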
You can safely ignore wchar, dchar, wstring, and dstring for daily
coding. Only special programs may need to deal with those types. 'char'
and string are what we need and do use predominantly in D.
Ali