Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

Ali Çehreli acehreli at yahoo.com
Fri Dec 2 22:49:49 UTC 2022


On 12/2/22 13:18, thebluepandabear wrote:

 > But I don't really understand this? What does it mean that it 'must be
 > represented by at least 2 bytes'?

The Unicode code point of Ğ is 286 (U+011E).

   https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it can hold only the values 0 through 255, so it 
cannot store 286.
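
As a minimal sketch of that limit (the variable name is only for 
illustration, and the commented-out line is expected to be rejected by 
the compiler rather than shown to fail here):

```d
import std.stdio;

void main() {
    // char is a single UTF-8 code unit: 1 byte, values 0 through 255
    static assert(char.sizeof == 1);
    static assert(char.max == 255);

    // dchar is wide enough to hold any code point, including Ğ
    dchar g = 'Ğ';
    writeln(cast(uint) g);    // the code point as an integer: 286

    // Uncommenting the next line should be a compile-time error
    // because 286 does not fit in a char:
    // char c = 'Ğ';
}
```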

At first, that sounds like a hopeless situation, making one think that Ğ 
cannot be represented in a string. Encoding comes to the rescue: in 
UTF-8, Ğ is encoded as 2 chars (2 code units):

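import std.stdio;

void main() {
    // With an inferred element type, foreach visits each char of the
    // string, i.e. each UTF-8 code unit, not each character
    foreach (c; "Ğ") {
        writefln!"%b"(c);    // print the code unit in binary
    }
}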

That program prints

11000100
10011110

Articles like the following explain well how that second byte is a 
continuation byte:

   https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10. The first 
byte starts with the bits 110, which marks the start of a 2-byte 
sequence.)
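
To make the bit layout concrete, here is a rough sketch that decodes the 
two bytes printed above by hand. decodeTwoBytes is a hypothetical helper 
written only for this illustration; it handles just the 2-byte form and 
does not reject overlong encodings. Real code would use std.utf instead.

```d
import std.stdio;

// Hypothetical helper: decodes one 2-byte UTF-8 sequence.
dchar decodeTwoBytes(ubyte lead, ubyte continuation) {
    // A 2-byte sequence has the form 110xxxxx 10yyyyyy
    assert((lead & 0b1110_0000) == 0b1100_0000, "not a 2-byte lead");
    assert((continuation & 0b1100_0000) == 0b1000_0000,
           "not a continuation byte");

    // Keep the 5 payload bits of the lead byte and the 6 payload bits
    // of the continuation byte, then join them
    return cast(dchar)(((lead & 0b0001_1111) << 6)
                       | (continuation & 0b0011_1111));
}

void main() {
    const code = decodeTwoBytes(0b11000100, 0b10011110);
    writeln(cast(uint) code);    // 286
    assert(code == 'Ğ');
}
```

The payload bits are 00100 and 011110; joined, they are 00100011110, 
which is 286.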

 > I don't think it was explained well in
 > the book.

Coincidentally, according to other feedback I received recently, Unicode 
and UTF are introduced way too early for such a book. I agree. I did not 
understand a single thing the first time smart people tried to explain 
Unicode and UTF encodings at the company where I worked. I had years of 
programming experience back then. (Although, I now think the instructors 
were not really good; and the company was pretty bad as well. :) )

 > Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to 
understand how about 150K Unicode characters can be encoded with units 
of 8 bits.
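
As a small illustration of that idea, the following sketch asks 
std.utf.codeLength how many chars a few sample characters need in UTF-8. 
The sample characters are my own picks, not anything from the book.

```d
import std.stdio;
import std.utf : codeLength;

void main() {
    // Asking for dchar makes foreach decode whole code points
    foreach (dchar c; "aĞ€😀") {
        // codeLength!char reports the number of UTF-8 code units
        writefln!"U+%04X needs %s char(s)"(cast(uint) c,
                                           codeLength!char(c));
    }
}
```

For these four characters the counts are 1, 2, 3, and 4; UTF-8 never 
needs more than 4 chars for one code point.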

You can safely ignore wchar, dchar, wstring, and dstring for daily 
coding. Only special programs may need to deal with those types. 'char' 
and string are what we need and do use predominantly in D.

Ali



More information about the Digitalmars-d-learn mailing list