How to print unicode characters (no library)?

Patrick Schluter Patrick.Schluter at bbox.fr
Tue Dec 28 12:20:17 UTC 2021


On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
>
> I don't understand that. Based on your calculations, the 
> results should have been different. Also how are the numbers 
> fixed? Like you said the amount of bytes of each encoding is 
> not always standard for every character. Even if they were 
> fixed this means 2-bytes for each UTF-16 character and 4-bytes 
> for each UTF-32 character so still the numbers doesn't make 
> sense to me. So still the number of the "length" property 
> should have been the same for every encoding or at least for 
> UTF-16 and UTF-32. So are the sizes of every character fixed or 
> not?
>

Your string consists of 8 codepoints. The number of code units 
needed to represent them in memory depends on the encoding. D 
supports working with 3 different encodings (the Unicode standard 
defines more than these 3):

     string  utf8s  = "Hello 😂\n";
     wstring utf16s = "Hello 😂\n"w;
     dstring utf32s = "Hello 😂\n"d;

Here the canonical Unicode representation of your string

        H      e      l      l      o             😂     \n
     U+0048 U+0065 U+006C U+006C U+006F U+0020 U+1F602 U+000a

Let's see how these 3 variables are represented in memory:

     utf8s : 48 65 6C 6C 6F 20 F0 9F 98 82 0a
11 char in memory using 11 bytes

     utf16s: 0048 0065 006C 006C 006F 0020 D83D DE02 000A
9 wchar in memory using 18 bytes

     utf32s: 00000048 00000065 0000006C 0000006C 0000006F 00000020 
0001F602 0000000A
8 dchar in memory using 32 bytes
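You can check this in D itself: the .length property counts code 
units of the encoding, not characters, so it differs for each of 
the three variables. A quick sketch using the same string as above:

```d
import std.stdio;

void main()
{
    string  utf8s  = "Hello 😂\n";
    wstring utf16s = "Hello 😂\n"w;
    dstring utf32s = "Hello 😂\n"d;

    // .length counts code units, not characters
    writeln(utf8s.length);  // 11 (1-byte char code units)
    writeln(utf16s.length); // 9  (2-byte wchar code units)
    writeln(utf32s.length); // 8  (4-byte dchar code units, one per codepoint)
}
```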

As you can see, the most compact form is generally UTF-8; that's 
why it is the preferred encoding for Unicode.

UTF-16 is supported for legacy reasons: it is used in the Windows 
API and also internally in Java.

UTF-32 has one advantage, in that it has a 1-to-1 mapping between 
codepoints and array indexes. In practice it is not that much of 
an advantage, as codepoints and characters are distinct concepts 
(one character may be composed of several codepoints). UTF-32 uses 
a lot of memory for practically no benefit (when you read in the 
forum about D's big auto-decoding mistake, it is linked to this).
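That 1-to-1 mapping means indexing a dstring always lands on a 
whole codepoint, while indexing a string gives you individual 
UTF-8 bytes. A small sketch of the difference:

```d
void main()
{
    string  utf8s  = "Hello 😂\n";
    dstring utf32s = "Hello 😂\n"d;

    // In UTF-32, index 6 is the whole emoji codepoint
    assert(utf32s[6] == '😂');

    // In UTF-8, index 6 is only the first of the emoji's 4 bytes
    assert(utf8s[6] == 0xF0);

    // But even a dstring index is a codepoint index, not a
    // "character" index: combining marks, flag emoji etc. can
    // span several codepoints.
}
```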


More information about the Digitalmars-d-learn mailing list