How to print unicode characters (no library)?

rempas rempas at tutanota.com
Sun Dec 26 20:50:39 UTC 2021


Hi! I'm trying to print some Unicode characters using UTF-8 
(char), UTF-16 (wchar) and UTF-32 (dchar). I want to do this 
without any library, by invoking the "write" system call 
directly on 64-bit Linux. Only the UTF-8 version seems to work 
as expected. The other versions do not print the Unicode 
characters correctly (I'm using an emoji as an example). 
Another thing I noticed is the size of the strings. From what I 
know (and tell me if I'm mistaken), UTF-16 and UTF-32 use a fixed 
size for every character: UTF-16 uses 2 bytes (16 bits) and 
UTF-32 uses 4 bytes (32 bits), without treating any character 
specially. This doesn't seem to be the case for me, however. 
Consider my code:

```
import core.stdc.stdio;

void exit(ulong code) {
   asm {
     "syscall"
     : : "a" (60), "D" (code);
   }
}

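// Thin wrapper over the Linux x86-64 "write" system call (number 1).
void write(T)(int fd, const T buf, ulong len) {
   asm {
     "syscall"
     : : "a" (1), "D" (fd), "S" (buf), "d" (len)
     : "memory", "rcx", "r11"; // syscall overwrites rcx and r11
   }
}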

extern (C) void main() {
   string  utf8s  = "Hello 😂\n";
   write(1, utf8s.ptr, utf8s.length);

   wstring utf16s = "Hello 😂\n"w;
   write(1, utf16s.ptr, utf16s.length * 2);

   dstring utf32s = "Hello 😂\n"d;
   write(1, utf32s.ptr, utf32s.length * 4);

   printf("\nutf8s.length = %lu\nutf16s.length = %lu\nutf32s.length = %lu\n",
       utf8s.length, utf16s.length, utf32s.length);

   exit(0);
}
```

And its output:

```
Hello 😂
Hello =��
Hello �

utf8s.length = 11
utf16s.length = 9
utf32s.length = 8
```
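
To see what the terminal actually receives, the raw bytes of each 
string can be dumped in hex. This is only a minimal sketch: the 
helper names `rawBytes` and `dump` are hypothetical, it uses 
printf for brevity instead of the raw system call, and the byte 
order in the comments assumes a little-endian machine such as 
x86-64.

```d
import core.stdc.stdio : printf;

// Reinterprets a string of any code unit type as its raw bytes (no copy).
// Hypothetical helper, not part of the original program.
const(ubyte)[] rawBytes(T)(const(T)[] s) {
    return cast(const(ubyte)[]) s;
}

// Prints a label followed by each byte in hexadecimal.
void dump(const(char)* label, const(ubyte)[] bytes) {
    printf("%s:", label);
    foreach (b; bytes)
        printf(" %02X", cast(uint) b);
    printf("\n");
}

void main() {
    dump("UTF-8 ", rawBytes("😂"));  // F0 9F 98 82
    dump("UTF-16", rawBytes("😂"w)); // 3D D8 02 DE on little-endian
    dump("UTF-32", rawBytes("😂"d)); // 02 F6 01 00 on little-endian
}
```

Note that 0x3D is the ASCII code for '=', which would be 
consistent with the "=" that shows up in the UTF-16 line of the 
output.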

Now the UTF-8 string reports a length of 11 and prints normally. 
So it treats every character that is 127 or less as an ASCII 
character and uses 1 byte for it. Characters above that range 
take 2 to 4 bytes each (4 for the emoji here). So it works as I 
expected based on what I've read and seen about UTF-8 (now I 
understand why everyone loves it, lol :P)!
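
As a small check of the numbers above: in D, `.length` counts code 
units (char, wchar or dchar elements), not characters and not 
bytes. The sketch below prints both the code unit count and the 
byte size for the emoji alone. The helper `byteSize` is a 
hypothetical name introduced only for this illustration.

```d
import core.stdc.stdio : printf;

// Size in bytes of a string with any code unit type.
// Hypothetical helper, not part of the original program.
size_t byteSize(T)(T[] s) {
    return s.length * T.sizeof;
}

void main() {
    // The same single emoji in the three encodings.
    printf("UTF-8 : %zu units, %zu bytes\n", "😂".length,  byteSize("😂"));  // 4 units, 4 bytes
    printf("UTF-16: %zu units, %zu bytes\n", "😂"w.length, byteSize("😂"w)); // 2 units, 4 bytes
    printf("UTF-32: %zu units, %zu bytes\n", "😂"d.length, byteSize("😂"d)); // 1 unit, 4 bytes
}
```

The UTF-16 case takes two code units because the emoji lies 
outside the 16-bit range and is stored as a surrogate pair, so 
only UTF-32 is truly fixed-width per code point.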

Now what about the other two? I was expecting UTF-16 to report 16 
characters and UTF-32 to report 32 characters. Also, why are the 
characters not shown as expected? Isn't the "write" system 
call just writing a sequence of bytes without caring what 
they are? So if I just give it the right length, shouldn't it 
just work? I'm pretty sure that it doesn't work the way I 
expect. Does anyone have an idea?
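
One likely explanation, offered as a sketch and not a definitive 
answer: "write" passes the bytes through unchanged, and a terminal 
that expects UTF-8 (the usual case on Linux, assumed here) decodes 
UTF-16 and UTF-32 bytes as if they were UTF-8. Under that 
assumption, the strings would have to be transcoded to UTF-8 
before writing. The functions `encodeUtf8`, `utf32ToUtf8` and 
`utf16ToUtf8` below are hypothetical helpers written for this 
example, they do no validation of malformed input, and 
`core.sys.posix.unistd.write` stands in for the hand-written 
system call wrapper.

```d
import core.sys.posix.unistd : write;

// Encodes one code point as UTF-8 into buf; returns the byte count (1 to 4).
size_t encodeUtf8(uint c, ref char[4] buf) {
    if (c < 0x80) {
        buf[0] = cast(char) c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = cast(char)(0xC0 | (c >> 6));
        buf[1] = cast(char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        buf[0] = cast(char)(0xE0 | (c >> 12));
        buf[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = cast(char)(0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = cast(char)(0xF0 | (c >> 18));
    buf[1] = cast(char)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = cast(char)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = cast(char)(0x80 | (c & 0x3F));
    return 4;
}

// UTF-32 to UTF-8. dst must hold at least 4 * src.length bytes.
size_t utf32ToUtf8(const(dchar)[] src, char[] dst) {
    size_t n = 0;
    foreach (dchar c; src) {
        char[4] tmp;
        const len = encodeUtf8(c, tmp);
        dst[n .. n + len] = tmp[0 .. len];
        n += len;
    }
    return n;
}

// UTF-16 to UTF-8. dst must hold at least 3 * src.length bytes.
size_t utf16ToUtf8(const(wchar)[] src, char[] dst) {
    size_t n = 0;
    for (size_t i = 0; i < src.length; i++) {
        uint c = src[i];
        // Combine a high/low surrogate pair into one code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < src.length) {
            const uint lo = src[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i++; // the low surrogate has been consumed
            }
        }
        char[4] tmp;
        const len = encodeUtf8(c, tmp);
        dst[n .. n + len] = tmp[0 .. len];
        n += len;
    }
    return n;
}

void main() {
    char[64] buf;

    size_t n = utf16ToUtf8("Hello 😂\n"w, buf[]);
    write(1, buf.ptr, n);

    n = utf32ToUtf8("Hello 😂\n"d, buf[]);
    write(1, buf.ptr, n);
}
```

Both calls should then produce the same 11 bytes as the UTF-8 
literal, so the terminal would show the emoji in each case.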

