How to print unicode characters (no library)?
rempas
rempas at tutanota.com
Sun Dec 26 20:50:39 UTC 2021
Hi! I'm trying to print some Unicode characters using UTF-8
(char), UTF-16 (wchar) and UTF-32 (dchar). I want to do this
without using any library, by calling the "write" system call
directly on 64-bit Linux. Only the UTF-8 solution seems to
work as expected. The other solutions do not print the
Unicode characters (I'm using an emoji as an example).
Another thing I noticed is the size of the strings. From what I
know (and tell me if I'm mistaken), UTF-16 and UTF-32 use a fixed
size for their characters. UTF-16 uses 2 bytes (16 bits)
and UTF-32 uses 4 bytes (32 bits) without treating any character
specially. This doesn't seem to be the case for me however.
Consider my code:
```
import core.stdc.stdio;

// Terminate the process with the Linux x86-64 "exit" syscall (number 60).
void exit(ulong code) {
    asm {
        "syscall"
        : : "a" (60), "D" (code);
    }
}

// Write len bytes starting at buf to fd with the "write" syscall (number 1).
// The syscall instruction overwrites rcx and r11, so both are listed as clobbers.
void write(T)(int fd, const T buf, ulong len) {
    asm {
        "syscall"
        : : "a" (1), "D" (fd), "S" (buf), "d" (len)
        : "memory", "rcx", "r11";
    }
}

extern (C) void main() {
    string utf8s = "Hello 😂\n";
    write(1, utf8s.ptr, utf8s.length);

    wstring utf16s = "Hello 😂\n"w;
    write(1, utf16s.ptr, utf16s.length * 2);

    dstring utf32s = "Hello 😂\n"d;
    write(1, utf32s.ptr, utf32s.length * 4);

    printf("\nutf8s.length = %lu\nutf16s.length = %lu\nutf32s.length = %lu\n",
           utf8s.length, utf16s.length, utf32s.length);
    exit(0);
}
```
And its output:
```
Hello 😂
Hello =��
Hello �
utf8s.length = 11
utf16s.length = 9
utf32s.length = 8
```
Now the UTF-8 string reports a length of 11 and prints
normally. So it treats every character that is 127 or less as
an ASCII character and uses 1 byte for it. Characters above that
range take 2 to 4 bytes each (the emoji takes 4). So it works
as I expected based on what I've read and seen
for UTF-8 (now I understand why everyone loves it, lol :P)!
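To make those byte counts concrete, here is a minimal sketch of a hand-written UTF-8 encoder, in keeping with the no-library goal. The name `encodeUtf8` is made up for this example and is not part of my program, and the sketch assumes it is given a valid code point (at most 0x10FFFF and not a surrogate).

```d
// Hypothetical helper: encode one code point as UTF-8.
// Writes up to 4 bytes into buf and returns how many were written.
// Assumption: cp is a valid Unicode scalar value; no error checking is done.
size_t encodeUtf8(dchar cp, ref ubyte[4] buf) {
    if (cp < 0x80) {            // ASCII range: 1 byte
        buf[0] = cast(ubyte) cp;
        return 1;
    }
    if (cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        buf[0] = cast(ubyte)(0xC0 | (cp >> 6));
        buf[1] = cast(ubyte)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = cast(ubyte)(0xE0 | (cp >> 12));
        buf[1] = cast(ubyte)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = cast(ubyte)(0x80 | (cp & 0x3F));
        return 3;
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = cast(ubyte)(0xF0 | (cp >> 18));
    buf[1] = cast(ubyte)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = cast(ubyte)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = cast(ubyte)(0x80 | (cp & 0x3F));
    return 4;
}

void main() {
    ubyte[4] buf;
    assert(encodeUtf8('A', buf) == 1);
    assert(encodeUtf8('é', buf) == 2);
    assert(encodeUtf8('€', buf) == 3);
    // The emoji from my strings is U+1F602 and needs all 4 bytes.
    assert(encodeUtf8('😂', buf) == 4);
    assert(buf[0] == 0xF0 && buf[1] == 0x9F && buf[2] == 0x98 && buf[3] == 0x82);
}
```

The 6 letters and the space in "Hello ", plus the newline, take 1 byte each and the emoji takes 4, which adds up to the 11 reported above.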
Now what about the other two? I was expecting UTF-16 to report a
length of 16 and UTF-32 a length of 32. Also, why are the
characters not shown as expected? Isn't the "write" system
call just writing a sequence of characters without caring what
they are? So if I just give it the right length, shouldn't it
just work? I'm pretty sure it doesn't work the way I expect.
Does anyone have an idea?
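For reference, the numbers in the output are consistent with `.length` counting code units of the element type rather than characters or bytes. Below is a minimal sketch that checks this. The helper `byteLength` is made up for illustration, and the expected values are simply the ones from my output above.

```d
// Hypothetical helper: number of bytes occupied by a string slice.
size_t byteLength(T)(const(T)[] s) {
    return s.length * T.sizeof;
}

void main() {
    string  utf8s  = "Hello 😂\n";
    wstring utf16s = "Hello 😂\n"w;
    dstring utf32s = "Hello 😂\n"d;

    // .length counts code units (char, wchar, dchar), not characters.
    assert(utf8s.length  == 11); // 7 one-byte units + 4 units for the emoji
    assert(utf16s.length == 9);  // 7 units + 2 units for the emoji (a surrogate pair)
    assert(utf32s.length == 8);  // 7 units + 1 unit for the emoji

    // Byte sizes, which are what the "write" system call needs as its length.
    assert(byteLength(utf8s)  == 11);
    assert(byteLength(utf16s) == 18);
    assert(byteLength(utf32s) == 32);
}
```

The "write" call passes those bytes through unchanged, so what shows up on screen depends on the encoding the terminal expects, which is typically UTF-8 on Linux.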