How to print unicode characters (no library)?

Adam D Ruppe destructionator at gmail.com
Mon Dec 27 14:23:37 UTC 2021


On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
> Oh yeah. About that, I wasn't given a demonstration of how it 
> works so I forgot about it. I saw that in Unicode you can 
> combine some code points to get different results but I never 
> saw how that happens in practice.

The emoji is one example; the one you posted is two code points. 
Another common case: accented letters will SOMETIMES - there are 
exceptions - be created by the base letter followed by a 
combining accent mark.
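
Here's a tiny D sketch of that (the specific pair is my example, 
but the mechanism is the standard one):

    import std.stdio;

    void main() {
        // U+0065 (e) followed by U+0301 (combining acute accent):
        // displays as one accented letter, but is two code points
        dstring s = "e\u0301"d;
        assert(s.length == 2); // two code points
        writeln(s);            // most terminals render a single é
    }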

Some of those complicated emojis are several points with 
optional modifiers. It might be "woman" followed by a skin tone 
modifier, as in 👩🏽. Some of them are "dancing" followed by a 
skin tone followed by "male" and such.

So it displays as one thing, but it is composed of two or more 
code points, and each code point might itself be encoded as 
several code units, depending on the encoding.
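
You can count those pieces yourself; a quick D sketch, assuming 
your terminal and font can render the emoji at all:

    import std.stdio;

    void main() {
        // "woman" U+1F469 followed by skin tone modifier U+1F3FD:
        // drawn as one glyph, but two code points underneath
        dstring emoji = "\U0001F469\U0001F3FD"d;
        assert(emoji.length == 2); // two code points
        writeln(emoji);            // shows 👩🏽 on capable terminals
    }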

Again, think of it more as a little virtual machine building up a 
thing. A lot of these are actually based on combinations of old 
typewriters and old state machine terminal hardware.

Like the reason "a" followed by "backspace" followed by "_" 
might sometimes - it depends on the receiving program, this 
isn't a unicode thing - come out as an underlined a: think about 
typing that on a typewriter with a piece of paper.

The "a" gets stamped on the paper. Backspace just moves back, but 
since the "a" is already on the paper, it isn't going to be 
erased. So when you type the _, it gets stamped on the paper 
along with the a. So some programs emulate that concept.
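
You can try it yourself; whether you see an underlined a, just 
an underscore, or all three characters depends entirely on the 
receiving program:

    import std.stdio;

    void main() {
        // "a", backspace, "_": an old line printer would overstrike
        // both onto the same spot; most modern terminals just move
        // the cursor back so the "_" overwrites the "a"
        write("a\b_\n");
    }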

The emoji thing is the same basic idea (though it doesn't use 
backspace): start by drawing a woman, then modify it with a skin 
color. Or start by drawing a person, then draw another person, 
then add a skin color, then make them female, and you have a 
family emoji. Impossible to do by stamping on paper, but a 
little computer VM can understand this and build up the glyph.

> Yes, that's a great way of seeing it. I suppose that this all 
> happens under the hood and it is OS specific, so we have to 
> know how the OS we are working with works under the hood to 
> fully understand how this happens.

Well, it isn't necessarily the OS; any program can do its own 
thing. Of course, the OS can define something: Windows, for 
example, defines its text functions as UTF-16, or you can use a 
translation layer which does its own thing for a great many 
functions. But applications might still treat it differently.

For example, the xterm terminal emulator can be configured to 
use utf-8 or something else. It can be configured to interpret 
the bytes in a way that emulates certain old terminals, 
including ones that work like a printer or the state machine 
things.

> However, do you know what we do for cross compatibility then? 
> Because this sounds like a HUGE mess for real world applications

Yeah, it is a complete mess, especially on Linux. But even on 
Windows where Microsoft standardized on utf-16 for text 
functions, there's still weird exceptions. Like writing to the 
console vs piping to an application can be different. If you've 
ever written a single character to a Windows pipe and seen 
different results than if you wrote two, now you get an idea 
why... it is trying to auto-detect whether it is two-byte 
characters or one-byte streams.

I wrote a little bit about this on my public blog: 
http://dpldocs.info/this-week-in-d/Blog.Posted_2019_11_25.html

Or view the source of my terminal.d to see some of the "fun" in 
decoding all this nonsense.

http://arsd-official.dpldocs.info/arsd.terminal.html

The module there does a lot more than just the basics, but still 
most of the top half of the file is all about this stuff. Mouse 
input might be encoded as utf characters; then you gotta change 
the mode and check various detection tricks. Ugh.

> I don't understand that. Based on your calculations, the 
> results should have been different. Also how are the numbers 
> fixed? Like you said the amount of bytes of each encoding is 
> not always standard for every character. Even if they were 
> fixed this means 2-bytes for each UTF-16 character and 4-bytes 
> for each UTF-32 character, so the numbers still don't make 
> sense to me.

They're not characters, they're code points. Remember, multiple 
code points can be combined to form one character on screen.

Let's look at:

"Hello 😂\n";

This is actually a series of 8 code points:

H, e, l, l, o, <space>, <crying face>, <new line>
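
You can watch D walk through those eight points: a foreach with 
a dchar loop variable decodes the underlying code units on the 
fly:

    import std.stdio;

    void main() {
        foreach (dchar c; "Hello 😂\n")
            writefln("U+%04X", cast(uint) c);
        // prints U+0048, U+0065, ... U+1F602, U+000A - 8 lines
    }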

Those code points can themselves be encoded in three different 
ways:

dstring: encodes each code point as a single element. That's why 
the dstring's length there is 8. Each *element* of it, though, 
is 32 bits, which you can see if you cast it to ubyte[]: the 
length in bytes is 4x the length of the dstring, but 
dstring.length returns the number of units, not the number of 
bytes.
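
A minimal check of that:

    void main() {
        dstring d = "Hello 😂\n"d;
        assert(d.length == 8);                  // 8 units = 8 points
        assert((cast(ubyte[]) d).length == 32); // 4 bytes per unit
    }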

So here one unit = one point, but remember each *point* is NOT 
necessarily anything you see on screen. It represents just one 
complete instruction to the VM.

wstring: encodes each code point as one or two elements. If its 
value is below 64k (roughly the lower half of the space), it 
gets one element. If it is above that, it gets two elements (a 
"surrogate pair"), the first one just saying "the next element 
should be combined with this one".

That's why its length is 9. It kinda looks like:

H, e, l, l, o, <space>, <next element is a point in the upper 
half of the space>, <crying face>, <new line>

That "next element" unit is an additional element that is 
processed to figure out which points we get (which, again, are 
then feed into the VM thingy to be executed to actually produce 
something on string).

So when the decoder sees that "next element is a point..." 
thing, it puts it in a buffer and pulls another element off the 
stream to produce the next VM instruction. Once that arrives, 
the combined instruction gets executed and the buffer is cleared 
for the next one.

Each element in this array is 16 bits, meaning if you cast it to 
ubyte[], you'll see the length double.
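
You can look at the pair directly; the elements 0xD83D and 
0xDE02 are the two-unit UTF-16 encoding of the single point 
U+1F602:

    import std.stdio;

    void main() {
        wstring w = "Hello 😂\n"w;
        assert(w.length == 9); // 9 UTF-16 units for 8 code points
        // the last three units: the emoji's pair, then the newline
        writefln("%04X %04X %04X",
            cast(uint) w[6], cast(uint) w[7], cast(uint) w[8]);
        // prints: D83D DE02 000A
    }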

Finally, there's "string", which is utf-8, meaning each element 
is 8 bits, but again, there is a buffer you need to build up to 
get the code points you feed into that VM.

Like we saw with 16 bits, there are now additional elements that 
tell you when a code point spills across several of them. Any 
value < 128 gets a single element; the next range gets two 
elements you recombine with some bit shifts and bitwise-ors; 
then another range takes three elements, and a final range takes 
four. The first element tells you how many more elements you 
need to build up the code point buffer.

H, e, l, l, o, <space>, <next point is combined by these bits 
PLUS THREE MORE elements>, <this is a work-in-progress element 
and needs two more>, <this is a work-in-progress element and 
needs one more>, <this is the final work-in-progress element>, 
<new line>


And now you see why it came to length == 11 - that emoji's code 
point needed enough bits that it had to be spread across 4 
bytes.
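
Here's a sketch of that recombination, pulling the emoji's four 
bytes out of the string and doing the shifts and ors by hand:

    void main() {
        string s = "Hello 😂\n";
        assert(s.length == 11); // 11 UTF-8 code units

        // the emoji occupies bytes 6 through 9
        ubyte[] b = cast(ubyte[]) s[6 .. 10];
        ubyte[] expected = [0xF0, 0x9F, 0x98, 0x82];
        assert(b == expected);

        // 3 payload bits from the lead byte, 6 from each continuation
        uint point = ((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12)
                   | ((b[2] & 0x3F) <<  6) |  (b[3] & 0x3F);
        assert(point == 0x1F602); // the crying face code point
    }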

Notice how the first element told you how many more are coming, 
and each following element is marked as a continuation. This is 
encoded into the bit patterns and is part of why it took 4 
elements instead of just three; there's some error-checking 
redundancy in there. This is a nice part of the design, allowing 
you to validate a utf-8 stream more reliably and even recover if 
you jumped somewhere into the middle of a multi-byte sequence.
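
That recovery trick looks like this: continuation elements 
always match the bit pattern 10xxxxxx, so you can back up until 
you hit something that doesn't:

    void main() {
        string s = "Hello 😂\n";
        size_t i = 8; // pretend we landed mid-sequence
        // (b & 0xC0) == 0x80 identifies a continuation element
        while (i > 0 && (s[i] & 0xC0) == 0x80)
            i--;
        assert(i == 6); // the lead byte of the emoji's sequence
    }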

But anyway, that's kinda an implementation detail - the big 
point here is just that each element of the string array may 
need to be recombined with others to make the unicode code 
points. Then, the unicode code points are instructions that are 
fed into a VM kind of thing to actually produce output, and the 
result will sometimes vary depending on the program doing the 
interpreting.

So the layers are:

1) bytes build up into string/wstring/dstring array elements (aka 
"code units")
2) those code unit element arrays are decoded into code point 
instructions
3) those code point instructions are run to produce output.
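
In D, each of those layers has a direct spelling; here's a 
sketch using Phobos, where byGrapheme approximates what that 
little VM finally draws:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main() {
        string s = "e\u0301"; // e + combining acute accent
        assert(s.length == 3);                // 1: code units
        assert(s.walkLength == 2);            // 2: code points
        assert(s.byGrapheme.walkLength == 1); // 3: one drawn thing
    }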

(or of course when you get to a human reader, they can interpret 
it differently too but obviously human language is a whole other 
mess lol)

