How to print unicode characters (no library)?

rempas rempas at tutanota.com
Mon Dec 27 07:12:24 UTC 2021


On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
> write just transfers a sequence of bytes. It doesn't know nor 
> care what they represent - that's for the receiving end to 
> figure out.
>
Oh, so it was as I expected :P

> You are mistaken. There's several exceptions, utf-16 can come 
> in pairs, and even utf-32 has multiple "characters" that 
> combine onto one thing on screen.
>
Oh yeah. About that, I was never shown a demonstration of how it 
works, so I forgot about it. I saw that in Unicode you can combine 
some code points to get different results, but I never saw how 
that happens in practice. If you combine two code points, you get 
a different glyph. So yeah, that's one thing I don't 
understand...
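
For example, from what I've read, something like this minimal D 
sketch should show it (assuming the terminal renders combining 
marks; U+0301 is the combining acute accent):

import std.stdio : writeln;

void main()
{
    // 'a' followed by U+0301 (combining acute accent): two code
    // points that should render as a single "á" on screen.
    writeln("a\u0301");
}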

> I prefer to think of a string as a little virtual machine that 
> can be run to produce output rather than actually being 
> "characters". Even with plain ascii, consider the backspace 
> "character" - it is more an instruction to go back than it is a 
> thing that is displayed on its own.
>
Yes, that's a great way of seeing it. I suppose this all happens 
under the hood and is OS-specific, so we would have to know how 
the OS we are working with operates under the hood to fully 
understand how this happens. Also, the idea of some "characters" 
being "instructions" is very interesting. Now, from what I've 
seen, non-printable characters are always instructions (except 
for the "space" character), so another way to think about this 
is that every character carries one instruction: either to get 
written (displayed) in the output, or to do some other 
modification to the text without getting displayed itself as a 
character. Of course, I don't suppose that's what's happening 
under the hood, but it's an interesting way to describe it.
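
Here's a quick sketch of the backspace "instruction" in D (what 
actually shows up depends on the terminal; on most of them the 
'd' overwrites the 'c' and you see "abd"):

import std.stdio : write, writeln;

void main()
{
    // '\b' moves the cursor back one column; it prints nothing
    // by itself.
    write("abc\b");
    writeln("d");
}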

> This is because the *receiving program* treats them as utf-8 
> and runs it accordingly. Not all terminals will necessarily do 
> this, and programs you pipe to can do it very differently.
>
That's pretty interesting, actually. Terminals (and don't forget 
shells) are programs themselves, so they choose the encoding 
themselves. However, do you know how we handle cross-compatibility 
then? Because this sounds like a HUGE mess for real-world 
applications.
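
(For what it's worth, one heuristic I've seen on POSIX systems is 
to read the locale environment variables to guess which encoding 
the terminal expects. Just a sketch, not a real portability 
answer; Windows consoles, for example, work differently:)

import std.process : environment;
import std.stdio : writeln;

void main()
{
    // Check the locale variables in priority order; they usually
    // name the expected encoding, e.g. "en_US.UTF-8".
    auto hint = environment.get("LC_ALL",
                environment.get("LC_CTYPE",
                environment.get("LANG", "unknown")));
    writeln("locale hint: ", hint);
}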

> The [w|d|]string.length function returns the number of elements 
> in there, which is bytes for string, 16 bit elements for 
> wstring (so bytes / 2), or 32 bit elements for dstring (so 
> bytes / 4).
>
> This is not necessarily related to the number of characters 
> displayed.
>
I don't understand that. Based on your calculations, the results 
should have been different. Also, how are the numbers fixed? Like 
you said, the number of bytes per character in each encoding is 
not always the same. Even if it were fixed, that would mean 
2 bytes for each UTF-16 character and 4 bytes for each UTF-32 
character, so the numbers still don't make sense to me. The 
"length" property should then have reported the same count for 
every encoding, or at least for UTF-16 and UTF-32. So are the 
sizes of every character fixed or not?
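
For reference, here is what the three lengths look like for a 
single code point outside the BMP (a minimal sketch; the comments 
give the counts I'd expect from the explanation above):

import std.stdio : writeln;

void main()
{
    // U+1F600 (grinning face emoji) lies outside the BMP:
    writeln("\U0001F600".length);   // 4 -- four UTF-8 code units (bytes)
    writeln("\U0001F600"w.length);  // 2 -- a UTF-16 surrogate pair
    writeln("\U0001F600"d.length);  // 1 -- a single UTF-32 code unit
}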

Damn, you guys should get paid for the help you are giving in 
this forum!

