How to print unicode characters (no library)?
rempas
rempas at tutanota.com
Mon Dec 27 07:12:24 UTC 2021
On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
> write just transfers a sequence of bytes. It doesn't know nor
> care what they represent - that's for the receiving end to
> figure out.
>
Oh, so it was as I expected :P
> You are mistaken. There's several exceptions, utf-16 can come
> in pairs, and even utf-32 has multiple "characters" that
> combine onto one thing on screen.
>
Oh yeah. About that, I was never shown a demonstration of how it
works, so I forgot about it. I saw that in Unicode you can combine
some code points to get different results, but I never saw how
that happens in practice. If you combine two code points, you get
a different glyph. So yeah, that's one thing I don't
understand...
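
Here is a minimal D sketch of both cases; the specific code
points (U+0301 and U+1F600) are just illustrative picks:

import std.stdio;

void main()
{
    // "e" followed by U+0301 (COMBINING ACUTE ACCENT) renders as a
    // single glyph "é", even though it is two separate code points.
    string combined = "e\u0301";
    writeln(combined);        // displays as one glyph: é
    writeln(combined.length); // 3 (UTF-8 bytes: 1 for 'e', 2 for U+0301)

    // U+1F600 is one code point, but it does not fit in a single
    // 16-bit code unit, so UTF-16 stores it as a surrogate pair.
    wstring emoji = "\U0001F600"w;
    writeln(emoji.length);    // 2 (two UTF-16 code units)
}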
> I prefer to think of a string as a little virtual machine that
> can be run to produce output rather than actually being
> "characters". Even with plain ascii, consider the backspace
> "character" - it is more an instruction to go back than it is a
> thing that is displayed on its own.
>
Yes, that's a great way of seeing it. I suppose this all happens
under the hood and is OS specific, so we would have to know how
the OS we are working with behaves under the hood to fully
understand how this happens. Also, the idea of some "characters"
being "instructions" is very interesting. From what I've seen,
non-printable characters are always instructions (except for the
"space" character), so another way to think about this is that
every character carries one instruction: either to be written
(displayed) as-is, or to perform some other modification to the
text without being displayed itself as a character. Of course, I
don't suppose that's what actually happens under the hood, but
it's an interesting way of describing it.
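
The ASCII backspace Adam mentioned makes that concrete; a tiny
sketch (the exact rendering depends on your terminal):

import std.stdio;

void main()
{
    // '\b' (ASCII 0x08) is an "instruction" rather than a visible
    // character: it tells the terminal to move the cursor back one
    // column. Most terminals therefore display this as "13",
    // because the '3' overwrites the '2'.
    write("12\b3\n");
}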
> This is because the *receiving program* treats them as utf-8
> and runs it accordingly. Not all terminals will necessarily do
> this, and programs you pipe to can do it very differently.
>
That's pretty interesting actually. Terminals (and don't forget
shells) are programs themselves, so they choose the encoding
themselves. However, do you know what we do for cross
compatibility then? Because this sounds like a HUGE mess for real
world applications.
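
To see that a program really just hands bytes over and the
receiver decides what they mean, here is a small sketch using
std.stdio's rawWrite to bypass any string handling:

import std.stdio;

void main()
{
    // 0xC3 0xA9 is the UTF-8 encoding of "é". rawWrite hands the
    // raw bytes to stdout; a UTF-8 terminal renders "é", while a
    // Latin-1 terminal would render the same bytes as "Ã©".
    ubyte[] bytes = [0xC3, 0xA9, 0x0A]; // 0x0A = '\n'
    stdout.rawWrite(bytes);
}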
> The [w|d|]string.length function returns the number of elements
> in there, which is bytes for string, 16 bit elements for
> wstring (so bytes / 2), or 32 bit elements for dstring (so
> bytes / 4).
>
> This is not necessarily related to the number of characters
> displayed.
>
I don't understand that. Based on your calculations, the results
should have been different. Also, how are the numbers fixed? Like
you said, the number of bytes per character is not the same in
every encoding. Even if it were fixed, that would mean 2 bytes
for each UTF-16 character and 4 bytes for each UTF-32 character,
so the numbers still don't make sense to me. In that case the
"length" property should have been the same for every encoding,
or at least for UTF-16 and UTF-32. So are the sizes of every
character fixed or not?
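
For what it's worth, here is a small sketch of what .length
actually counts for each string type (the "hö" sample is just an
illustrative pick):

import std.stdio;

void main()
{
    // .length counts code units, not displayed characters.
    string  s = "hö"; // UTF-8:  'h' = 1 byte, 'ö' = 2 bytes
    wstring w = "hö"; // UTF-16: each fits in one 16-bit code unit
    dstring d = "hö"; // UTF-32: one 32-bit code unit per code point
    writeln(s.length); // 3
    writeln(w.length); // 2
    writeln(d.length); // 2
}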
Damn, you guys should get paid for the help you are giving in
this forum.