Why UTF-8/16 character encodings?

Diggory diggsey at googlemail.com
Sat May 25 11:56:41 PDT 2013


"limited success of UTF-8"

Becoming the de facto standard encoding EVERYWHERE except for 
Windows, which uses UTF-16, is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to 
decode and encode, fast and space-efficient. Fixed-width 
encodings are not inherently fast; the only thing they are faster 
at is randomly accessing the Nth character instead of the Nth 
byte. In the rare cases where you need to do a lot of this kind 
of random access, there is UTF-32...
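
To make the random-access point concrete, here is a rough Python 
sketch. The function names and the sample string are made up for 
illustration, and it assumes the input is valid UTF-8 or UTF-32-LE. 
Finding the Nth character in UTF-8 means scanning from the start, 
while in UTF-32 it is a plain index calculation.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 `data` by scanning from the start."""
    if n < 0:
        raise IndexError(n)
    seen = -1      # index of the most recently seen code point
    start = None   # byte offset where the n-th code point begins
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte: a new code point starts here
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # n-th code point was the last one


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE `data` by direct indexing."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve 日本語 😀"
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")
```

Both functions return the same character for the same index; only 
the cost of getting there differs (linear scan versus constant 
time).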

Any fixed-width encoding that can encode every Unicode character 
must use at least 3 bytes per character (code points go up to 
U+10FFFF, which needs 21 bits), and using 4 bytes is probably 
going to be faster because of alignment, so I don't see what the 
great improvement over UTF-32 is going to be.
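
The arithmetic behind the 3-byte minimum, plus a quick size 
comparison, as a small Python sketch. The helper name 
`encoded_sizes` and the sample strings are just for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

bits_needed = MAX_CODE_POINT.bit_length()  # 21 bits
bytes_needed = (bits_needed + 7) // 8      # rounds up to 3 bytes


def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (no BOM, little-endian)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_sizes = encoded_sizes("hello")  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
cjk_sizes = encoded_sizes("日本語")    # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```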

> slicing does require decoding
Nope.
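
A sketch of why, in Python, assuming valid UTF-8 input (the 
helper names are mine): continuation bytes always look like 
10xxxxxx, so you can tell whether a byte offset is a character 
boundary by inspecting a single byte, and a slice taken between 
two boundaries is itself valid UTF-8.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """True if byte offset `i` falls between two code points of valid UTF-8."""
    if i == 0 or i == len(data):
        return True
    if not 0 < i < len(data):
        return False
    return (data[i] & 0xC0) != 0x80  # continuation bytes are 10xxxxxx


def snap_back(data: bytes, i: int) -> int:
    """Move an arbitrary byte offset back to the nearest boundary at or before it."""
    i = max(0, min(i, len(data)))
    while not is_boundary(data, i):
        i -= 1
    return i


u8 = "日本語".encode("utf-8")  # 3 bytes per character, 9 bytes total
first = u8[:snap_back(u8, 4)]                  # the bytes of "日"
second = u8[snap_back(u8, 4):snap_back(u8, 8)]  # the bytes of "本"
```

Nothing here ever computes a code point; the slices are taken on 
the raw bytes.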

> I didn't mean that people are literally keeping code pages.  I 
> meant that there's not much of a difference between code pages 
> with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different 
planes exist only for organisational purposes; they don't affect 
how characters are encoded.

> ?!  It's okay because you deem it "coherent in its scheme?"  I 
> deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something 
completely different... Coherent means that you store related 
things together, i.e. everything that you need to decode a 
character is in the same place, not spread out between part of a 
character and a header.
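
As an illustration of "everything in the same place", here is a 
small Python sketch (the function name is hypothetical): the lead 
byte of a UTF-8 sequence alone tells you how many bytes the 
character occupies, with no external header needed.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 are never valid)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped so the result stays <= U+10FFFF
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


# Predicted length from the lead byte vs. actual encoded length.
lengths = [(sequence_length(ch.encode("utf-8")[0]), len(ch.encode("utf-8")))
           for ch in "aé日😀"]
```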

> but I suspect substring search not requiring decoding is the 
> exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some 
transformation that depends on the code point, such as converting 
case or identifying which character class a particular character 
belongs to. Appending, slicing, copying, searching, replacing, 
etc. (basically all the most common text operations) can be done 
without any encoding or decoding.
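
A quick Python sketch of that, operating purely on bytes. The 
sample strings are made up, and it assumes both the text and the 
search pattern are valid UTF-8. Because lead bytes and 
continuation bytes never overlap, a byte-level match of a valid 
pattern always starts on a character boundary.

```python
haystack = "Grüße, 世界! Grüße again".encode("utf-8")
needle = "世界".encode("utf-8")

# Searching: plain byte search, no code points involved.
pos = haystack.find(needle)
before = haystack[:pos]  # slicing at the match offset is safe

# Counting and replacing work the same way.
greetings = haystack.count("Grüße".encode("utf-8"))
replaced = haystack.replace(needle, "мир".encode("utf-8"))

# Appending is simple concatenation of the byte strings.
appended = haystack + " ...".encode("utf-8")
```

The only calls that touch the encoding here are the ones that 
build the byte strings from Python literals; `find`, `count`, 
`replace` and `+` never decode anything.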


More information about the Digitalmars-d mailing list