Why UTF-8/16 character encodings?
Diggory
diggsey at googlemail.com
Sat May 25 11:56:41 PDT 2013
"limited success of UTF-8"
Becoming the de facto standard encoding EVERYWHERE except for
Windows, which uses UTF-16, is hardly a failure...
I really don't understand your hatred for UTF-8 - it's simple to
decode and encode, fast and space-efficient. Fixed-width
encodings are not inherently faster; the only thing they are faster
at is random access to the Nth character instead of
the Nth byte. In the rare cases where you need to do a lot of this
kind of random access, there is UTF-32...
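To make the random-access trade-off concrete, here is a minimal sketch. The function names are made up for illustration, "character" is taken to mean a Unicode code point, and the input is assumed to be valid UTF-8. Finding the Nth code point in UTF-8 means scanning past continuation bytes, while in UTF-32 it is a multiplication:

```python
def utf8_char_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based).

    Assumes `data` is valid UTF-8. Continuation bytes always look like
    0b10xxxxxx, so any other byte starts a new code point.
    """
    count = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # lead byte or ASCII byte
            if count == n:
                return offset
            count += 1
    raise IndexError("code point index out of range")


def utf32_char_offset(n: int) -> int:
    """In UTF-32 every code point occupies exactly 4 bytes."""
    return 4 * n
```

The UTF-8 version is a linear scan, but it only inspects the top two bits of each byte; it never reconstructs a code point value.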
Any fixed-width encoding that can encode every Unicode character
must use at least 3 bytes, and using 4 bytes is probably going to
be faster because of alignment, so I don't see what the great
improvement over UTF-32 is going to be.
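The 3-byte lower bound follows from the size of the Unicode code space, whose highest code point is U+10FFFF. A quick sketch of the arithmetic (the variable names are just for illustration):

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines

# Bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Round up to whole bytes.
bytes_needed = (bits_needed + 7) // 8

# UTF-32 pads this to 4 bytes per code point; "utf-32-le" emits no BOM.
utf32_len = len("a€😀".encode("utf-32-le"))
```

Here `bits_needed` comes out to 21 and `bytes_needed` to 3, while the three-character sample string takes 12 bytes in UTF-32.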
> slicing does require decoding
Nope.
> I didn't mean that people are literally keeping code pages. I
> meant that there's not much of a difference between code pages
> with 2 bytes per char and the language character sets in UCS.
Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.
> ?! It's okay because you deem it "coherent in its scheme?" I
> deem headers much more coherent. :)
Sure, if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, i.e. everything that you need to decode a
character is in the same place, not spread out between part of a
character and a header.
> but I suspect substring search not requiring decoding is the
> exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some
transformation that depends on the code point, such as converting
case or identifying which character class a particular character
belongs to. Appending, slicing, copying, searching, replacing,
etc. (basically all of the most common text operations) can be
done without any encoding or decoding.
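As a rough illustration of that last point, the sketch below runs search, slice, replace and append directly on UTF-8 byte strings. The sample strings are made up, and both haystack and needle are assumed to be valid UTF-8. Because lead bytes and continuation bytes never overlap, a valid needle can only match at a character boundary, so the byte offsets returned by the search are safe places to slice:

```python
# The encode() calls only build the sample data; every operation
# below works on raw bytes and never decodes a code point.
text = "naïve café, 日本語 text".encode("utf-8")
needle = "日本語".encode("utf-8")

# Substring search: plain byte comparison.
start = text.find(needle)
end = start + len(needle)

# Slicing at offsets produced by the search stays on character boundaries.
sliced = text[start:end]

# Replacing and appending are byte-level operations as well.
replaced = text.replace("café".encode("utf-8"), "tea".encode("utf-8"))
appended = text + " ok".encode("utf-8")
```

Slicing at an arbitrary byte offset that did not come from such a search could still land inside a multi-byte sequence, which is why the offsets here come from the search itself.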