Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 04:07:53 PDT 2013


On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev 
wrote:
> You don't need to do that to slice a string. I think you mean 
> to say that you need to decode each character if you want to 
> slice the string at the N-th code point? But this is exactly 
> what I'm trying to point out: how would you find this N? How 
> would you know if it makes sense, taking into account combining 
> characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point; what 
other way would you slice it and have the result make any 
sense?  Finding the N-th code point is much simpler with a 
constant-width encoding.
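The difference can be sketched in Python (the helper name here is
invented for illustration): in UTF-8 each code point occupies one
to four bytes, so locating the N-th one is a linear scan, whereas
a constant-width encoding reduces it to one multiplication.

```python
# Finding the byte offset of the N-th code point in UTF-8 requires
# a linear scan, because code points are 1-4 bytes wide.
def utf8_index_of_nth_codepoint(data: bytes, n: int) -> int:
    count = 0
    for i, b in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes,
        # never the start of a code point.
        if b & 0xC0 != 0x80:
            if count == n:
                return i
            count += 1
    raise IndexError(n)

# With a constant-width encoding such as UTF-32, the same lookup
# is simply: offset = n * 4.
utf8 = "naïve".encode("utf-8")            # 6 bytes, 5 code points
print(utf8_index_of_nth_codepoint(utf8, 3))  # 4: 'ï' took two bytes
```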

I was leaving aside combining characters and the intrinsic 
language complexities baked into Unicode in my previous 
analysis, but if you want to bring those in, that's actually an 
argument in favor of my encoding.  With my encoding, you know up 
front whether you're using languages that have such complexity 
(just check the header), whereas with a chunk of random UTF-8 
text you can never know that unless you decode the entire string 
once and extract knowledge of all the languages embedded in it.
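To make the UTF-8 side of that concrete, here is a rough Python
sketch of the full-decode cost: without a script inventory up
front, every code point must be inspected just to learn which
scripts a string contains. (`unicodedata.name` is a crude
stand-in for a real script-property lookup, since the Python
standard library has no direct "script of this character"
function.)

```python
import unicodedata

# With raw UTF-8 there is no script inventory up front: to learn
# which scripts a string contains, you must decode and classify
# every single code point.
def scripts_in(s: str) -> set[str]:
    # The first word of the Unicode character name roughly
    # identifies the script (LATIN, CJK, CYRILLIC, ...).
    return {unicodedata.name(c).split()[0] for c in s if not c.isspace()}

print(scripts_in("hello 日本語"))  # {'LATIN', 'CJK'}
```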

For another similar example, let's say you want to run toUpper on 
a multi-language string, which contains English in the first half 
and some Asian script that doesn't define uppercase in the second 
half.  With my format, toUpper can check the header, then process 
the English half and skip the Asian half (I'm assuming that the 
substring indices for each language would be stored in this more 
complex header).  With UTF-8, you have to process the entire 
string, because you never know what random languages might be 
packed in there.
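A minimal Python sketch of that header idea, with every name and
the run-based layout invented here purely for illustration (the
post does not specify a concrete format): each header entry
records a script and the slice of the payload it covers, so a
case-mapping pass can skip caseless runs entirely.

```python
from dataclasses import dataclass

# Hypothetical header entry: one run of a single script.
@dataclass
class Run:
    script: str   # e.g. "Latin", "Han"
    start: int    # payload slice covered by this run
    end: int

def to_upper(runs: list[Run], payload: list[str]) -> list[str]:
    out = list(payload)
    for run in runs:
        # Scripts without a case distinction are skipped without
        # ever touching their characters.
        if run.script == "Latin":
            out[run.start:run.end] = [c.upper()
                                      for c in payload[run.start:run.end]]
    return out

payload = list("hello 日本語")
runs = [Run("Latin", 0, 6), Run("Han", 6, 9)]
print("".join(to_upper(runs, payload)))  # HELLO 日本語
```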

UTF-8 is riddled with such performance bottlenecks, all to make 
it self-synchronizing.  But is anybody really using its less 
compact encoding to do some "self-synchronized" integrity 
checking?  I suspect almost nobody is.
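For reference, the property being dismissed here is easy to show
in a few lines of Python: because continuation bytes are always
of the form 0b10xxxxxx, a reader dropped at an arbitrary byte
offset can find the next code point boundary locally, without
decoding from the start of the string.

```python
# Self-synchronization: from any byte offset, skip continuation
# bytes (0b10xxxxxx) to reach the next code point boundary.
def resync(data: bytes, pos: int) -> int:
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
# Landing mid-way through the two-byte 'é' (byte offset 2)...
print(resync(data, 2))  # 3: the 'l' starting the next code point
```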

> If you want to split a string by ASCII whitespace (newlines, 
> tabs and spaces), it makes no difference whether the string is 
> in ASCII or UTF-8 - the code will behave correctly in either 
> case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode 
while splitting, when compared to a single-byte encoding.

>> You cannot honestly look at those multiple state diagrams and 
>> tell me it's "simple."
>
> I meant that it's simple to implement (and adapt/port to other 
> languages). I would say that UTF-8 is quite cleverly designed, 
> so I wouldn't say it's simple by itself.
Perhaps decoding is not so bad for the type of people who write 
the fundamental UTF-8 libraries.  But implementation does not 
refer merely to the UTF-8 libraries themselves, but also to all 
the code that builds on them for internationalized apps.  And 
with all the unnecessary additional complexity added by UTF-8, 
wrapping the average programmer's head around this mess likely 
leads to as many problems as broken code-page implementations 
did back in the day. ;)
