Why UTF-8/16 character encodings?

Vladimir Panteleev vladimir at thecybershadow.net
Sat May 25 03:33:11 PDT 2013


On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
>> Can you post some specific cases where the benefits of a 
>> constant-width encoding are obvious and, in your opinion, make 
>> constant-width encodings more useful than all the benefits of 
>> UTF-8?
> Let's take one you listed above, slicing a string.  You have to 
> either translate your entire string into UTF-32 so it's 
> constant-width, which is apparently what Phobos does, or decode 
> every single UTF-8 character along the way, every single time.  
> A constant-width, single-byte encoding would be much easier to 
> slice, while still using at most half the space.

You don't need to do that to slice a string. I think you mean to 
say that you need to decode each character if you want to slice 
the string at the N-th code point? But this is exactly what I'm 
trying to point out: how would you find this N? How would you 
know if it makes sense, taking into account combining characters, 
and all the other complexities of Unicode?
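To make the combining-character point concrete, here is a minimal 
sketch in Python (my choice of language for illustration; nothing 
in this thread is Python-specific). Python's str is indexed by 
code point, so it stands in for a constant-width encoding here. 
The sample strings and variable names are made up for the example.

```python
import unicodedata

# The same visible word, "café", spelled two ways.
composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Python indexes str by code point, i.e. it behaves like a
# constant-width encoding. The two spellings still differ in length.
composed_length = len(composed)
decomposed_length = len(decomposed)

# Slicing "the first 4 characters" by code point index keeps the
# accent in one spelling and silently drops it in the other.
composed_slice = composed[:4]
decomposed_slice = decomposed[:4]

# Encoding as UTF-32 changes nothing: it is still 5 code units of 4 bytes.
utf32_size = len(decomposed.encode("utf-32-le"))

# Normalizing to NFC merges this particular pair, but not every
# base + combining sequence has a precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
```

So even with O(1) indexing, N = 4 is only a sensible slice point 
if you already know how the text was composed, which is the part 
a constant-width encoding does not solve.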

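If you want to split a string by ASCII whitespace (newlines, tabs 
and spaces), it makes no difference whether the string is in 
ASCII or UTF-8 - the code will behave correctly in either case, 
even though UTF-8 is a variable-width encoding. Every byte of a 
multi-byte UTF-8 sequence has its high bit set, so none of them 
can be mistaken for an ASCII character.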
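As a rough illustration, the following Python sketch splits raw 
bytes on ASCII whitespace without decoding anything. The function 
name split_ascii_whitespace and the sample text are hypothetical, 
chosen only for this example. It relies on the UTF-8 property 
that bytes below 0x80 never occur inside a multi-byte sequence.

```python
ASCII_WHITESPACE = b" \t\n"  # space, tab, newline

def split_ascii_whitespace(data):
    """Split raw bytes on ASCII whitespace; no decoding is performed."""
    fields = []
    start = None  # index where the current field began, if any
    for i, byte in enumerate(data):
        if byte in ASCII_WHITESPACE:
            if start is not None:
                fields.append(data[start:i])
                start = None
        elif start is None:
            start = i
    if start is not None:
        fields.append(data[start:])
    return fields

# Works on plain ASCII input.
ascii_fields = split_ascii_whitespace(b"one two\tthree\n")

# Works unchanged on UTF-8 input containing 2- and 3-byte sequences.
text = "na\u00efve caf\u00e9\t\u0434\u043e\u043c\n\u6771\u4eac"
utf8_fields = split_ascii_whitespace(text.encode("utf-8"))

# Every field is still valid UTF-8, so decoding afterwards succeeds.
decoded = [field.decode("utf-8") for field in utf8_fields]
```

The loop looks only at individual bytes, yet no field is ever cut 
in the middle of a character.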

> You cannot honestly look at those multiple state diagrams and 
> tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other 
languages). UTF-8 is quite cleverly designed, so I wouldn't call 
the design itself simple.

