Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sun May 26 02:52:14 PDT 2013


On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
> Runs away in horror :) It's a mess even before you've got to 
> the details.
Perhaps it's fatally flawed, but I don't see an argument for 
why it would be, so I'll assume you can't find such a flaw.  It 
is still _much less_ messy than UTF-8; that is the critical 
distinction.

> Another point about using sometimes a 2-byte encoding - welcome 
> to the nice world of BigEndian/LittleEndian i.e. the very trap 
> UTF-16 has stepped into.
I don't think this is a sizable obstacle.  It takes some 
coordination, but it is a minor issue.
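
To put the coordination in concrete terms: a two-byte BOM at 
the start of the stream settles the byte order.  A minimal 
sketch in D (isBigEndianUtf16 and its no-BOM fallback are my 
own choices here, not anything standardized):

import std.stdio;

// Decide UTF-16 byte order from a leading BOM.  U+FEFF
// serialized big-endian is [0xFE, 0xFF]; little-endian is
// [0xFF, 0xFE].
bool isBigEndianUtf16(const(ubyte)[] bytes)
{
    if (bytes.length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return true;
    if (bytes.length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return false;
    // No BOM: the Unicode standard says to assume big-endian for
    // plain "UTF-16", though many systems default to host order.
    return true;
}

void main()
{
    ubyte[] be = [0xFE, 0xFF, 0x00, 0x41]; // "A" in UTF-16BE, with BOM
    writeln(isBigEndianUtf16(be)); // true
}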

On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
> You obviously are not thinking it through. Such an encoding 
> would have O(n^2) complexity for appending a character/symbol 
> in a different language to the string, since you would have to 
> update the beginning of the string and move the contents 
> forward to make room. Not to mention that it wouldn't be 
> backwards compatible with ASCII routines, and the complexity 
> of such a header would have to be carried all the way to the 
> font rendering routines in the OS.
You obviously have not read the rest of the thread; both your 
non-font-related assertions were addressed earlier.  I see no 
reason why a single-byte encoding of UCS would have to be 
carried to "font rendering routines" but UTF-8 wouldn't be.

> Multiple languages/symbols in one string is a blessing of 
> modern humane computing. It is the norm more than the exception 
> in most of the world.
I disagree, but in any case, most of this thread refers to 
multi-language strings.  The argument is about how best to encode 
them.

On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
> On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander 
>> wrote:
>>> I suggest you read up on UTF-8. You really don't understand 
>>> it. There is no need to decode, you just treat the UTF-8 
>>> string as if it is an ASCII string.
>> Not being aware of this shortcut doesn't mean not 
>> understanding UTF-8.
>
> It's not just a shortcut, it is absolutely fundamental to the 
> design of UTF-8. It's like saying you understand Lisp without 
> being aware that everything is a list.
It is an accidental shortcut, a byproduct of the encoding 
scheme chosen for UTF-8, and, as I've noted, it is still less 
efficient than the same search over a single-byte encoding.  
The fact that you keep trumpeting this silly detail as somehow 
"fundamental" suggests you have no idea what you're talking 
about.
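
To spell out the shortcut in dispute: every byte of a UTF-8 
multi-byte sequence has its high bit set, so an ASCII needle 
can never falsely match inside a multi-byte character and can 
be searched for byte-by-byte, with no decoding.  A minimal 
sketch in D (findAscii is a hypothetical helper, not a library 
function):

import std.stdio;
import std.string : representation;

// Naive byte-level substring search.  Safe for ASCII needles
// in UTF-8 text because continuation and lead bytes are all
// >= 0x80, so they can never equal an ASCII byte.
ptrdiff_t findAscii(const(ubyte)[] haystack, const(ubyte)[] needle)
{
    if (needle.length == 0 || haystack.length < needle.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    auto s = "über-fast".representation; // 'ü' encodes as 0xC3 0xBC
    writeln(findAscii(s, "fast".representation)); // 6, a byte offset
}

Note that the result is a byte offset, and that the same 
byte-wise scan works over any single-byte encoding, which is 
the comparison being made above.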

> Also, you keep stating disadvantages of UTF-8 that are 
> completely false, like "slicing does require decoding". 
> Again, completely missing the point of UTF-8. I cannot 
> conceive how you can claim to understand how UTF-8 works yet 
> repeatedly demonstrate that you do not.
Slicing on code points requires decoding; I'm not sure how you 
don't know that.  If you mean slicing by byte, that is not only 
useless, it is something _every_ encoding can do.  I cannot 
conceive how you can claim to defend UTF-8 yet keep making such 
stupid points without bothering to back them up.
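
For anyone keeping score, "slicing on code points requires 
decoding" means this: turning a code-point index into a byte 
index requires walking every code point before it.  A minimal 
sketch in D (byteIndexOfCodePoint is a hypothetical helper, not 
a library function):

import std.stdio;
import std.utf : stride;

// Map a code-point index to a byte index: an O(n) walk over
// the preceding code points.  That walk is the decoding step.
size_t byteIndexOfCodePoint(string s, size_t cpIndex)
{
    size_t bytePos = 0;
    foreach (_; 0 .. cpIndex)
        bytePos += stride(s, bytePos); // byte length of this code point
    return bytePos;
}

void main()
{
    string s = "αβγδ"; // each Greek letter is 2 bytes in UTF-8
    writeln(s[byteIndexOfCodePoint(s, 1) .. byteIndexOfCodePoint(s, 3)]); // βγ
}

In a fixed-width single-byte encoding the code-point index and 
the byte index coincide, which is the efficiency claim above.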

> You are either ignorant or a successful troll. In either case, 
> I'm done here.
Must be nice to just insult someone who has demolished your 
arguments and leave.  Good riddance, you weren't adding anything.
