Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 11:26:52 PDT 2013


On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
> 25-May-2013 10:44, Joakim wrote:
>> Yes, on the encoding, if it's a variable-length encoding like 
>> UTF-8, no,
>> on the code space.  I was originally going to title my post, 
>> "Why
>> Unicode?" but I have no real problem with UCS, which merely 
>> standardized
>> a bunch of pre-existing code pages.  Perhaps there are a lot 
>> of problems
>> with UCS also, I just haven't delved into it enough to know.
>
> UCS is dead and gone. Next in line to "640K is enough for 
> everyone".
I think you are confused.  UCS refers to the Universal Character 
Set, which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, 
which I have never referred to.

>>> Separate code spaces were the case before Unicode (and UTF-8).
>>> The problem is not only that without a header the text is
>>> meaningless (no easy slicing) but that the encoding of the data
>>> after the header depends strongly on a variety of factors - a
>>> list of encodings, actually. Now everybody has to keep a (code)
>>> page per language to at least know if it's 2 bytes per char or
>>> 1 byte per char or whatever. And you still work on the basis
>>> that there are no combining marks and region-specific stuff :)
>> Everybody is still keeping code pages, UTF-8 hasn't changed 
>> that.
>
> Legacy. Hard to switch overnight. There are graphs indicating 
> that a few years from now you might never encounter a legacy 
> encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages.  I 
meant that there's not much of a difference between code pages 
with 2 bytes per char and the language character sets in UCS.

>> Does
>> UTF-8 not need "to at least know if it's 2 bytes per char or 1 
>> byte per
>> char or whatever?"
>
> It's coherent in its scheme to determine that. You don't need 
> extra information synced to the text, unlike header-based 
> schemes.
?!  It's okay because you deem it "coherent in its scheme?"  I 
deem headers much more coherent. :)
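
For reference, here is the determination Dmitry means: a UTF-8 
lead byte by itself tells you the length of its sequence, with no 
header anywhere.  A quick D sketch (the function is my own 
illustration, not Phobos):

import std.stdio;

// The lead byte of a UTF-8 sequence encodes its own length:
// 0xxxxxxx -> 1 byte (ASCII), 110xxxxx -> 2, 1110xxxx -> 3,
// 11110xxx -> 4.  Continuation bytes are all 10xxxxxx.
size_t sequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    assert(0, "continuation or invalid lead byte");
}

void main()
{
    writeln(sequenceLength('a'));  // 1
    writeln(sequenceLength(0xD0)); // 2 (a Cyrillic lead byte)
    writeln(sequenceLength(0xE6)); // 3 (a CJK lead byte)
}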

>> It has to do that also. Everyone keeps talking about
>> "easy slicing" as though UTF-8 provides it, but it doesn't.  
>> Phobos
>> turns UTF-8 into UTF-32 internally for all that ease of use, 
>> at least
>> doubling your string size in the process.  Correct me if I'm 
>> wrong, that
>> was what I read on the newsgroup sometime back.
>
> Indeed you are - searching for a UTF-8 substring in a UTF-8 
> string doesn't do any decoding, and it returns you a slice of 
> the remainder of the original.
Perhaps substring search doesn't strictly require decoding but 
you have changed the subject: slicing does require decoding and 
that's the use case you brought up to begin with.  I haven't 
looked into it, but I suspect substring search not requiring 
decoding is the exception for UTF-8 algorithms, not the rule.
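
To make the substring case concrete: a pure byte-level search 
never decodes anything and hands back a slice of the original.  A 
sketch in D (byteSearch is my own helper, not a Phobos function):

import std.algorithm.searching : find;
import std.stdio;
import std.string : representation;

// Byte-level substring search: no code point is ever decoded.
// Because UTF-8 is self-synchronizing, a full byte match of a
// valid UTF-8 needle can only begin on a code point boundary.
size_t byteSearch(string haystack, string needle)
{
    auto h = haystack.representation;          // immutable(ubyte)[]
    auto rest = h.find(needle.representation); // raw byte comparison
    return h.length - rest.length;             // index; == length if absent
}

void main()
{
    string s = "строка: find UTF-8 here";
    auto i = byteSearch(s, "UTF-8");
    writeln(s[i .. $]); // "UTF-8 here" - a slice of the original
}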

> ??? Simply makes no sense. There is no intersection between 
> some legacy encodings as of now. Or do you want to add N*(N-1) 
> cross-encodings for every combination of two? What about three 
> in one string?
I sketched two possible encodings above, neither of which would 
require "cross-encodings."

>>> We want monoculture! That is, to understand each other without
>>> all these "parlez-vous français?" and code pages of various
>>> complexity (insanity).
>> I hate monoculture, but then I haven't had to decipher some 
>> screwed-up
>> codepage in the middle of the night. ;)
>
> So you've never had trouble with internationalization? What 
> languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't 
had to code with the terrible code pages system from the past.  I 
can read and speak multiple languages, but I don't use anything 
other than English text.

>> That said, you could standardize
>> on UCS for your code space without using a bad encoding like 
>> UTF-8, as I
>> said above.
>
> UCS is a myth as of ~5 years ago. Early adopters of Unicode 
> fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above.  If that's a myth, 
Unicode is a myth. :)

> This is it, but it's far more flexible in the sense that it 
> allows multilingual strings just fine, and lone full-width 
> Unicode code points as well.
That's only because it uses a more complex header than a single 
byte for the language.  I noted long before you mentioned this 
Unicode compression scheme that my scheme could do the same by 
adding a more complex header.
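
To make that concrete, here is a toy sketch of the kind of header 
I mean; every name and value below is invented for illustration, 
not a finished proposal:

// Single-language string: one header byte names the 256-slot
// character table, then one byte per character, so indexing and
// slicing are O(1).
enum LanguageId : ubyte { ascii = 0, cyrillic = 1, greek = 2 }

struct HeaderString
{
    LanguageId lang;
    immutable(ubyte)[] chars;
}

// Multi-language text is the "more complex header": a run table,
// e.g. "5 chars of Cyrillic, then 12 of ASCII".
struct Run { LanguageId lang; uint count; }

struct MultiLangString
{
    Run[] header;
    immutable(ubyte)[] chars;
}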

>> But I get the impression that it's only for sending over
>> the wire, ie transmision, so all the processing issues that 
>> UTF-8
>> introduces would still be there.
>
> Use mime-type etc. Standards are always a bit stringy and 
> suboptimal; their acceptance rate is one of the chief advantages 
> they have. Unicode has horrifically large momentum now and not 
> a single organization aside from them tries to do this dirty 
> work (=i18n).
You misunderstand.  I was saying that this Unicode compression 
scheme doesn't help you with string processing; it is only for 
transmission and is probably fine for that, precisely because it 
seems to implement some version of my single-byte encoding 
scheme!  You do raise a good point: the only reason we're likely 
using such a bad encoding in UTF-8 is that nobody else wants to 
tackle this hairy problem.

> Consider adding another encoding for "Tuva" for instance. Now 
> you have to add 2*n conversion routines to match it to other 
> codepages/locales.
Not sure what you're referring to here.

> Beyond that - there are many things to consider in 
> internationalization and you would have to special case them 
> all by codepage.
Not necessarily.  But that is actually one of the advantages of 
single-byte encodings, as I have noted above: toUpper is a no-op 
for a single-byte-encoded string in an Asian script; you can't do 
that with a UTF-8 string.
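
To illustrate with real Phobos (std.uni.toUpper; the single-byte 
alternative is hypothetical):

import std.stdio;
import std.uni : toUpper;

void main()
{
    // std.uni.toUpper has to decode every code point just to
    // discover that CJK has no case at all.  Under a single-byte
    // scheme, the header alone would say "caseless script" and
    // the whole call would collapse to returning the input.
    string s = "日本語テキスト";
    writeln(s.toUpper()); // output is identical to the input
}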

>> If they're screwing up something so simple,
>> imagine how much worse everyone is screwing up something 
>> complex like
>> UTF-8?
>
> UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] 
> to a sequence of octets. It does that pretty well and is 
> compatible with ASCII; even the little rant you posted 
> acknowledged that. Now, are you against Unicode as a whole, or 
> what?
The BOM link I gave notes that UTF-8 isn't always 
ASCII-compatible.
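
Concretely, the UTF-8 BOM is the byte sequence EF BB BF, none of 
which is an ASCII byte:

import std.stdio;

void main()
{
    // "\uFEFF" encodes as EF BB BF in UTF-8, so a BOM-prefixed
    // file starts with three non-ASCII bytes that a naive ASCII
    // tool will trip over.
    auto text = cast(immutable(ubyte)[])"\uFEFFhello";
    immutable ubyte[] bom = [0xEF, 0xBB, 0xBF];
    writeln(text[0 .. 3] == bom); // true
}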

There are two parts to Unicode.  I don't know enough about UCS, 
the character set, ;) to be for it or against it, but I 
acknowledge that a standardized character set may make sense.  I 
am dead set against the UTF-8 variable-width encoding, for all 
the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
> 25-May-2013 13:05, Joakim wrote:
>> Nobody is talking about going back to code pages.  I'm talking 
>> about
>> going to single-byte encodings, which do not imply the 
>> problems that you
>> had with code pages way back when.
>
> The problem is that what you outline is isomorphic to code 
> pages. Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not.  For 
example, from the beginning, I have suggested a more complex 
header that can enable multi-language strings, as one possible 
solution.  I don't think code pages provided that.

> Well, if somebody got a quest to redefine UTF-8, they *might* 
> come up with something that is a bit faster to decode but 
> shares the same properties. Hardly a lifesaver anyway.
Perhaps not, but I suspect programmers will flock to a 
constant-width encoding that is much simpler and more efficient 
than UTF-8.  Programmer productivity is the biggest loss from the 
complexity of UTF-8, as I've noted before.
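
Phobos itself makes the cost difference visible; under a 
constant-width encoding the two numbers below would be equal, and 
"the n-th character" would be a plain O(1) array index:

import std.range : walkLength;
import std.stdio;

void main()
{
    string s = "привет";   // 6 characters, 12 bytes in UTF-8
    writeln(s.length);     // 12: .length counts code units
    writeln(s.walkLength); // 6: counting characters decodes them all
}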

>> The world may not "abandon Unicode," but it will abandon 
>> UTF-8, because
>> it's a dumb idea.  Unfortunately, such dumb ideas- XML 
>> anyone?- often
>> proliferate until someone comes up with something better to 
>> show how
>> dumb they are.
>
> Even children know XML is awful, redundant shit as an 
> interchange format. The hierarchical document is a nice idea, 
> though.
_We_ both know that, but many others don't, or XML wouldn't be as 
popular as it is. ;) I'm making a similar point about the more 
limited success of UTF-8, ie it's still shit.

