Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Sat May 25 12:03:50 PDT 2013


25-May-2013 22:26, Joakim wrote:
> On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
>> 25-May-2013 10:44, Joakim wrote:
>>> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
>>> on the code space.  I was originally going to title my post, "Why
>>> Unicode?" but I have no real problem with UCS, which merely standardized
>>> a bunch of pre-existing code pages.  Perhaps there are a lot of problems
>>> with UCS also, I just haven't delved into it enough to know.
>>
>> UCS is dead and gone. Next in line to "640K is enough for everyone".
> I think you are confused.  UCS refers to the Universal Character Set,
> which is the backbone of Unicode:
>
> http://en.wikipedia.org/wiki/Universal_Character_Set
>
> You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
> I have never referred to.

Yeah, I got confused there. Sorry about that.

>
>>>> Separate code spaces were the case before Unicode (and utf-8). The
>>>> problem is not only that without header text is meaningless (no easy
>>>> slicing) but the fact that encoding of data after header strongly
>>>> depends a variety of factors -  a list of encodings actually. Now
>>>> everybody has to keep a (code) page per language to at least know if
>>>> it's 2 bytes per char or 1 byte per char or whatever. And you still
>>>> work on a basis that there is no combining marks and regional specific
>>>> stuff :)
>>> Everybody is still keeping code pages, UTF-8 hasn't changed that.
>>
>> Legacy. Hard to switch overnight. There are graphs that indicate that
>> few years from now you might never encounter a legacy encoding
>> anymore, only UTF-8/UTF-16.
> I didn't mean that people are literally keeping code pages.  I meant
> that there's not much of a difference between code pages with 2 bytes
> per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I understand you right, you propose to define a string as a header 
that denotes a set of windows in the code space? I still fail to see 
how that would scale; see below.
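
For concreteness, here is roughly how I picture the single-window case 
(a hypothetical sketch only; the names and layout are mine, not part of 
your proposal):

// Hypothetical sketch: the header selects one 256-codepoint window of
// UCS and the payload stores one byte per character, relative to that
// window's base code point.
struct TaggedString
{
    uint windowBase;   // e.g. 0x0400 would select the Cyrillic block
    ubyte[] payload;   // each byte is an offset within that window

    dchar opIndex(size_t i) const
    {
        return cast(dchar)(windowBase + payload[i]);
    }
}

That much is easy; it's what happens once a string needs more than one 
window that I don't get, which is where the scaling question comes in.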

>>> It has to do that also. Everyone keeps talking about
>>> "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
>>> turns UTF-8 into UTF-32 internally for all that ease of use, at least
>>> doubling your string size in the process.  Correct me if I'm wrong, that
>>> was what I read on the newsgroup sometime back.
>>
>> Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
>> do any decoding and it does return you a slice of a balance of original.
> Perhaps substring search doesn't strictly require decoding but you have
> changed the subject: slicing does require decoding and that's the use
> case you brought up to begin with.  I haven't looked into it, but I
> suspect substring search not requiring decoding is the exception for
> UTF-8 algorithms, not the rule.

Mm... strictly speaking (let's turn that argument backwards): what 
algorithms require slicing, say [5..$], of a string without ever 
scanning it left to right, searching, etc.?
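
A small illustration of the property I mean (Phobos internals aside): a 
plain byte-wise search over UTF-8 can't produce a false hit starting in 
the middle of a multi-byte sequence, so the result is directly a valid 
slice of the original. The helper below is just for illustration:

import std.string : representation;

// Byte-level substring search; no decoding anywhere.
string findSlice(string haystack, string needle)
{
    auto h = haystack.representation;  // immutable(ubyte)[]
    auto n = needle.representation;
    if (n.length == 0 || n.length > h.length)
        return haystack;
    foreach (i; 0 .. h.length - n.length + 1)
        if (h[i .. i + n.length] == n)
            // i is guaranteed to start a code point, because a valid
            // needle never begins with a continuation byte (0x80..0xBF)
            return haystack[i .. $];
    return null;
}

unittest
{
    assert(findSlice("привет, world", "world") == "world");
    assert(findSlice("привет, world", "вет") == "вет, world");
}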

>> ??? Simply makes no sense. There is no intersection between some
>> legacy encodings as of now. Or do you want to add N*(N-1)
>> cross-encodings for any combination of 2? What about 3 in one string?
> I sketched two possible encodings above, none of which would require
> "cross-encodings."
>
>>>> We want monoculture! That is to understand each without all these
>>>> "par-le-vu-france?" and codepages of various complexity(insanity).
>>> I hate monoculture, but then I haven't had to decipher some screwed-up
>>> codepage in the middle of the night. ;)
>>
>> So you never had trouble of internationalization? What languages do
>> you use (read/speak/etc.)?
> This was meant as a point in your favor, conceding that I haven't had to
> code with the terrible code pages system from the past.  I can read and
> speak multiple languages, but I don't use anything other than English text.

Okay then.

>>> That said, you could standardize
>>> on UCS for your code space without using a bad encoding like UTF-8, as I
>>> said above.
>>
>> UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
>> that trap (Java, Windows NT). You shouldn't.
> UCS, the character set, as noted above.  If that's a myth, Unicode is a
> myth. :)

Yeah, that was a mishap on my part. I think I've seen your 2-byte 
argument way too often and it got concatenated to UCS, forming UCS-2 :)

>
>> This is it but it's far more flexible in a sense that it allows
>> multi-linguagal strings just fine and lone full-with unicode
>> codepoints as well.
> That's only because it uses a more complex header than a single byte for
> the language, which I noted could be done with my scheme, by adding a
> more complex header,

What would that look like? And how would the processing go?

> long before you mentioned this unicode compression
> scheme.

It uses inline headers, or rather tags, that hop between fixed-width 
character windows. It's not random access, nor does it claim to be.
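
To show what I mean by "not random access", a toy model of the inline 
tag idea (this is *not* the actual format, just the general shape of 
it):

enum ubyte TAG = 0xFF;   // hypothetical window-switch marker

// A tag byte switches the current 128-char window; every data byte is
// an offset into that window. To decode the k-th character you have to
// walk every tag before it, so there is no O(1) indexing.
dchar decodeNth(const(ubyte)[] data, size_t k)
{
    uint base = 0;        // current window base code point
    size_t seen = 0;
    for (size_t i = 0; i < data.length; ++i)
    {
        if (data[i] == TAG)
        {
            base = data[++i] * 128u;   // the next byte picks the window
            continue;
        }
        if (seen++ == k)
            return cast(dchar)(base + data[i]);
    }
    assert(0, "index out of range");
}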

>
>>> But I get the impression that it's only for sending over
>>> the wire, ie transmision, so all the processing issues that UTF-8
>>> introduces would still be there.
>>
>> Use mime-type etc. Standards are always a bit stringy and suboptimal,
>> their acceptance rate is one of chief advantages they have. Unicode
>> has horrifically large momentum now and not a single organization
>> aside from them tries to do this dirty work (=i18n).
> You misunderstand.  I was saying that this unicode compression scheme
> doesn't help you with string processing, it is only for transmission and
> is probably fine for that, precisely because it seems to implement some
> version of my single-byte encoding scheme!  You do raise a good point:
> the only reason why we're likely using such a bad encoding in UTF-8 is
> that nobody else wants to tackle this hairy problem.

Yup, where were you, say, almost 10 years ago? :)

>> Consider adding another encoding for "Tuva", for instance. Now you have
>> to add 2*n conversion routines to match it to other codepages/locales.
> Not sure what you're referring to here.
>
If you adopt the "map to UCS" policy, then nothing.

>> Beyond that - there are many things to consider in
>> internationalization and you would have to special case them all by
>> codepage.
> Not necessarily.  But that is actually one of the advantages of
> single-byte encodings, as I have noted above.  toUpper is a NOP for a
> single-byte encoding string with an Asian script, you can't do that with
> a UTF-8 string.

But you have to check what encoding it's in, and given that not all 
codepages are that simple to uppercase, some generic algorithm is 
required.
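
For contrast, this is what the Unicode side of it looks like in Phobos 
(assuming std.uni's table-driven toUpper): one generic routine covers 
every script in the same string, with no per-codepage dispatch at the 
call site.

import std.uni : toUpper;

unittest
{
    // Cyrillic and ASCII in one string, one algorithm.
    assert("привет, world".toUpper == "ПРИВЕТ, WORLD");
}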

>>> If they're screwing up something so simple,
>>> imagine how much worse everyone is screwing up something complex like
>>> UTF-8?
>>
>> UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
>> sequence of octets. It does it pretty well and compatible with ASCII,
>> even the little rant you posted acknowledged that. Now you are either
>> against Unicode as whole or what?
> The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.
>
> There are two parts to Unicode.  I don't know enough about UCS, the
> character set, ;) to be for it or against it, but I acknowledge that a
> standardized character set may make sense.  I am dead set against the
> UTF-8 variable-width encoding, for all the reasons listed above.

Okay, we are getting somewhere, now that I understand your position and 
see where I got myself confused along the way.
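
Since the "map [0..10FFFF] to a sequence of octets" bit came up above, 
here is a minimal sketch of that mapping for reference (illustration 
only, not Phobos code; surrogate checks omitted):

// Code points in [0 .. 0x10FFFF] become 1 to 4 octets; ASCII stays a
// single byte, so 7-bit text is untouched.
ubyte[] encodeUtf8(dchar c)
{
    if (c < 0x80)
        return [cast(ubyte)c];
    if (c < 0x800)
        return [cast(ubyte)(0xC0 | (c >> 6)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    if (c < 0x10000)
        return [cast(ubyte)(0xE0 | (c >> 12)),
                cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    return [cast(ubyte)(0xF0 | (c >> 18)),
            cast(ubyte)(0x80 | ((c >> 12) & 0x3F)),
            cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
            cast(ubyte)(0x80 | (c & 0x3F))];
}

unittest
{
    immutable ubyte[] ascii = [0x41];
    immutable ubyte[] pe    = [0xD0, 0xBF];
    assert(encodeUtf8('A') == ascii);   // ASCII unchanged
    assert(encodeUtf8('п') == pe);      // U+043F -> two octets
}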

> On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
>> 25-May-2013 13:05, Joakim wrote:
>>> Nobody is talking about going back to code pages.  I'm talking about
>>> going to single-byte encodings, which do not imply the problems that you
>>> had with code pages way back when.
>>
>> Problem is what you outline is isomorphic with code-pages. Hence the
>> grief of accumulated experience against them.
> They may seem superficially similar but they're not.  For example, from
> the beginning, I have suggested a more complex header that can enable
> multi-language strings, as one possible solution.  I don't think code
> pages provided that.

The problem is how you would define an uppercase algorithm for a 
multilingual string with 3 distinct 256-entry code spaces (windows). 
I bet it won't be pretty.
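
To make that concrete, a hypothetical sketch (the names and window 
layout are mine) of what it seems to entail. This toy handles exactly 
two windows and already ignores case expansions, combining marks and 
title case:

struct Segment { uint window; ubyte[] bytes; }  // one window per segment

ubyte upperInWindow(uint window, ubyte b)
{
    switch (window)
    {
        case 0x0000:   // Basic Latin: 'a'..'z' at offsets 0x61..0x7A
            return (b >= 0x61 && b <= 0x7A) ? cast(ubyte)(b - 0x20) : b;
        case 0x0400:   // Cyrillic: а..я at offsets 0x30..0x4F
            return (b >= 0x30 && b <= 0x4F) ? cast(ubyte)(b - 0x20) : b;
        default:       // every additional language needs another case
            return b;
    }
}

void upperCaseSegments(Segment[] s)
{
    foreach (ref seg; s)
        foreach (ref b; seg.bytes)
            b = upperInWindow(seg.window, b);
}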

>> Well if somebody get a quest to redefine UTF-8 they *might* come up
>> with something that is a bit faster to decode but shares the same
>> properties. Hardly a life saver anyway.
> Perhaps not, but I suspect programmers will flock to a constant-width
> encoding that is much simpler and more efficient than UTF-8.  Programmer
> productivity is the biggest loss from the complexity of UTF-8, as I've
> noted before.

I still don't see how your solution scales beyond 256 different 
codepoints per string (= multiple pages/parts of UCS ;) ).

-- 
Dmitry Olshansky

