Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 12:51:42 PDT 2013
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
> You can map a codepage to a subset of UCS :)
> That's what they do internally anyway.
> If I take you right, you propose to define a string as a
> header that denotes a set of windows in code space? I still
> fail to see how that would scale; see below.
Something like that. For a multi-language string encoding, the
header would contain a single byte for every language used in
the string, along with index bytes marking the start and end of
every run of single-language characters. So: a list of
languages and a list of pure single-language substrings. This
is just off the top of my head; I'm not suggesting it's
definitive.
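To make that concrete, here's a rough D sketch of the layout I
have in mind. All the names here are mine and hypothetical, and
I've used full offsets where a real format might squeeze the
indices into one or two bytes; it's just the "list of languages
plus list of runs" idea spelled out:

// One run of single-language characters in the payload.
struct Run
{
    ubyte lang;  // index into HeaderString.languages
    uint start;  // byte offset where the run begins
    uint end;    // byte offset one past where the run ends
}

// A multi-language string: the header (languages + runs)
// followed by the single-byte-encoded characters themselves.
struct HeaderString
{
    ubyte[] languages; // one byte per language used in the string
    Run[] runs;        // the pure single-language substrings
    ubyte[] payload;   // the encoded characters
}

The other sketches later in this post build on these two types.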
> Mm... strictly speaking (let's turn that argument backwards):
> what are the algorithms that require slicing, say [5..$], of
> a string without ever looking at it left to right, searching
> it, etc.?
I don't know; I was just pointing out that the claims of easy
slicing with UTF-8 are wrong. But a single-byte encoding would
also be much faster to scan: as I noted above, no decoding is
necessary, and processing single bytes will always be faster
than processing multi-byte sequences, even without decoding.
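To illustrate the slicing point, here's a small, self-contained
D example. Slicing a UTF-8 string at an arbitrary byte index
can split a multi-byte sequence, while a single-byte encoding
can be sliced anywhere:

import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    string s = "héllo";    // 'é' is two bytes in UTF-8
    auto tail = s[2 .. $]; // byte 2 is the middle of 'é'
    try
    {
        validate(tail);    // throws on the mangled slice
    }
    catch (UTFException e)
    {
        writeln("s[2 .. $] split a code point");
    }

    // A single-byte encoding has no such boundaries: every
    // byte index is a character boundary, so any slice is
    // well-formed.
    immutable ubyte[] latin1 = [0x68, 0xE9, 0x6C, 0x6C, 0x6F];
    auto ok = latin1[2 .. $]; // "llo" in Latin-1, always valid
}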
> What would it look like? Or how would the processing go?
I detailed it a bit above. As I mentioned earlier in this
thread, functions like toUpper would execute much faster
because you could skip over substrings in scripts that have no
uppercase, whereas with UTF-8 you have to scan them anyway.
>> long before you mentioned this Unicode compression
>> scheme.
>
> It does inline headers, or rather tags, that hop between
> fixed char windows. It's not random access, nor does it claim
> to be.
I wasn't criticizing it, just saying that it seems to be
superficially similar to my scheme. :)
>> version of my single-byte encoding scheme! You do raise a
>> good point:
>> the only reason why we're likely using such a bad encoding in
>> UTF-8 is
>> that nobody else wants to tackle this hairy problem.
>
> Yup, where have you been, say, almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never
have thought I'd be discussing Unicode today; I didn't even
know what it was back then.
>> Not necessarily. But that is actually one of the advantages
>> of single-byte encodings, as I have noted above. toUpper is
>> a NOP for a single-byte-encoded string in an Asian script;
>> you can't do that with a UTF-8 string.
>
> But you have to check what encoding it's in, and given that
> not all codepages are that simple to uppercase, some generic
> algorithm is required.
You have to check the language, but my point is that you can
look at the header and know that toUpper has nothing to do for
a single-byte-encoded string in an Asian script that has no
uppercase characters. With UTF-8, you have to decode the entire
string to find that out.
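Here's a minimal sketch of that shortcut in D. scriptHasCase is
an assumed lookup table mapping my hypothetical language bytes
to "does this script have case distinctions"; the point is only
that one pass over the tiny header replaces a full decode of
the payload:

import std.algorithm.searching : any;

// Assumed case table: true if the (hypothetical) language byte
// names a script with case distinctions; caseless scripts like
// Chinese, Japanese or Thai stay false.
immutable bool[256] scriptHasCase = () {
    bool[256] t;
    t[0x00] = true; // e.g. Latin
    t[0x01] = true; // e.g. Cyrillic
    return t;
}();

// Inspect only the header's language list, never the payload.
// If no script in the string has case, toUpper is a no-op for
// the whole string.
bool toUpperIsNop(const(ubyte)[] headerLanguages)
{
    return !headerLanguages.any!(lang => scriptHasCase[lang]);
}

void main()
{
    assert(toUpperIsNop([0x02, 0x03]));  // all caseless scripts
    assert(!toUpperIsNop([0x00, 0x02])); // a Latin run present
}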
>> They may seem superficially similar but they're not. For
>> example, from
>> the beginning, I have suggested a more complex header that can
>> enable
>> multi-language strings, as one possible solution. I don't
>> think code
>> pages provided that.
>
> The problem is how you would define an uppercase algorithm
> for a multilingual string with 3 distinct 256-character
> codespaces (windows)? I bet it won't be pretty.
How is it done now? It isn't pretty with UTF-8 either, as some
scripts have uppercase characters and others don't. The version
of toUpper for my encoding would be similar, but it would do
less work, because it wouldn't have to be invoked for every
character in the string.
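Roughly like this, reusing the HeaderString, Run and
scriptHasCase sketches from earlier in this post. upperByte
stands in for whatever per-language uppercase table a real
implementation would use; the point is that caseless runs are
skipped wholesale instead of being examined character by
character:

// Hypothetical per-language uppercase for one byte. A real
// implementation would index one 256-byte table per cased
// language; only an ASCII-style Latin window (lang 0x00) is
// filled in here.
ubyte upperByte(ubyte lang, ubyte c)
{
    if (lang == 0x00 && c >= 'a' && c <= 'z')
        return cast(ubyte)(c - ('a' - 'A'));
    return c; // everything else left alone in this sketch
}

// Uppercase a HeaderString run by run. Runs in caseless
// scripts are skipped without touching the payload at all;
// only the cased runs are walked byte by byte.
void toUpperInPlace(ref HeaderString s)
{
    foreach (run; s.runs)
    {
        if (!scriptHasCase[run.lang])
            continue; // e.g. a Chinese run: nothing to do

        foreach (i; run.start .. run.end)
            s.payload[i] = upperByte(run.lang, s.payload[i]);
    }
}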
> I still don't see how your solution scales to beyond 256
> different codepoints per string (= multiple pages/parts of UCS
> ;) ).
I assume you're talking about the Chinese, Korean, etc.
alphabets? I mentioned those to Walter earlier: they would need
a two-byte encoding. There's no way around that, but they would
still be easier to deal with than UTF-8, because of the header.
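To sketch what I mean, here's a hypothetical extension of the
Run type from my first example, with a made-up charWidth field
tagging a run as one or two bytes per character. Indexing
within such a run stays a constant-time multiply, where UTF-8
needs a left-to-right decode to find the same character:

// Hypothetical extension of Run for large scripts: each run
// carries a fixed character width.
struct WideRun
{
    ubyte lang;
    ubyte charWidth; // 1 for small alphabets, 2 for e.g. Chinese
    uint start;      // byte offset of the run in the payload
    uint end;
}

// The n-th character of a run, found with one multiply rather
// than a variable-length decode.
const(ubyte)[] charAt(const(ubyte)[] payload, WideRun r, size_t n)
{
    immutable off = r.start + n * r.charWidth;
    return payload[off .. off + r.charWidth];
}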