Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 12:02:42 PDT 2013
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>> I think you are a little confused about what unicode actually
>>> is... Unicode has nothing to do with code pages and nobody
>>> uses code pages any more except for compatibility with legacy
>>> applications (with good reason!).
>> Incorrect.
>>
>> "Unicode is an effort to include all characters from previous
>> code pages into a single character enumeration that can be
>> used with a number of encoding schemes... In practice the
>> various Unicode character set encodings have simply been
>> assigned their own code page numbers, and all the other code
>> pages have been technically redefined as encodings for various
>> subsets of Unicode."
>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>>
>
> That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode
having "nothing to do with code pages." All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set. For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
A code page is just a table of mappings; UCS is simply a much
larger, standardized table of such mappings.
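To make that concrete, here is a toy D sketch (the table is
hypothetical, loosely following the Latin-1 layout) of what a
single-byte code page amounts to:

import std.stdio;

void main()
{
    // A legacy code page is essentially a 256-entry table mapping
    // byte values to characters; UCS is the same idea with one
    // much larger, standardized table.
    dchar[256] page;                // toy code page, Latin-1-like
    foreach (i; 0 .. 256)
        page[i] = cast(dchar) i;    // ISO-8859-1 maps 1:1 onto U+0000..U+00FF

    ubyte b = 0xE9;                 // 'é' in Latin-1
    writeln(page[b]);               // prints: é
}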
>>> You said that phobos converts UTF-8 strings to UTF-32 before
>>> operating on them but that's not true. As it iterates over
>>> UTF-8 strings it iterates over dchars rather than chars, but
>>> that's not in any way inefficient so I don't really see the
>>> problem.
>> And what's a dchar? Let's check:
>>
>> dchar : unsigned 32 bit UTF-32
>> http://dlang.org/type.html
>>
>> Of course that's inefficient, you are translating your whole
>> encoding over to a 32-bit encoding every time you need to
>> process it. Walter as much as said so up above.
>
> Given that all the machine registers are at least 32-bits
> already it doesn't make the slightest difference. The only
> additional operations on top of ascii are when it's a
> multi-byte character, and even then it's some simple bit
> manipulation which is as fast as any variable width encoding is
> going to get.
I see you've quietly abandoned your claim that phobos doesn't
convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32
is "as fast as any variable width encoding is going to get," but
my claim is that single-byte encodings will be faster.
> The only alternatives to a variable width encoding I can see
> are:
> - Single code page per string
> This is completely useless because now you can't concatenate
> strings of different code pages.
I wouldn't be so quick to ditch this. There is a real argument
to be made that strings in different languages are sufficiently
different that there should be no multi-language strings. Is
this the best route? I'm not sure, but I certainly wouldn't
dismiss it out of hand.
> - Multiple code pages per string
> This just makes everything overly complicated and is far slower
> to decode what the actual character is than UTF-8.
I disagree; this would still be far faster to decode than
UTF-8, particularly if you designed your header right.
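Purely as a hypothetical sketch of such a header (the Run and
MultiPageString layout is made up for illustration, not a
worked-out proposal), something like this keeps decoding at one
table lookup per byte:

// One entry per run of characters that share a code page.
struct Run
{
    ushort codePage;   // which 256-entry table this run uses
    uint   length;     // how many single-byte characters follow
}

// Header up front, then one byte per character in the payload.
struct MultiPageString
{
    Run[]   runs;
    ubyte[] bytes;
}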
> - String with escape sequences to change code page
> Can no longer access characters in the middle or end of the
> string, you have to parse the entire string every time which
> completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that
it's sub-optimal.
>>> Also your complaint that UTF-8 reserves the short characters
>>> for the english alphabet is not really relevant - the
>>> characters with longer encodings tend to be rarer (such as
>>> special symbols) or carry more information (such as chinese
>>> characters where the same sentence takes only about 1/3 the
>>> number of characters).
>> The vast majority of non-english alphabets in UCS can be
>> encoded in a single byte. It is your exceptions that are not
>> relevant.
>
> Well obviously... That's like saying "if you know what the
> exact contents of a file are going to be anyway you can
> compress it to a single byte!"
>
> ie. It's possible to devise an encoding which will encode any
> given string to an arbitrarily small size. It's still
> completely useless because you'd have to know the string in
> advance...
No, it's not the same at all. The contents of an
arbitrary-length file cannot be compressed to a single byte; you
would have collisions galore. But since most non-English
alphabets have fewer than 256 characters, each of them can be
uniquely encoded with a single byte per character, with the
header determining which language's code page to use. I don't
understand your analogy at all.
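To spell out what I have in mind, here is a hypothetical sketch
in D (the one-byte header and the page tables are made up for
illustration): the first byte picks the language's code page,
and every byte after it is exactly one character.

// Hypothetical single-byte scheme, not an existing standard.
dstring decode(const(ubyte)[] data, const dchar[256][] pages)
{
    auto page = pages[data[0]];    // header byte selects the code page
    dchar[] result;
    foreach (b; data[1 .. $])
        result ~= page[b];         // fixed width: one lookup per byte
    return result.idup;
}

The point is that there is no per-character branching on lead
bytes; the length in characters is just the byte length minus
the header.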
> - A useful encoding has to be able to handle every unicode
> character
> - As I've shown the only space-efficient way to do this is
> using a variable length encoding like UTF-8
You haven't shown this.
> - Given the frequency distribution of unicode characters, UTF-8
> does a pretty good job at encoding higher frequency characters
> in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character
takes at least two bytes to encode in UTF-8, whereas my
single-byte scheme would encode every alphabet with fewer than
256 characters using one byte per character.
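You can check the byte counts yourself in D, since .length on a
string counts UTF-8 code units (bytes), not characters:

import std.stdio;

void main()
{
    writeln("e".length);   // 1 byte  (ASCII)
    writeln("é".length);   // 2 bytes (Latin)
    writeln("ц".length);   // 2 bytes (Cyrillic)
    writeln("अ".length);   // 3 bytes (Devanagari)
}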
> - Yes you COULD encode non-english alphabets in a single byte
> but doing so would be inefficient because it would mean the
> more frequently used characters take more bytes to encode.
Not sure what you mean by this.