Why UTF-8/16 character encodings?

Diggory diggsey at googlemail.com
Sat May 25 11:09:25 PDT 2013


On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>> I think you are a little confused about what unicode actually 
>> is... Unicode has nothing to do with code pages and nobody 
>> uses code pages any more except for compatibility with legacy 
>> applications (with good reason!).
> Incorrect.
>
> "Unicode is an effort to include all characters from previous 
> code pages into a single character enumeration that can be used 
> with a number of encoding schemes... In practice the various 
> Unicode character set encodings have simply been assigned their 
> own code page numbers, and all the other code pages have been 
> technically redefined as encodings for various subsets of 
> Unicode."
> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>

That confirms exactly what I just said...

>> You said that phobos converts UTF-8 strings to UTF-32 before 
>> operating on them but that's not true. As it iterates over 
>> UTF-8 strings it iterates over dchars rather than chars, but 
>> that's not in any way inefficient so I don't really see the 
>> problem.
> And what's a dchar?  Let's check:
>
> dchar : unsigned 32 bit UTF-32
> http://dlang.org/type.html
>
> Of course that's inefficient, you are translating your whole 
> encoding over to a 32-bit encoding every time you need to 
> process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32 bits 
already, it doesn't make the slightest difference. The only 
additional operations on top of ASCII are when it's a multi-byte 
character, and even then it's some simple bit manipulation, 
which is as fast as any variable-width encoding is going to get.
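
To make that concrete, here's a minimal sketch in D of the 
decoding step (decodeOne is a hypothetical helper, not Phobos' 
actual decoder, and validation of malformed input is omitted):

    import std.stdio;

    // Decode one UTF-8 sequence starting at s[i] into a 32-bit
    // dchar, advancing i past it. A few masks and shifts; no
    // UTF-32 copy of the whole string is ever made.
    dchar decodeOne(const(char)[] s, ref size_t i)
    {
        uint c = s[i++];
        if (c < 0x80) return cast(dchar) c;    // ASCII fast path
        int extra = c >= 0xF0 ? 3 : c >= 0xE0 ? 2 : 1;
        uint result = c & (0x3F >> extra);     // lead byte payload
        foreach (_; 0 .. extra)                // continuation bytes
            result = (result << 6) | (s[i++] & 0x3F);
        return cast(dchar) result;
    }

    void main()
    {
        string s = "héllo";          // 'é' is two bytes in UTF-8
        for (size_t i = 0; i < s.length; )
            writeln(decodeOne(s, i));
    }

This is effectively what foreach (dchar c; s) does under the 
hood: decode one character at a time, on the fly.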

The only alternatives to a variable-width encoding I can see are:

- Single code page per string
This is completely useless because now you can't concatenate 
strings of different code pages.

- Multiple code pages per string
This just makes everything overly complicated, and working out 
what the actual character is becomes far slower than decoding 
UTF-8.

- String with escape sequences to change code page
You can no longer access characters in the middle or at the end 
of the string directly; you have to parse the entire string from 
the start every time, which completely negates the benefit of a 
fixed-width encoding.

- An encoding wide enough to store every character
This is just UTF-32 (see the size comparison below).
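
To put numbers on that last option, here's a small D comparison 
(the sample text is arbitrary) of what the same nine characters 
cost in each Unicode encoding:

    import std.stdio, std.conv;

    void main()
    {
        string  u8  = "Hello, 世界";   // UTF-8
        wstring u16 = to!wstring(u8);  // UTF-16
        dstring u32 = to!dstring(u8);  // UTF-32

        // Same nine characters, three different byte costs:
        writefln("UTF-8 : %2d bytes", u8.length  * char.sizeof);  // 13
        writefln("UTF-16: %2d bytes", u16.length * wchar.sizeof); // 18
        writefln("UTF-32: %2d bytes", u32.length * dchar.sizeof); // 36
    }

On mostly-ASCII text, the fixed-width option quadruples the 
storage without buying any extra expressiveness.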

>
>> Also your complaint that UTF-8 reserves the short characters 
>> for the english alphabet is not really relevant - the 
>> characters with longer encodings tend to be rarer (such as 
>> special symbols) or carry more information (such as chinese 
>> characters where the same sentence takes only about 1/3 the 
>> number of characters).
> The vast majority of non-english alphabets in UCS can be 
> encoded in a single byte.  It is your exceptions that are not 
> relevant.

Well obviously... That's like saying "if you know what the exact 
contents of a file are going to be anyway you can compress it to 
a single byte!"

I.e. it's possible to devise an encoding which will encode any 
given string to an arbitrarily small size. It's still completely 
useless, because you'd have to know the string in advance...

- A useful encoding has to be able to handle every Unicode 
character.
- As I've shown, the only space-efficient way to do this is a 
variable-length encoding like UTF-8.
- Given the frequency distribution of Unicode characters, UTF-8 
does a pretty good job of encoding higher-frequency characters 
in fewer bytes (the byte counts below illustrate this).
- Yes, you COULD encode non-English alphabets in a single byte, 
but doing so would be inefficient because it would mean the more 
frequently used characters take more bytes to encode.
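
For example, std.utf.codeLength reports how many UTF-8 code 
units (bytes) a given character needs, which tracks that 
frequency argument (the sample characters are arbitrary):

    import std.stdio, std.utf;

    void main()
    {
        // ASCII letter, accented Latin, currency sign, emoji:
        foreach (dchar c; "aé€😀"d)
            writefln("U+%06X takes %d byte(s) in UTF-8",
                     cast(uint) c, codeLength!char(c));
        // Prints 1, 2, 3 and 4 bytes respectively.
    }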

