Why UTF-8/16 character encodings?

Diggory diggsey at googlemail.com
Sat May 25 12:42:40 PDT 2013


On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
>> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>>> I think you are a little confused about what unicode 
>>>> actually is... Unicode has nothing to do with code pages and 
>>>> nobody uses code pages any more except for compatibility 
>>>> with legacy applications (with good reason!).
>>> Incorrect.
>>>
>>> "Unicode is an effort to include all characters from previous 
>>> code pages into a single character enumeration that can be 
>>> used with a number of encoding schemes... In practice the 
>>> various Unicode character set encodings have simply been 
>>> assigned their own code page numbers, and all the other code 
>>> pages have been technically redefined as encodings for 
>>> various subsets of Unicode."
>>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>>>
>>
>> That confirms exactly what I just said...
> No, that directly _contradicts_ what you said about Unicode 
> having "nothing to do with code pages."  All UCS did is take a 
> bunch of existing code pages and standardize them into one 
> massive character set.  For example, ISCII was a pre-existing 
> single-byte encoding and Unicode "largely preserves the ISCII 
> layout within each block."
> http://en.wikipedia.org/wiki/ISCII
>
> All a code page is is a table of mappings, UCS is just a much 
> larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a
replacement for them. A code page is a strict subset of the
possible characters, whereas UCS is the entire set of possible
characters.
>
>>>> You said that phobos converts UTF-8 strings to UTF-32 before 
>>>> operating on them but that's not true. As it iterates over 
>>>> UTF-8 strings it iterates over dchars rather than chars, but 
>>>> that's not in any way inefficient so I don't really see the 
>>>> problem.
>>> And what's a dchar?  Let's check:
>>>
>>> dchar : unsigned 32 bit UTF-32
>>> http://dlang.org/type.html
>>>
>>> Of course that's inefficient, you are translating your whole 
>>> encoding over to a 32-bit encoding every time you need to 
>>> process it.  Walter as much as said so up above.
>>
>> Given that all the machine registers are at least 32-bits 
>> already it doesn't make the slightest difference. The only 
>> additional operations on top of ascii are when it's a 
>> multi-byte character, and even then it's some simple bit 
>> manipulation which is as fast as any variable width encoding 
>> is going to get.
> I see you've abandoned without note your claim that phobos 
> doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting 
> to UTF-32 is "as fast as any variable width encoding is going 
> to get" but my claim is that single-byte encodings will be 
> faster.

I haven't "abandoned my claim". It's a simple fact that phobos 
does not convert UTF-8 strings to UTF-32 strings before it uses 
them.

i.e. the difference between this:
// convert the whole string up front (allocates a new dstring):
string mystr = ...;
dstring temp = mystr.to!dstring;  // std.conv.to
for (size_t i = 0; i < temp.length; ++i)
     process(temp[i]);

and this:
// decode one code point at a time, in place, with no allocation:
string mystr = ...;
size_t i = 0;
while (i < mystr.length) {
     dchar current = decode(mystr, i);  // std.utf.decode advances i
     process(current);
}

And if you can't see why the latter example is far more efficient 
I give up...
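
To make that concrete, here is a small self-contained version of
the second form (counting code points in place of the
hypothetical process() call, on an arbitrary test string).
Iterating a string by dchar decodes lazily and never builds a
UTF-32 copy:

import std.stdio;
import std.utf : byDchar;

void main()
{
    string mystr = "naïve 東京";  // UTF-8 in memory, multi-byte chars

    // foreach with a dchar loop variable decodes on the fly;
    // no dstring is ever allocated.
    size_t count;
    foreach (dchar c; mystr)
        ++count;
    writeln(count, " code points in ", mystr.length, " bytes");

    // std.utf.byDchar exposes the same lazy decoding as a range.
    foreach (dchar c; mystr.byDchar)
        write(c, ' ');
    writeln();
}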

>
>> The only alternatives to a variable width encoding I can see 
>> are:
>> - Single code page per string
>> This is completely useless because now you can't concatenate 
>> strings of different code pages.
> I wouldn't be so fast to ditch this.  There is a real argument 
> to be made that strings of different languages are sufficiently 
> different that there should be no multi-language strings.  Is 
> this the best route?  I'm not sure, but I certainly wouldn't 
> dismiss it out of hand.
>
>> - Multiple code pages per string
>> This just makes everything overly complicated and is far 
>> slower to decode what the actual character is than UTF-8.
> I disagree, this would still be far faster than UTF-8, 
> particularly if you designed your header right.

The cache misses alone caused by accessing the separate headers 
would be a larger overhead than decoding UTF-8, which takes a 
few assembly instructions, has perfect locality and can be 
efficiently pipelined by the CPU.
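
For the record, that bit manipulation really is just a handful
of masks and shifts. Here is a simplified sketch (a toy
decodeOne helper of my own, not Phobos code, and with no
validation of the input, unlike std.utf.decode):

import std.stdio;

// Decode one code point starting at s[i], advancing i.
dchar decodeOne(const(char)[] s, ref size_t i)
{
    uint b = s[i++];
    if (b < 0x80)        // 0xxxxxxx: ASCII, one byte
        return cast(dchar) b;
    if (b < 0xE0)        // 110xxxxx 10xxxxxx: two bytes
        return cast(dchar)(((b & 0x1F) << 6) | (s[i++] & 0x3F));
    if (b < 0xF0)        // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
    {
        uint c = (b & 0x0F) << 12;
        c |= (s[i++] & 0x3F) << 6;
        c |= s[i++] & 0x3F;
        return cast(dchar) c;
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes
    uint d = (b & 0x07) << 18;
    d |= (s[i++] & 0x3F) << 12;
    d |= (s[i++] & 0x3F) << 6;
    d |= s[i++] & 0x3F;
    return cast(dchar) d;
}

void main()
{
    string s = "aä€😀";  // 1-, 2-, 3- and 4-byte sequences
    size_t i = 0;
    while (i < s.length)
        writefln("U+%04X", cast(uint) decodeOne(s, i));
}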

Then there's all the extra processing involved in combining the 
headers when you concatenate strings. Plus you lose the one 
benefit a fixed-width encoding has, because random access is no 
longer possible without first finding out which header controls 
the location you want to access.

>
>> - String with escape sequences to change code page
>> Can no longer access characters in the middle or end of the 
>> string, you have to parse the entire string every time which 
>> completely negates the benefit of a fixed width encoding.
> I didn't think of this possibility, but you may be right that 
> it's sub-optimal.
>
>>>> Also your complaint that UTF-8 reserves the short characters 
>>>> for the english alphabet is not really relevant - the 
>>>> characters with longer encodings tend to be rarer (such as 
>>>> special symbols) or carry more information (such as chinese 
>>>> characters where the same sentence takes only about 1/3 the 
>>>> number of characters).
>>> The vast majority of non-english alphabets in UCS can be 
>>> encoded in a single byte.  It is your exceptions that are not 
>>> relevant.
>>
>> Well obviously... That's like saying "if you know what the 
>> exact contents of a file are going to be anyway you can 
>> compress it to a single byte!"
>>
>> ie. It's possible to devise an encoding which will encode any 
>> given string to an arbitrarily small size. It's still 
>> completely useless because you'd have to know the string in 
>> advance...
> No, it's not the same at all.  The contents of an 
> arbitrary-length file cannot be compressed to a single byte, 
> you would have collisions galore.  But since most non-english 
> alphabets are less than 256 characters, they can all be 
> uniquely encoded in a single byte per character, with the 
> header determining what language's code page to use.  I don't 
> understand your analogy whatsoever.

It's very simple: the more you know about the type of data you 
are compressing at the time of writing the algorithm, the better 
the compression ratio you can get, to the point that if you know 
exactly what the file is going to contain you can compress it to 
nothing. This is why there are specialised compression 
algorithms for images, video, audio, etc.

It doesn't matter how few characters non-English alphabets have: 
unless you know WHICH alphabet it is beforehand you can't store 
it in a single byte. Since any given character could be in any 
alphabet, the best you can do is look at the probabilities of 
different characters appearing and use shorter representations 
for more common ones. (This is the basis of all lossless 
compression.) The English alphabet plus 0-9 and basic 
punctuation are by far the most common characters used on 
computers, so it makes sense to use one byte for those and 
multiple bytes for rarer characters.
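
You can see the trade-off directly by comparing code points with
encoded bytes for a few arbitrary sample strings:

import std.stdio;
import std.range : walkLength;

void main()
{
    // Arbitrary samples: ASCII, accented Latin, Cyrillic, Japanese.
    foreach (s; ["hello", "héllo", "привет", "こんにちは"])
        writefln("%s: %s code points, %s UTF-8 bytes",
                 s, s.walkLength, s.length);
}

ASCII stays at one byte per character, accented Latin and
Cyrillic cost two, and CJK costs three, which is exactly the
frequency-based trade-off described above.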

>
>> - A useful encoding has to be able to handle every unicode 
>> character
>> - As I've shown the only space-efficient way to do this is 
>> using a variable length encoding like UTF-8
> You haven't shown this.

If you had thought through your suggestion of multiple code 
pages per string, you would see that I had.

>
>> - Given the frequency distribution of unicode characters, 
>> UTF-8 does a pretty good job at encoding higher frequency 
>> characters in fewer bytes.
> No, it does a very bad job of this.  Every non-ASCII character 
> takes at least two bytes to encode, whereas my single-byte 
> encoding scheme would encode every alphabet with less than 256 
> characters in a single byte.

And strings with mixed characters would use lots of memory and 
be extremely slow. Mixed scripts are common when using proper 
names, quotations, inline translations, graphical characters, 
etc. Not to mention the added complexity of actually 
implementing the algorithms.


