Why UTF-8/16 character encodings?

Diggory diggsey at googlemail.com
Sat May 25 15:41:58 PDT 2013


On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
> I have noted from the beginning that these large alphabets have 
> to be encoded to two bytes, so it is not a true constant-width 
> encoding if you are mixing one of those languages into a 
> single-byte encoded string.  But this "variable length" 
> encoding is so much simpler than UTF-8, there's no comparison.

All I can say is: if you think that is simpler than UTF-8, then 
you have completely the wrong idea about UTF-8.

Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII (top bit clear) then we're done
3) Otherwise count the number of '1's at the start of the byte - 
this is how many bytes make up the character (there's even an ASM 
instruction for this: count leading zeros on the inverted byte)
4) This first byte will look like '1110xxxx' for a 3 byte 
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x' bits together - that IS the 
code point, no offset and no table lookup needed

Note that this is CONSTANT TIME, O(1), with minimal branching, so 
it's well suited to pipelining (after the initial byte, the 
remaining bytes can all be processed in parallel by the CPU); it 
uses only sequential memory access, so essentially no cache 
misses; and it has zero additional memory requirements.
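
To make that concrete, here is the whole thing as a minimal 
sketch in D. It assumes well-formed input - real code would also 
validate the continuation bytes and reject overlong sequences:

    import core.bitop : bsr;

    // Decode the code point whose first byte is str[i].
    // Sketch only: assumes well-formed UTF-8 (no validation of
    // continuation bytes, no rejection of overlong sequences).
    dchar decodeAt(const(ubyte)[] str, size_t i)
    {
        ubyte b = str[i];
        if (b < 0x80)
            return b;                    // step 2: ASCII, done

        // step 3: count the leading 1s; bsr on the inverted byte
        // finds the first 0, which ends the run of 1s
        int len = 7 - bsr(~b & 0xFF);

        // step 4: mask away the length marker in the first byte
        uint cp = b & (0x7F >> len);

        // steps 5-6: each trailing '10xxxxxx' byte adds 6 bits
        foreach (j; 1 .. len)
            cp = (cp << 6) | (str[i + j] & 0x3F);

        return cast(dchar) cp;
    }

Running it on 'é' encoded as 0xC3 0xA9: len comes out as 2, the 
first byte contributes the bits 00011, the second 101001, giving 
U+00E9 - no tables, no header, no searching.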

Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N) 
with lots of branching
2) Look up the code page ID in a massive array of code pages to 
work out how many bytes per character that page uses
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into 
a number
5) Look up this new number in yet another large array specific to 
the code page
6) Hope this array hasn't been paged out and is still in the 
cache too

This is O(log N), with lots of branching so no pipelining (every 
stage depends on the result of the stage before), lots of random 
memory access so lots of cache misses, and lots of additional 
memory required to store all those tables. And the algorithm 
isn't even any easier to understand.
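
Since the proposal doesn't pin down any data structures, here's 
my guess at what a decoder for it would have to look like, again 
in D. The Segment and CodePage types and the toUnicode table are 
all hypothetical, invented just to follow the six steps above:

    // Everything here is hypothetical: the proposal doesn't
    // specify its data structures, so Segment, CodePage and
    // toUnicode are my inventions to follow the six steps above.
    struct Segment  { size_t start; ushort pageId; } // header entry
    struct CodePage { ubyte bytesPerChar; immutable(dchar)[] toUnicode; }

    dchar decodeAt(const(ubyte)[] str, const(Segment)[] header,
                   const(CodePage)[] pages, size_t i)
    {
        // step 1: binary search the header for the segment
        // containing offset i
        size_t lo = 0, hi = header.length;
        while (hi - lo > 1)
        {
            immutable mid = (lo + hi) / 2;
            if (header[mid].start <= i) lo = mid;
            else                        hi = mid;
        }

        // steps 2-3: fetch the code page and hope it's in cache
        auto page = pages[header[lo].pageId];

        // step 4: combine that many bytes into one number
        uint n = 0;
        foreach (j; 0 .. page.bytesPerChar)
            n = (n << 8) | str[i + j];

        // steps 5-6: map through the per-page table (another
        // cache gamble) to finally get the code point
        return page.toUnicode[n];
    }

Even just the binary search in step 1 involves more branching 
than the entire UTF-8 decoder above.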

Plus every algorithm that operates on it, other than decoding 
itself, is insanely complicated.

