Why UTF-8/16 character encodings?
Diggory
diggsey at googlemail.com
Sat May 25 15:41:58 PDT 2013
On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
> I have noted from the beginning that these large alphabets have
> to be encoded to two bytes, so it is not a true constant-width
> encoding if you are mixing one of those languages into a
> single-byte encoded string. But this "variable length"
> encoding is so much simpler than UTF-8, there's no comparison.
All I can say is: if you think that is simpler than UTF-8, then you
have completely the wrong idea about UTF-8.
Let me explain:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte -
this is how many bytes make up the character (there's even an ASM
instruction to do this)
4) The first byte will look like '110xxxxx' for a 2 byte
character, '1110xxxx' for a 3 byte character, or '11110xxx' for a
4 byte character (UTF-8 sequences are at most 4 bytes)
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x' bits together - the result is
the code point itself, no offset needed (see the sketch right
after this list)
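Since this is the D newsgroup, here is a minimal D sketch of those
six steps. It does no validation of malformed input, and decodeUtf8
is just a name I picked for illustration, not a Phobos function:

// A minimal sketch of steps 1-6; malformed sequences are not checked.
uint decodeUtf8(const(ubyte)[] s, ref size_t i)
{
    immutable ubyte b = s[i++];
    if (b < 0x80)
        return b;                      // step 2: plain ASCII, done

    // step 3: count the leading 1 bits - the length of the sequence
    uint len = 0;
    while (b & (0x80 >> len)) ++len;   // 2, 3 or 4 for valid input

    // step 4: mask off the length prefix, keep the payload 'x' bits
    uint cp = b & (0x7F >> len);

    // steps 5/6: each continuation byte ('10xxxxxx') adds six more bits
    foreach (_; 1 .. len)
        cp = (cp << 6) | (s[i++] & 0x3F);

    return cp;
}

For example, 'é' (U+00E9) is encoded as 0xC3 0xA9: the lead byte
gives len = 2 and payload bits 00011, the continuation byte adds
101001, and concatenating them yields 0xE9.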
Note that this is CONSTANT TIME, O(1), with minimal branching, so
it is well suited to pipelining (after the initial byte, the
remaining bytes can all be processed in parallel by the CPU). It
uses only sequential memory access, so there are no cache misses,
and it requires zero additional memory.
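For what it's worth, D exposes that instruction through
core.bitop.bsr - assuming a leading-bit count via BSR/LZCNT is what
step 3 alludes to - so the sequence length can be read off the lead
byte without looping over its bits:

import core.bitop : bsr;

// Sequence length from the lead byte: count its leading 1 bits.
// bsr compiles down to the x86 BSR instruction (index of the
// highest set bit), applied here to the inverted byte.
uint seqLength(ubyte b)
{
    return b < 0x80 ? 1 : 7 - bsr(~b & 0xFF);
}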
Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N),
with lots of branching (a rough sketch of the whole lookup follows
this list)
2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into
a number
5) Look up this new number in yet another large array specific to
the code page
6) Hope this array hasn't been paged out and is still in the
cache too
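To make the dependency chain concrete, here is a rough D sketch of
those steps. Be aware that Segment, bytesPerChar and toCodePoint
are all names I've invented for illustration - the proposal only
outlines the scheme:

// Purely illustrative: the header is assumed to be a sorted array
// of (offset, code page) pairs, with two hypothetical lookup tables.
struct Segment { size_t offset; ushort codePage; }

uint decodeAt(const(ubyte)[] s, const(Segment)[] header,
              const(uint)[] bytesPerChar,    // width per code page
              const(uint[])[] toCodePoint,   // mapping per code page
              size_t i)
{
    // step 1: binary search the header for the segment containing i
    size_t lo = 0, hi = header.length;
    while (hi - lo > 1)
    {
        immutable mid = (lo + hi) / 2;
        if (header[mid].offset <= i) lo = mid;
        else                         hi = mid;
    }
    immutable page = header[lo].codePage;   // result feeds step 2...

    // steps 2/3: width lookup in a per-code-page table
    immutable width = bytesPerChar[page];   // ...which feeds step 4...

    // step 4: combine that many bytes into one number
    uint n = 0;
    foreach (j; 0 .. width)
        n = (n << 8) | s[i + j];

    // steps 5/6: a second per-code-page table gives the code point
    return toCodePoint[page][n];
}

Notice that every lookup needs the result of the one before it,
which is exactly the pipelining problem.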
This is O(log N), with lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of random
memory access so lots of cache misses, and lots of additional
memory required to store all those tables - and it's an algorithm
that isn't even any easier to understand.
Plus, every algorithm that operates on it, apart from decoding, is
insanely complicated.