Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 07:58:01 PDT 2013


On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev 
wrote:
> On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
>> Are you sure _you_ understand it properly?  Both encodings 
>> have to check every single character to test for whitespace, 
>> but the single-byte encoding simply has to load each byte in 
>> the string and compare it against the whitespace-signifying 
>> bytes, while the variable-length code has to first load and 
>> parse potentially 4 bytes before it can compare, because it 
>> has to go through the state machine that you linked to above.  
>> Obviously the constant-width encoding will be faster.  Did I 
>> really need to explain this?
>
> It looks like you've missed an important property of UTF-8: 
> lower ASCII remains encoded the same, and UTF-8 code units 
> encoding non-ASCII characters cannot be confused with ASCII 
> characters. Code that does not need Unicode code points can 
> treat UTF-8 strings as ASCII strings, and does not need to 
> decode each character individually - because a 0x20 byte will 
> mean "space" regardless of context. That's why a function that 
> splits a string by ASCII whitespace does NOT need to perform 
> UTF-8 decoding.
>
> I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not 
necessary to decode every UTF-8 character if you are simply 
comparing against the ASCII space character.  My mixup was 
because I didn't know whether each language uses its own space 
character in UTF-8 or whether they all reuse the ASCII space 
character; apparently it's the latter.
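
To spell the property out: every byte of a multi-byte UTF-8 
sequence has its high bit set, so a 0x20 byte can only ever be a 
real space.  Here is a quick sketch of my own in D (splitOnSpace 
is a made-up name, and I haven't compiled this):

// Split on ASCII space without any UTF-8 decoding.  Safe because
// bytes inside multi-byte sequences are all >= 0x80 and so never
// equal ' ' (0x20).
const(char)[][] splitOnSpace(const(char)[] s)
{
    const(char)[][] parts;
    size_t start = 0;
    foreach (i, b; s)        // iterates by code unit, no decoding
    {
        if (b == ' ')
        {
            parts ~= s[start .. i];
            start = i + 1;
        }
    }
    parts ~= s[start .. $];
    return parts;
}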

However, my overall point stands.  For non-ASCII text you still 
have to check 2-4 times as many bytes if you do it the way Peter 
suggests, as opposed to a single-byte encoding.  There is a 
shortcut: check the first byte to see whether it's ASCII, and if 
it isn't, skip the right number of ensuing bytes in that 
character's encoding (sketched below).  But at that point you 
have begun partially decoding the UTF-8 stream, which you claimed 
wasn't necessary and which will degrade performance anyway.
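
To be concrete, this is roughly the shortcut I mean, as a D 
sketch of my own (countChars is a made-up name; it assumes 
well-formed UTF-8 and is untested):

// Classify the lead byte and skip the continuation bytes without
// decoding the code point.
size_t countChars(const(char)[] s)
{
    size_t n = 0, i = 0;
    while (i < s.length)
    {
        immutable ubyte b = cast(ubyte) s[i];
        if (b < 0x80)      i += 1; // 0xxxxxxx: ASCII, one byte
        else if (b < 0xE0) i += 2; // 110xxxxx: two-byte sequence
        else if (b < 0xF0) i += 3; // 1110xxxx: three-byte sequence
        else               i += 4; // 11110xxx: four-byte sequence
        ++n;
    }
    return n;
}

Note that this touches only the lead byte of each character, but 
it is still more work per character than a fixed-width 
single-byte encoding, which is my point.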

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
> I suggest you read up on UTF-8. You really don't understand it. 
> There is no need to decode, you just treat the UTF-8 string as 
> if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding 
UTF-8.

> This code will count all spaces in a string whether it is 
> encoded as ASCII or UTF-8:
>
> int countSpaces(const(char)* c)
> {
>     int n = 0;
>     for (; *c; ++c)     // advance byte by byte, no decoding
>         if (*c == ' ')  // 0x20 never appears inside a multi-byte sequence
>             ++n;
>     return n;
> }
>
> I repeat: there is no need to decode. Please read up on UTF-8. 
> You do not understand it. The reason you don't need to decode 
> is because UTF-8 is self-synchronising.
Not quite.  The reason you don't need to decode is the 
particular encoding scheme chosen for UTF-8, a side effect of its 
backwards compatibility with ASCII and its reuse of the ASCII 
space character; it has nothing to do with whether the encoding 
is self-synchronizing.
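
As I understand it, what self-synchronization actually buys you 
is something different: continuation bytes always match the bit 
pattern 10xxxxxx, so you can find a code point boundary starting 
from an arbitrary byte offset.  A D sketch of my own 
(toCodePointStart is a made-up name, untested):

// Step backwards over continuation bytes (0x80-0xBF) until we
// reach a lead byte or an ASCII byte.
size_t toCodePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}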

> The code above tests for spaces only, but it works the same 
> when searching for any substring or single character. It is no 
> slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character"; 
it works the same only for single ASCII characters.

Of course it's slower than a fixed-width single-byte encoding.  
You have to check every single byte of a non-ASCII character in 
UTF-8, whereas a single-byte encoding only has to check one byte 
per character.  There is a shortcut if you partially decode the 
first byte in UTF-8, as sketched above, but you seem dead-set 
against decoding. ;)

> Again, I urge you, please read up on UTF-8. It is very well 
> designed.
I disagree.  It is very badly designed, though its ASCII 
compatibility does hack in some shortcuts like this one, which 
still don't salvage its performance.

