Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 07:58:01 PDT 2013
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev
wrote:
> On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
>> Are you sure _you_ understand it properly? Both encodings
>> have to check every single character to test for whitespace,
>> but the single-byte encoding simply has to load each byte in
>> the string and compare it against the whitespace-signifying
>> bytes, while the variable-length code has to first load and
>> parse potentially 4 bytes before it can compare, because it
>> has to go through the state machine that you linked to above.
>> Obviously the constant-width encoding will be faster. Did I
>> really need to explain this?
>
> It looks like you've missed an important property of UTF-8:
> lower ASCII remains encoded the same, and UTF-8 code units
> encoding non-ASCII characters cannot be confused with ASCII
> characters. Code that does not need Unicode code points can
> treat UTF-8 strings as ASCII strings, and does not need to
> decode each character individually - because a 0x20 byte will
> mean "space" regardless of context. That's why a function that
> splits a string by ASCII whitespace does NOT need to perform
> UTF-8 decoding.
>
> I hope this clears up the misunderstanding :)
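Vladimir's quoted point can be sketched in C (my illustration, not code from the thread; the function name is made up): tokens separated by the ASCII space byte 0x20 can be counted without any UTF-8 decoding, because no byte of a multi-byte UTF-8 sequence ever equals 0x20.

```c
#include <stddef.h>

/* Illustrative sketch: count space-separated tokens in a UTF-8
   string byte-by-byte, with no decoding.  Multi-byte sequences pass
   through untouched since none of their bytes equals ' ' (0x20). */
static size_t count_tokens(const char *s)
{
    size_t n = 0;
    int in_token = 0;
    for (; *s; ++s) {
        if (*s == ' ') {
            in_token = 0;        /* space ends any current token */
        } else if (!in_token) {
            in_token = 1;        /* first byte of a new token */
            ++n;
        }
    }
    return n;
}
```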
OK, you got me with this particular special case: it is not
necessary to decode every UTF-8 character if you are simply
comparing against ASCII space characters. My mixup was because I
was unaware whether every language uses its own space character in
UTF-8 or whether they all reuse the ASCII space character;
apparently it's the latter.
However, my overall point stands. You still have to check 2-4
times as many bytes if you do it the way Peter suggests, as
opposed to a single-byte encoding. There is a shortcut: you
could also check the first byte to see if it's ASCII or not and
then skip the right number of ensuing bytes in a character's
encoding if it isn't ASCII, but at that point you have begun
partially decoding the UTF-8 encoding, which you claimed wasn't
necessary and which will degrade performance anyway.
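The skip shortcut described above might look like this in C (my sketch, not code from the thread; it assumes well-formed UTF-8 input): the lead byte alone tells you the sequence length, so you can jump over continuation bytes without decoding the code point.

```c
#include <stddef.h>

/* Illustrative sketch of the lead-byte skip shortcut: count ASCII
   spaces in a well-formed UTF-8 string, jumping over the
   continuation bytes of non-ASCII characters instead of decoding
   them.  Malformed input is not handled here. */
static size_t count_spaces_skipping(const unsigned char *s)
{
    size_t n = 0;
    while (*s) {
        if (*s < 0x80) {               /* ASCII lead byte */
            if (*s == ' ')
                ++n;
            ++s;
        } else if ((*s & 0xE0) == 0xC0) {
            s += 2;                     /* 2-byte sequence */
        } else if ((*s & 0xF0) == 0xE0) {
            s += 3;                     /* 3-byte sequence */
        } else {
            s += 4;                     /* 4-byte sequence */
        }
    }
    return n;
}
```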
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
> I suggest you read up on UTF-8. You really don't understand it.
> There is no need to decode, you just treat the UTF-8 string as
> if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding
UTF-8.
> This code will count all spaces in a string whether it is
> encoded as ASCII or UTF-8:
>
> int countSpaces(const(char)* c)
> {
>     int n = 0;
>     while (*c)
>         if (*c++ == ' ')
>             ++n;
>     return n;
> }
>
> I repeat: there is no need to decode. Please read up on UTF-8.
> You do not understand it. The reason you don't need to decode
> is because UTF-8 is self-synchronising.
Not quite. The reason you don't need to decode is the particular
encoding scheme chosen for UTF-8, a side effect of ASCII backwards
compatibility and reusing the ASCII space character; it has nothing
to do with whether it's self-synchronizing or not.
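The byte-range property being argued over can be checked directly (my illustration, not code from the thread): every byte of a multi-byte UTF-8 sequence has its high bit set (lead bytes are 11xxxxxx, continuation bytes 10xxxxxx), so none of them can collide with an ASCII byte like 0x20.

```c
#include <stddef.h>

/* Illustrative check: return 1 if any byte of the given UTF-8
   sequence falls in the ASCII range (high bit clear), 0 otherwise.
   For well-formed multi-byte sequences this always returns 0. */
static int has_ascii_collision(const unsigned char *seq, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if ((seq[i] & 0x80) == 0)   /* high bit clear => ASCII range */
            return 1;
    return 0;
}
```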
> The code above tests for spaces only, but it works the same
> when searching for any substring or single character. It is no
> slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character,"
it works the same for any single ASCII character.
Of course it's slower than a fixed-width single-byte encoding.
You have to check every single byte of a non-ASCII character in
UTF-8, whereas a single-byte encoding only has to check a single
byte per language character. There is a shortcut if you
partially decode the first byte in UTF-8, mentioned above, but
you seem dead-set against decoding. ;)
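The cost claim above can be made concrete (my sketch, not from the thread): a byte-wise scan touches every code unit, while the number of characters only counts non-continuation bytes, so non-ASCII text costs 2-4 byte checks per character in UTF-8 versus 1 in a single-byte encoding.

```c
#include <stddef.h>

/* Illustrative comparison of code units scanned vs. characters:
   count_bytes does one check per UTF-8 code unit; count_chars counts
   only non-continuation bytes, i.e. one per character. */
static size_t count_bytes(const char *s)
{
    size_t n = 0;
    while (s[n]) ++n;               /* one check per code unit */
    return n;
}

static size_t count_chars(const unsigned char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if ((*s & 0xC0) != 0x80)    /* skip continuation bytes */
            ++n;
    return n;
}
```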
> Again, I urge you, please read up on UTF-8. It is very well
> designed.
I disagree. It is very badly designed, but the ASCII
compatibility does hack in some shortcuts like this, which still
don't save its performance.
More information about the Digitalmars-d
mailing list