Why UTF-8/16 character encodings?

Peter Alexander peter.alexander.au at gmail.com
Sat May 25 07:16:20 PDT 2013


On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
> wrote:
>> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>>> If you want to split a string by ASCII whitespace (newlines, 
>>>> tabs and spaces), it makes no difference whether the string 
>>>> is in ASCII or UTF-8 - the code will behave correctly in 
>>>> either case, variable-width-encodings regardless.
>>> Except that a variable-width encoding will take longer to 
>>> decode while splitting, when compared to a single-byte 
>>> encoding.
>>
>> No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have 
> to check every single character to test for whitespace, but the 
> single-byte encoding simply has to load each byte in the string 
> and compare it against the whitespace-signifying bytes, while 
> the variable-length code has to first load and parse 
> potentially 4 bytes before it can compare, because it has to go 
> through the state machine that you linked to above.  Obviously 
> the constant-width encoding will be faster.  Did I really need 
> to explain this?

I suggest you read up on UTF-8. You really don't understand it. 
There is no need to decode, you just treat the UTF-8 string as if 
it is an ASCII string.

This code will count all spaces in a string whether it is encoded 
as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
     int n = 0;
     // Walk the NUL-terminated string one byte at a time; no decoding.
     for (; *c; ++c)
         if (*c == ' ')
             ++n;
     return n;
}

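I repeat: there is no need to decode. Please read up on UTF-8. 
You do not understand it. The reason you don't need to decode is 
that UTF-8 is self-synchronising: every byte of a multi-byte 
sequence has its high bit set, so an ASCII byte such as ' ' can 
never occur inside one.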
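
A minimal sketch of the property this relies on, assuming valid 
UTF-8 input. The helper name multibyteBytesAreNonAscii is made up 
for illustration; std.utf.stride is the Phobos function that 
returns the byte length of the sequence starting at a given index. 
It checks that no byte inside a multi-byte sequence falls in the 
ASCII range, which is why a byte-wise comparison against ' ' cannot 
give a false match:

```d
import std.utf : stride;

// Hypothetical helper (not from the original post): returns true if
// every byte that belongs to a multi-byte UTF-8 sequence in s has its
// high bit set, i.e. is >= 0x80 and so cannot be mistaken for ASCII.
bool multibyteBytesAreNonAscii(string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        // Byte length of the sequence that starts at index i.
        immutable size_t len = stride(s, i);
        if (len > 1)
        {
            foreach (j; i .. i + len)
                if ((cast(ubyte) s[j]) < 0x80)
                    return false;
        }
        i += len;
    }
    return true;
}
```

Note that the check itself uses stride only to find the sequence 
boundaries; the byte-wise scan in countSpaces needs no such step.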

The code above tests for spaces only, but it works the same when 
searching for any substring or single character. It is no slower 
than fixed-width encoding for these operations.
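
The substring case can be sketched the same way. This is only an 
illustration, assuming both strings are valid UTF-8; findBytes is a 
hypothetical name, and the naive loop stands in for whatever search 
routine you would really use. It compares raw bytes and never 
decodes a code point:

```d
// Hypothetical helper (not from the original post): byte-wise search.
// Returns the byte offset of the first occurrence of needle in
// haystack, or -1 if there is none. No UTF-8 decoding is performed.
ptrdiff_t findBytes(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Slice comparison checks the code units (bytes) one by one.
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

Because a valid UTF-8 needle starts with either an ASCII byte or a 
lead byte, and neither can appear in the middle of another 
character's sequence, a byte-wise match always starts on a character 
boundary. The returned offset is in bytes, not characters.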

Again, I urge you, please read up on UTF-8. It is very well 
designed.

