Why UTF-8/16 character encodings?
Vladimir Panteleev
vladimir at thecybershadow.net
Sat May 25 07:18:31 PDT 2013
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
> wrote:
>> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>>> If you want to split a string by ASCII whitespace (newlines,
>>>> tabs and spaces), it makes no difference whether the string
>>>> is in ASCII or UTF-8 - the code will behave correctly in
>>>> either case, variable-width-encodings regardless.
>>> Except that a variable-width encoding will take longer to
>>> decode while splitting, when compared to a single-byte
>>> encoding.
>>
>> No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly? Both encodings have
> to check every single character to test for whitespace, but the
> single-byte encoding simply has to load each byte in the string
> and compare it against the whitespace-signifying bytes, while
> the variable-length code has to first load and parse
> potentially 4 bytes before it can compare, because it has to go
> through the state machine that you linked to above. Obviously
> the constant-width encoding will be faster. Did I really need
> to explain this?
It looks like you've missed an important property of UTF-8: ASCII
characters (0x00-0x7F) keep their single-byte encoding, and every
code unit of a multi-byte sequence has its high bit set (0x80 or
above), so it cannot be confused with an ASCII character.
Code that does not need Unicode code points can treat UTF-8
strings as ASCII strings, and does not need to decode each
character individually, because a 0x20 byte will mean "space"
regardless of context. That's why a function that splits a string
by ASCII whitespace does NOT need to perform UTF-8 decoding.
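To make this concrete, here is a minimal sketch in Python (chosen
only for illustration; the function name split_ascii_whitespace and
the sample text are my own, not from any library). It scans the raw
UTF-8 bytes one at a time and compares each against the ASCII
whitespace bytes, exactly as it would for a single-byte encoding,
and never decodes a code point:

```python
# Hypothetical helper for illustration: splits raw bytes on ASCII
# whitespace by comparing single bytes. No UTF-8 decoding happens here.
ASCII_WHITESPACE = b" \t\r\n"

def split_ascii_whitespace(data: bytes) -> list:
    pieces = []
    start = None  # index where the current piece began, if any
    for i, byte in enumerate(data):
        if byte in ASCII_WHITESPACE:
            # A whitespace byte ends the current piece (if one is open).
            if start is not None:
                pieces.append(data[start:i])
                start = None
        elif start is None:
            # First non-whitespace byte opens a new piece.
            start = i
    if start is not None:
        pieces.append(data[start:])
    return pieces

# Sample text (an assumption for the demo) mixing ASCII, Latin-1
# range and Cyrillic characters, separated by ASCII whitespace only.
text = "naïve café\tпривет\nмир"
pieces = split_ascii_whitespace(text.encode("utf-8"))

# Decoding happens only afterwards, to show the pieces are intact.
decoded = [p.decode("utf-8") for p in pieces]
```

Every piece decodes cleanly because no byte of a multi-byte
sequence can equal 0x20, 0x09, 0x0D or 0x0A, so a split can never
land in the middle of a character.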
I hope this clears up the misunderstanding :)