Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 06:47:41 PDT 2013


On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
wrote:
> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>> If you want to split a string by ASCII whitespace (newlines, 
>>> tabs and spaces), it makes no difference whether the string 
>>> is in ASCII or UTF-8 - the code will behave correctly in 
>>> either case, regardless of the variable-width encoding.
>> Except that a variable-width encoding will take longer to 
>> decode while splitting, when compared to a single-byte 
>> encoding.
>
> No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly?  Both encodings have 
to check every character to test for whitespace, but the 
single-byte encoding simply loads each byte of the string and 
compares it against the whitespace bytes, while the 
variable-width decoder may have to load and parse up to 4 bytes 
before it can compare, because it has to run them through the 
state machine you linked to above.  Obviously the constant-width 
encoding will be faster.  Did I really need to explain this?
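
For concreteness, here is a minimal D sketch of the byte-level 
whitespace split being argued about (the function names are 
purely illustrative, not from this thread).  It compares raw 
UTF-8 code units without decoding them, which is the approach 
Vladimir is describing:

import std.stdio;

// ASCII whitespace test on a single code unit.  UTF-8 bytes that
// belong to multi-byte sequences are always >= 0x80, so they can
// never collide with these values.
bool isAsciiWhitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// One pass over the bytes of the string, no code-point decoding.
string[] splitOnAsciiWhitespace(string s)
{
    string[] words;
    size_t start = 0;
    foreach (i; 0 .. s.length)
    {
        if (isAsciiWhitespace(s[i]))
        {
            if (i > start)
                words ~= s[start .. i];
            start = i + 1;
        }
    }
    if (start < s.length)
        words ~= s[start .. $];
    return words;
}

void main()
{
    // The same loop handles pure ASCII and UTF-8 text containing
    // multi-byte characters.
    writeln(splitOnAsciiWhitespace("naïve\tcafé owner"));
}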

On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu 
wrote:
> On 5/25/13 3:33 AM, Joakim wrote:
>> On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
>>> This is more a problem with the algorithms taking the easy 
>>> way than a
>>> problem with UTF-8. You can do all the string algorithms, 
>>> including
>>> regex, by working with the UTF-8 directly rather than 
>>> converting to
>>> UTF-32. Then the algorithms work at full speed.
>> I call BS on this. There's no way working on a variable-width 
>> encoding
>> can be as "full speed" as a constant-width encoding. Perhaps 
>> you mean
>> that the slowdown is minimal, but I doubt that also.
>
> You mentioned this a couple of times, and I wonder what makes 
> you so sure. On contemporary architectures small is fast and 
> large is slow; betting on replacing larger data with more 
> computation is quite often a win.
When has small ever been slow and large fast? ;) I'm talking 
about replacing larger data _and_ more computation, i.e. UTF-8, 
with smaller data and less computation, i.e. single-byte 
encodings, so it is an unmitigated win in that regard. :)
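
To put rough numbers on the size side of that tradeoff, a small 
illustrative D sketch (the sample text is arbitrary) that prints 
how many bytes the same string occupies in UTF-8, UTF-16 and 
UTF-32:

import std.stdio;
import std.utf : toUTF16, toUTF32;

void main()
{
    // Mostly-ASCII text with two two-byte UTF-8 characters.
    string text = "naïve café";

    // Code units are 1, 2 and 4 bytes wide respectively.
    writeln("UTF-8 : ", text.length, " bytes");
    writeln("UTF-16: ", text.toUTF16.length * 2, " bytes");
    writeln("UTF-32: ", text.toUTF32.length * 4, " bytes");
}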

