Why UTF-8/16 character encodings?

Vladimir Panteleev vladimir at thecybershadow.net
Sat May 25 01:58:56 PDT 2013


On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
>> This is more a problem with the algorithms taking the easy way 
>> than a problem with UTF-8. You can do all the string 
>> algorithms, including regex, by working with the UTF-8 
>> directly rather than converting to UTF-32. Then the algorithms 
>> work at full speed.
> I call BS on this.  There's no way working on a variable-width 
> encoding can be as "full speed" as a constant-width encoding.  
> Perhaps you mean that the slowdown is minimal, but I doubt that 
> also.

For the record, I have noticed that programmers (myself included) 
who have an incomplete understanding of Unicode / UTF tend to 
exaggerate this point, and sometimes needlessly assume that their 
code needs to operate on individual characters (code points) when 
in fact it does not, and the code would work just fine on UTF-8 as 
if it had been written to handle ASCII. The example Walter quoted 
(regex - assuming you don't want Unicode ranges or 
case-insensitivity) is one such case.
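
As a rough illustration of the point above (Python, not taken from 
the thread; the helper name find_utf8 and the sample strings are 
made up for this sketch), a plain byte-wise substring search already 
works on UTF-8 without decoding anything, assuming both inputs are 
valid UTF-8:

```python
# Substring search directly on UTF-8 bytes, with no decoding step.
# find_utf8 is a hypothetical helper name used only for this sketch.

def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle in haystack, or -1 if absent."""
    # bytes.find compares raw bytes, exactly as ASCII-era code would.
    return haystack.find(needle)

haystack = "naïve café, 日本語 text".encode("utf-8")
needle = "café".encode("utf-8")

offset = find_utf8(haystack, needle)

# Lead bytes and continuation bytes (10xxxxxx) occupy disjoint ranges,
# so a match of a valid needle cannot begin in the middle of a code point.
starts_on_boundary = (haystack[offset] & 0xC0) != 0x80

# Slicing at the returned byte offset therefore yields valid UTF-8.
matched = haystack[offset:offset + len(needle)].decode("utf-8")
```

The result is a byte offset, not a code point index, which is fine 
as long as it is only used to slice the same byte string.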

Another thing I noticed: sometimes when you think you really need 
to operate on individual characters (and that your code will not 
be correct unless you do), that assumption turns out to be wrong 
because of combining characters in Unicode. Two of the 
often-quoted use cases for working on individual code points are 
calculating the string width (assuming a fixed-width font) and 
slicing the string. Both will break with combining characters if 
those are not accounted for. I believe the proper way to approach 
such tasks is to implement the respective Unicode algorithms, 
which I believe are non-trivial, and next to which the overhead 
of working with a variable-width encoding is acceptable.
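
A minimal sketch of how code point counting and slicing go wrong 
with combining characters (Python, not from the thread; 
approx_width is a hypothetical helper, and skipping combining marks 
is only a crude approximation, not the Unicode text segmentation or 
width algorithms):

```python
import unicodedata

def approx_width(s: str) -> int:
    """Count code points that are not combining marks.

    Crude approximation only: it ignores wide East Asian characters,
    zero-width joiners and other cases the real algorithms handle.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render as one character, but the code point counts differ.
composed_len = len(composed)
decomposed_len = len(decomposed)

# Slicing by code point splits the base letter from its accent.
sliced = decomposed[:1]

# Skipping combining marks gives the same width for both forms.
widths = (approx_width(composed), approx_width(decomposed))
```

Even with fixed-width code points, the naive count and the naive 
slice are wrong for the decomposed form, so a constant-width 
encoding does not remove the need for the Unicode-aware logic.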

Can you post some specific cases where the benefits of a 
constant-width encoding are obvious and, in your opinion, 
outweigh all the benefits of UTF-8?

Also, I don't think this has been posted in this thread yet. I am 
not sure whether it answers your points, though:

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/


More information about the Digitalmars-d mailing list