Why UTF-8/16 character encodings?
Vladimir Panteleev
vladimir at thecybershadow.net
Sat May 25 01:58:56 PDT 2013
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
>> This is more a problem with the algorithms taking the easy way
>> than a problem with UTF-8. You can do all the string
>> algorithms, including regex, by working with the UTF-8
>> directly rather than converting to UTF-32. Then the algorithms
>> work at full speed.
> I call BS on this. There's no way working on a variable-width
> encoding can be as "full speed" as a constant-width encoding.
> Perhaps you mean that the slowdown is minimal, but I doubt that
> also.
For the record, I have noticed that programmers (myself included) who
have an incomplete understanding of Unicode / UTF tend to exaggerate
this point, and sometimes needlessly assume that their code needs to
operate on individual characters (code points) when in fact it does
not: the code will work just fine on UTF-8 if it is written as though
it were handling ASCII. The example Walter quoted (regex, assuming
you don't want Unicode ranges or case-insensitivity) is one such
case.
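
To make that concrete, here is a minimal sketch of a substring search that never decodes anything. It is in Python only for brevity, and find_all is a made-up helper name, not something from a library. It relies on one property of UTF-8: a valid encoded needle can only match at a code point boundary in valid encoded text, so a plain byte search gives correct results.

```python
def find_all(haystack, needle):
    """Return the byte offset of every occurrence of needle in haystack.

    Both arguments are UTF-8 encoded bytes; nothing is decoded here.
    """
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


# Escapes are used so the literals are unambiguously precomposed.
source = "na\u00efve caf\u00e9, d\u00e9j\u00e0 vu, caf\u00e9"
text = source.encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

hits = find_all(text, needle)

# Every hit is a code point boundary, so the prefix before it decodes cleanly
# and its length is the corresponding code point index.
char_indexes = [len(text[:h].decode("utf-8")) for h in hits]
```

The search loop is exactly what you would write for ASCII. Decoding only happens at the end, to show that the byte offsets map back to the expected character positions.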
Another thing I noticed: sometimes when you think you really need
to operate on individual characters (and that your code will not
be correct unless you do that), the assumption turns out to be
incorrect because of combining characters in Unicode. Two of
the often-quoted use cases for working on individual code points
are calculating the string width (assuming a fixed-width font)
and slicing the string. Both of these will break with combining
characters if those are not accounted for. I believe the proper
way to approach such tasks is to implement the respective Unicode
algorithms, which I believe are non-trivial; relative to their
cost, the overhead of working with a variable-width encoding is
acceptable.
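
A small sketch of how both use cases go wrong even with one array element per code point. It is in Python for brevity, and count_base_code_points is a hypothetical helper. It is deliberately not one of the proper Unicode algorithms (it ignores grapheme cluster rules and wide characters); it only shows that code points and displayed characters are different things.

```python
import unicodedata


def count_base_code_points(s):
    """Rough width estimate: count code points that are not combining marks.

    Illustration only; a real implementation would follow the Unicode
    segmentation and width rules instead.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render as the same four characters, but the code point counts differ.
naive_widths = (len(composed), len(decomposed))

# Slicing by code point index silently drops the accent.
sliced = decomposed[:4]

# Skipping combining marks gives the same estimate for both forms.
adjusted_widths = (count_base_code_points(composed),
                   count_base_code_points(decomposed))
```

So a constant-width encoding does not by itself make width calculation or slicing correct. The combining mark has to be handled either way.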
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, outweigh
all the benefits of UTF-8?
Also, I don't think this has been posted in this thread. Not sure
if it answers your points, though:
http://www.utf8everywhere.org/
And here's a simple and correct UTF-8 decoder:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
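
For readers who just want to see how little work decoding involves, here is a rough sketch of a strict decoder. To be clear, this is not the table-driven DFA from the link above; it is a plain branch-based version, in Python for brevity, and decode_utf8 is a name made up for this example. It rejects overlong forms, surrogates, and values above U+10FFFF.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points.

    Raises ValueError on malformed input.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Work out the payload bits, the number of continuation bytes,
        # and the smallest code point that may use this sequence length.
        if b0 < 0x80:
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            # Stray continuation byte, or a lead byte that is never valid.
            raise ValueError("invalid lead byte at offset %d" % i)

        if i + need >= n:
            raise ValueError("truncated sequence at offset %d" % i)

        for j in range(1, need + 1):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % (i + j))
            cp = (cp << 6) | (b & 0x3F)

        if cp < lowest:
            raise ValueError("overlong encoding at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point out of range at offset %d" % i)

        out.append(cp)
        i += 1 + need
    return out


def rejects(data):
    """Return True if decode_utf8 refuses the given bytes."""
    try:
        decode_utf8(data)
    except ValueError:
        return True
    return False
```

For example, decode_utf8("h\u00e9".encode("utf-8")) returns [0x68, 0xE9], and rejects(b"\xc0\xaf") is True because that is an overlong encoding of "/".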