Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Sat May 25 10:35:03 PDT 2013


25-May-2013 12:58, Vladimir Panteleev wrote:
> On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
>>> This is more a problem with the algorithms taking the easy way than a
>>> problem with UTF-8. You can do all the string algorithms, including
>>> regex, by working with the UTF-8 directly rather than converting to
>>> UTF-32. Then the algorithms work at full speed.
>> I call BS on this.  There's no way working on a variable-width
>> encoding can be as "full speed" as a constant-width encoding. Perhaps
>> you mean that the slowdown is minimal, but I doubt that also.
>
> For the record, I noticed that programmers (myself included) that had an
> incomplete understanding of Unicode / UTF exaggerate this point, and
> sometimes needlessly assume that their code needs to operate on
> individual characters (code points), when it is in fact not so - and
> that code will work just fine as if it was written to handle ASCII. The
> example Walter quoted (regex - assuming you don't want Unicode ranges or
> case-insensitivity) is one such case.

+1
BTW, regex even with Unicode ranges and case-insensitivity is doable, just 
not easy (yet).
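To make the ASCII case concrete — a small Python sketch (Python rather than D, and the pattern and input are made-up examples, not anything from std.regex): an ASCII-only regex can be matched directly against raw UTF-8 bytes, because every byte of a multi-byte UTF-8 sequence has its high bit set and therefore can never collide with an ASCII pattern byte.

```python
import re

# Hypothetical input: UTF-8 bytes containing non-ASCII characters.
data = "naïve café: id=42, id=7".encode("utf-8")

# ASCII-only pattern, matched directly against the bytes -- no decoding
# to code points needed, since multi-byte UTF-8 sequences contain no
# bytes in the ASCII range.
matches = re.findall(rb"id=(\d+)", data)
print(matches)  # [b'42', b'7']
```

This is exactly why ASCII-oriented code keeps working on UTF-8 input: the encoding is self-synchronizing with respect to ASCII bytes.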

> Another thing I noticed: sometimes when you think you really need to
> operate on individual characters (and that your code will not be correct
> unless you do that), the assumption will be incorrect due to the
> existence of combining characters in Unicode. Two of the often-quoted
> use cases of working on individual code points is calculating the string
> width (assuming a fixed-width font), and slicing the string - both of
> these will break with combining characters if those are not accounted
> for.  I believe the proper way to approach such tasks is to implement the
> respective Unicode algorithms for it, which I believe are non-trivial
> and for which the relative impact for the overhead of working with a
> variable-width encoding is acceptable.
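The slicing pitfall described above is easy to demonstrate — a hypothetical Python sketch (not from the original post):

```python
import unicodedata

# "é" written as base letter 'e' + U+0301 COMBINING ACUTE ACCENT:
s = "cafe\u0301"              # renders as "café"
print(len(s))                 # 5 code points, but 4 user-perceived characters

# Slicing by code-point index silently drops the accent:
print(s[:4])                  # "cafe"

# NFC normalization composes the pair into a single code point:
print(len(unicodedata.normalize("NFC", s)))  # 4
```

So even a "code point aware" slice is wrong here; only a grapheme-cluster-aware algorithm gets it right.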

Another plus one. Algorithms defined on a code-point basis are quite 
complex, so the benefit of not decoding won't be that large. The benefit 
of transparently special-casing ASCII in UTF-8 is far larger.
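A minimal sketch of that special-casing idea (Python, hypothetical — not how any particular D library implements it): a code-point iterator over UTF-8 bytes that handles the dominant ASCII case with no decoding work at all, falling back to full decoding only for multi-byte sequences.

```python
def code_points(buf: bytes):
    """Yield code points from valid UTF-8, with an ASCII fast path."""
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:                 # ASCII fast path: the byte IS the code point
            yield b
            i += 1
            continue
        # Slow path: decode a multi-byte sequence (input assumed valid).
        if b < 0xE0:
            n, cp = 2, b & 0x1F      # 2-byte sequence: 5 payload bits in the lead
        elif b < 0xF0:
            n, cp = 3, b & 0x0F      # 3-byte sequence: 4 payload bits
        else:
            n, cp = 4, b & 0x07      # 4-byte sequence: 3 payload bits
        for c in buf[i + 1 : i + n]:
            cp = (cp << 6) | (c & 0x3F)  # 6 payload bits per continuation byte
        yield cp
        i += n

print(list(code_points("aé€".encode("utf-8"))))
```

On mostly-ASCII text the loop spends nearly all its time in the branch that does no bit manipulation at all, which is where the practical speed of UTF-8 processing comes from.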

> Can you post some specific cases where the benefits of a constant-width
> encoding are obvious and, in your opinion, make constant-width encodings
> more useful than all the benefits of UTF-8?
>
> Also, I don't think this has been posted in this thread. Not sure if it
> answers your points, though:
>
> http://www.utf8everywhere.org/
>
> And here's a simple and correct UTF-8 decoder:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/


-- 
Dmitry Olshansky

