The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 09:54:03 PDT 2016


On Tuesday, May 31, 2016 07:17:03 default0 via Digitalmars-d wrote:
> Thinking about this a bit more - what algorithms are actually
> correct when implemented at the level of code units?
> Off the top of my head, I can only really think of copying and
> hashing, since you want to do those at the byte level anyway.
> I would also think that if you know your strings are normalized
> in the same normalization form (for example, because they come
> from the same normalized source), you can check two strings for
> equality at the code unit level, but my understanding of Unicode
> is still quite lacking, so I'm not sure about that.

Equality does not require decoding. Similarly, functions like find don't
either. Something like filter generally would, but filtering a string
character by character isn't a particularly normal operation anyway; you'd
probably want to operate at the word level or higher in that case.
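
To illustrate, here's a minimal D sketch (not from the original post; the
sample strings are invented, though std.string.representation and
std.algorithm's find are real Phobos facilities): both equality and
substring search work directly on UTF-8 code units, because UTF-8 is
self-synchronizing, so one character's encoding can never match in the
middle of another's.

    import std.algorithm.searching : find;
    import std.string : representation;
    import std.stdio : writeln;

    void main()
    {
        string a = "héllo"; // 'é' encodes as two UTF-8 code units
        string b = "héllo";

        // Equality compares the underlying code units directly, with no
        // decoding, and is correct as long as both strings are in the
        // same encoding and normalization form.
        assert(a == b);

        // Searching the raw ubyte representation skips autodecoding
        // entirely; the result is still correct thanks to UTF-8's
        // self-synchronizing design.
        auto hit = a.representation.find("llo".representation);
        assert(hit.length == 3);
        writeln("match at code unit index ", a.length - hit.length);
    }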

To make matters worse for autodecoding, functions like find or splitter are
frequently used to look for ASCII delimiters, even when the strings
themselves contain Unicode characters. So, even if decoding were necessary
when looking for a Unicode character, it's utterly wasteful when the
character you're looking for is ASCII. But searching generally does not
require decoding so long as the same character is always encoded the same
way. So, Unicode normalization _can_ be a problem, but it's a problem at the
code point level as well as the code unit level (since normalization deals
with which code points make up a grapheme, and in what order, when multiple
code points form a single grapheme). You'd have to go to the grapheme level
to avoid that problem entirely. And that's why, at least some of the time,
string-processing code is going to need to normalize its strings before
doing searches. But the searches themselves can then operate at the code
unit level.
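
As a small sketch of that last point (again not from the post; the sample
strings are invented, while std.uni.normalize and its NFC form are real
Phobos APIs): normalize once up front, then compare and split at the code
unit level.

    import std.algorithm.iteration : splitter;
    import std.string : representation;
    import std.stdio : writeln;
    import std.uni : normalize, NFC;

    void main()
    {
        // "é" can be one code point (U+00E9) or two (e plus combining
        // U+0301), so the two spellings differ at both the code unit
        // and the code point level.
        string composed   = "caf\u00E9,bar";
        string decomposed = "cafe\u0301,bar";
        assert(composed != decomposed);

        // After normalizing both to the same form (NFC here), a plain
        // code unit comparison suffices.
        assert(normalize!NFC(composed) == normalize!NFC(decomposed));

        // Splitting on an ASCII delimiter needs no decoding at all,
        // even though the string contains multi-byte characters.
        foreach (piece; composed.representation.splitter(cast(ubyte) ','))
            writeln(cast(string) piece);
    }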

- Jonathan M Davis


