Unicode Normalization (and graphemes and locales)

Fri Jun 3 05:06:33 PDT 2016

On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
> But consider the case where you are searching the string: "cassé"
>
> for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
> succeed when you should fail! However, it may be that you actually want
> to find specifically any code points with 'e', including ones with
> combining characters. This is why we really need more discretion from
> Phobos, and less hand-holding.
>
> There are certainly searches that will be correct. For example,
> searching for newline should always work in code-point space. Actually,
> what happens when you use a combining character on newline? Is it an
> invalid unicode sequence? Does it matter? :)
>
> A nice function to determine whether code points or graphemes are
> required for comparison given a needle may be useful.

Well, if you know that you're dealing with a grapheme that has that problem,
you can just iterate by graphemes rather than code units like find would
normally. Otherwise, what you probably end up doing is searching for the
needle, verifying that the resultant range starts with the right grapheme
and not just the right code point, and then calling find again to search
further into the range if it was just the right code point.
Regardless, I don't see how find is really going to solve this for you
unless it either assumes that you want to deal with graphemes and converts
everything to graphemes, or it assumes that you want graphemes and converts
to graphemes when it finds a possible match and then only considers it a
match if it's a match at the grapheme level. The latter wouldn't be expensive
in most cases, but it _would_ be assuming that you want to operate on
graphemes even though you have a range of code units or code points, and
that's not necessarily the case. You might actually want to find the code
units or code points in question and not care about graphemes (much as
that's not likely to be typical). That could still be acceptable if we
decided that you needed to use a range of ubyte/ushort/uint rather than a
range of char/wchar/dchar in the case where you actually want to look for
code units or code points rather than searching for a grapheme within a
range of code units or code points.
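
To make the "search, verify, and search again" approach concrete, here is a
rough sketch. It is in Python rather than D purely for illustration, and it
is not how find in Phobos works. The name find_whole is made up, and the
boundary check is a deliberate simplification: it only looks at whether the
next code point has a nonzero canonical combining class, which is not full
grapheme segmentation (it ignores ZWJ sequences, variation selectors, Hangul
jamo, and so on).

```python
import unicodedata


def find_whole(haystack: str, needle: str, start: int = 0) -> int:
    """Hypothetical helper: index of the first match of needle that is not
    immediately followed by a combining mark, or -1 if there is none."""
    i = haystack.find(needle, start)
    while i != -1:
        end = i + len(needle)
        # Simplified boundary test (an assumption, not real grapheme
        # segmentation): accept the match only if nothing combines onto it.
        if end == len(haystack) or unicodedata.combining(haystack[end]) == 0:
            return i
        # Only the leading code point(s) matched, so keep searching.
        i = haystack.find(needle, i + 1)
    return -1


decomposed = "casse\u0301 et le"       # the first 'e' carries U+0301
print(decomposed.find("e"))            # 4: code-point hit inside the grapheme
print(find_whole(decomposed, "e"))     # 7: first stand-alone 'e'
```

The cost is one extra check per candidate match rather than segmenting the
whole haystack up front, which is why this shouldn't be expensive in most
cases.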

But even if we don't take graphemes into account at all with a function like
find, encoding the needle and searching with code units shouldn't be a
problem. It's just that the programmer needs to be aware that they might end
up finding only a partial grapheme if they're not careful. The alternative
is to not allow searching for needles of one character type inside a
haystack of another character type and force the programmer to do the
encoding rather than having find do it. And that wouldn't be the end of the
world, but it wouldn't be as user-friendly, and I'm not sure that it would
be a great idea given that we currently can do those comparisons thanks to
auto-decoding, and we'd effectively be losing functionality if it didn't
work with other ranges of characters (or with strings if/once auto-decoding
is killed off).
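
As a sketch of what "encode the needle and search with code units" amounts
to (again in Python for brevity, with bytes standing in for a range of UTF-8
code units, and with find_code_units being a made-up name rather than any
Phobos function):

```python
def find_code_units(haystack_utf8: bytes, needle: str) -> int:
    """Hypothetical helper: encode the needle once, then compare raw UTF-8
    code units without decoding the haystack. Returns a byte offset or -1."""
    return haystack_utf8.find(needle.encode("utf-8"))


# Precomposed U+00E9 first, then 'e' followed by combining U+0301.
haystack = "cass\u00e9 = casse\u0301".encode("utf-8")

print(find_code_units(haystack, "\u00e9"))    # 4: the precomposed form
print(find_code_units(haystack, "e\u0301"))   # 13: the decomposed form
print(find_code_units(haystack, "e"))         # 13: only part of a grapheme
```

Because UTF-8 is self-synchronizing, a match for a validly encoded needle in
a valid haystack always starts on a code point boundary, so nothing is lost
relative to comparing code points. The last line shows the caveat: the hit at
offset 13 is just the base letter of a longer grapheme, and it is up to the
caller to check for that if it matters.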

Ultimately, we need to make sure that we don't prevent the programmer from
handling Unicode correctly or make it more difficult in an attempt to make
it easier for the programmer (which is essentially what auto-decoding does),
but that doesn't mean that there aren't cases where we can bake in some
Unicode handling into functions to increase efficiency without losing out on
correctness. And making find encode the needle so that it can compare at the
code unit level doesn't lose out on correctness. It just isn't sufficient
for full correctness on its own.

- Jonathan M Davis