Why the hell doesn't foreach decode strings

Sat Oct 29 08:04:00 PDT 2011

On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:
> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
> > On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
> > <simen.kjaras at gmail.com> wrote:
> >> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
> >> <schveiguy at yahoo.com> wrote:
> >>> Plus, a combining character (such as an umlaut or accent) is part of
> >>> a character, but may be a separate code point.
> >> 
> >> If this is correct (and it is), then decoding to dchar is simply not
> >> enough.
> >> You seem to advocate decoding to graphemes, which is a whole different
> >> matter.
> > 
> > I am advocating that. And it's a matter of perception. D can say "we
> > only support code-point decoding" and what that means to a user is, "we
> > don't support language as you know it." Sure it's a part of unicode, but
> > it takes that extra piece to make it actually usable to people who
> > require unicode.
> > 
> > Even in English, fiancé has an accent. To say D supports unicode, but
> > then won't do a simple search on a file which contains a certain *valid*
> > encoding of that word is disingenuous to say the least.
> 
> Why doesn't that simple search work?
> 
> foreach (line; stdin.byLine()) {
>     if (line.canFind("fiancé")) {
>         writeln("There it is.");
>     }
> }

If the strings aren't normalized the same way, then the search might not find
fiancé. If they _are_ normalized the same way and fiancé is in there, except
that the é is actually modified by another code point after it (e.g. a
subscript 2 - not exactly likely in this case but certainly possible), then
the string would be found when it shouldn't be. The bigger problem, though, I
think, is searching for a string which is the same without the modifiers,
which would be fiance in this case. If the modifying code points come after
the code point that they modify, then find will think that it found the string
that you were looking for when it didn't.
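
As a rough sketch of those failure modes, assuming a Phobos release whose
std.uni provides normalize (canFindNormalized is a hypothetical helper, not
something in Phobos):

```d
import std.algorithm.searching : canFind;
import std.uni;

// Hypothetical helper: bring both sides to NFC before a code point search.
bool canFindNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string composed   = "fianc\u00E9";  // é as the single code point U+00E9
    string decomposed = "fiance\u0301"; // e followed by U+0301 (combining acute)

    // Normalized differently: the code point sequences differ, so no match.
    assert(!decomposed.canFind(composed));

    // Normalized the same way: the match is found.
    assert(canFindNormalized(decomposed, composed));

    // False positive: the unaccented word is a prefix of the decomposed form.
    assert(decomposed.canFind("fiance"));
}
```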

Once you're dealing with modifying code points, in the general case, you
_must_ operate on the grapheme level to ensure that you find exactly what
you're looking for and only what you're looking for. If we assume that all
strings are normalized the same way, pick the right normalization form, and
provide a function to normalize strings that way, then we could probably make
code point searches work 100% of the time. That assumes that there's a
normalized form in which all of the modifying code points come _before_ the
code point that they modify and that no modifying code point can be a
character on its own, but I'd have to study up on it more to be sure.
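
To make the grapheme-level idea concrete, here is a minimal sketch. It
assumes a Phobos release whose std.uni provides normalize and byGrapheme, and
canFindGraphemes is a hypothetical name of my own. It is a naive search for
illustration, not a tuned implementation:

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : canFind;
import std.array : array;
import std.uni;

// Hypothetical helper: true only if needle matches a run of whole graphemes
// in haystack. Both sides are normalized to NFC first.
bool canFindGraphemes(string haystack, string needle)
{
    auto h = normalize!NFC(haystack).byGrapheme.array;
    auto n = normalize!NFC(needle).byGrapheme.array;
    if (n.length == 0) return true;
    if (n.length > h.length) return false;

    foreach (i; 0 .. h.length - n.length + 1)
    {
        bool matched = true;
        foreach (j; 0 .. n.length)
        {
            // Compare the code points that make up each grapheme.
            if (!equal(h[i + j][], n[j][]))
            {
                matched = false;
                break;
            }
        }
        if (matched) return true;
    }
    return false;
}

void main()
{
    // é followed by U+20DD (combining enclosing circle), which has no
    // precomposed form, so NFC leaves it as a separate code point.
    string modified = "fianc\u00E9\u20DD";

    // The code point search reports a match it shouldn't.
    assert(modified.canFind("fianc\u00E9"));

    // The grapheme search sees é + U+20DD as one unit and rejects it.
    assert(!canFindGraphemes(modified, "fianc\u00E9"));

    // A decomposed é is still found once both sides are normalized.
    assert(canFindGraphemes("fiance\u0301", "fianc\u00E9"));
}
```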

Regardless, while searching for fiancé has a decent chance of success
(especially if programs generally favor using single code points instead of
multiple code points wherever possible), it's still a risky proposition
without at least doing Unicode normalization, if not outright using a range of
graphemes rather than code points.

- Jonathan M Davis