Why the hell doesn't foreach decode strings

Wed Oct 26 04:50:32 PDT 2011

On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>> What if the source character is encoded differently than the search
>> string?  This is basic unicode stuff.  See my example with fiancé.
>
> The more I think about it, the more I think it should work like this:  
> just like we assume they contain well-formed UTF sequences, char[],  
> wchar[], and dchar[] should also be assumed to contain **normalized**  
> Unicode strings. Which normalization form to use? No idea. Just pick a  
> sensible one of the four.
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
>
> Once we know all strings are normalized in the same way, we can then  
> compare two strings bitwise to check if they're the same. And we can  
> check for a substring in the same manner except we need to insert a  
> simple check after the match to verify that it isn't surrounded by  
> combining marks applying to the first or last character. And it'll also  
> deeply simplify any proper Unicode string handling code we'll add in the  
> future.
>
>  - - -
>
> That said, I fear that forcing a specific normalization might be  
> problematic. You don't always want to have to normalize everything...
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
>
> So perhaps we could simplify things a bit more: don't pick a standard  
> normalization form. Just assume that both strings being used are in the  
> same normalization form. Comparison will work, searching for a substring  
> in the way specified above will work, and other functions could document  
> which normalization form they accept. Problem solved... somewhat.

It's even easier than this:

a) If you want a proper string comparison without knowing what state the  
Unicode strings are in, use the full-fledged decode-when-needed string  
type and its associated str.find method.
b) If you know both strings are in the same normalization form and want to  
optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
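
A rough sketch of the two cases, under some assumptions: the decode-when-needed string type and asArray are only proposals in this thread, so the sketch substitutes std.uni.normalize and std.string.representation from current Phobos. The helper names containsNormalized and containsRaw are made up for illustration.

```d
import std.algorithm.searching : canFind;
import std.string : representation;
import std.uni : normalize, NFC;

// Case (a), approximated: the state of the inputs is unknown, so bring
// both to the same form (NFC here) before searching. Hypothetical helper,
// standing in for the proposed string type's find method.
bool containsNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

// Case (b): the caller promises both strings already share a normalization
// form, so a plain code-unit search is enough. representation stands in
// for the hypothetical asArray. Note: no check for trailing combining marks.
bool containsRaw(string haystack, string needle)
{
    return haystack.representation.canFind(needle.representation);
}

void main()
{
    string composed   = "fianc\u00E9";   // é as a single code point
    string decomposed = "fiance\u0301";  // e followed by a combining acute

    // Different encodings of the same text: the raw search misses it.
    assert(!containsRaw(composed, decomposed));

    // After normalizing both sides, the match is found.
    assert(containsNormalized(composed, decomposed));

    // When the forms already agree, the raw search is fine.
    assert(containsRaw(composed, "fianc"));
}
```

The raw version skips decoding and normalization entirely, which is the optimization in (b), but it is only correct when the precondition about matching forms actually holds.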

-Steve