Why the hell doesn't foreach decode strings

Michel Fortin michel.fortin at michelf.com
Mon Oct 24 16:58:59 PDT 2011


On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" 
<schveiguy at yahoo.com> said:

> What if the source character is encoded differently than the search
> string?  This is basic unicode stuff.  See my example with fiancé.

The more I think about it, the more I think it should work like this: 
just as we assume char[], wchar[], and dchar[] contain well-formed UTF 
sequences, they should also be assumed to contain **normalized** 
Unicode strings. Which normalization form to use? No idea. Just pick a 
sensible one of the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
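
For example, "fiancé" can be spelled with the precomposed é (U+00E9) 
or with 'e' followed by the combining acute accent (U+0301): the two 
are canonically equivalent but are different byte sequences. A minimal 
sketch of the idea, assuming the normalize/NFC that appeared in a 
later std.uni (not available as of this post):

import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "fianc\u00E9";   // é as a single code point
    string decomposed  = "fiance\u0301";  // 'e' plus combining acute accent

    writeln(precomposed == decomposed);   // false: different byte sequences
    writeln(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // true
}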

Once we know all strings are normalized in the same way, we can 
compare two strings bitwise to check whether they're the same. We can 
search for a substring in the same manner, except that we need a 
simple check after the match to verify that the match isn't preceded 
or followed by combining marks that apply to its first or last 
character. It'll also greatly simplify any proper Unicode string 
handling code we add in the future.
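
Roughly, the post-match check could look like the following (only the 
trailing side is shown, and it assumes both strings share one 
normalization form; containsMatch is just an illustrative name, and 
isMark comes from the newer std.uni):

import std.string : indexOf;
import std.uni : isMark;
import std.utf : decode;

bool containsMatch(string haystack, string needle)
{
    auto i = haystack.indexOf(needle);
    if (i < 0)
        return false;

    size_t after = i + needle.length;
    if (after >= haystack.length)
        return true;   // the match ends the string, nothing can follow it

    // If the next code point is a combining mark, it modifies the last
    // character of the match, so this is not a real occurrence.
    dchar next = decode(haystack, after);
    return !isMark(next);
}

void main()
{
    assert(containsMatch("fiance\u0301", "fianc"));    // 'c' is unaffected
    assert(!containsMatch("fiance\u0301", "fiance"));  // final 'e' carries U+0301
}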

 - - -

That said, I fear that forcing a specific normalization form might be 
problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>

So perhaps we could simplify things a bit more: don't pick a standard 
normalization form. Just assume that both strings involved are in the 
same normalization form. Comparison will work, searching for a 
substring in the way described above will work, and other functions 
could document which normalization form they accept. Problem solved... 
somewhat.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


