Why the hell doesn't foreach decode strings
Michel Fortin
michel.fortin at michelf.com
Mon Oct 24 16:58:59 PDT 2011
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"
<schveiguy at yahoo.com> said:
> What if the source character is encoded differently than the search
> string? This is basic unicode stuff. See my example with fiancé.
The more I think about it, the more I think it should work like this:
just as we assume they contain well-formed UTF sequences, char[],
wchar[], and dchar[] should also be assumed to contain **normalized**
Unicode strings. Which normalization form to use? No idea. Just pick a
sensible one of the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
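To make the equivalence concrete, here is a small Python sketch (stdlib `unicodedata` only; the thread is about D's string types, but the normalization behaviour is the same). It shows the "fiancé" example from the quoted message encoded in two canonically equivalent ways:

```python
import unicodedata

# "é" can be one precomposed code point (NFC) or "e" + a combining mark (NFD)
nfc = "fianc\u00e9"   # NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "fiance\u0301"  # NFD: "e" followed by U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are NFC
print(unicodedata.normalize("NFD", nfc) == nfd)  # True once both are NFD
```

This is exactly why a naive bitwise comparison fails across forms, and why assuming one shared form makes it work.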
Once we know all strings are normalized the same way, we can compare
two strings bitwise to check whether they're the same. We can also
check for a substring in the same manner, except that after a match we
need to insert a simple check verifying that the match isn't surrounded
by combining marks applying to its first or last character. And it'll
greatly simplify any proper Unicode string-handling code we add in the
future.
- - -
That said, I fear that forcing a specific normalization might be
problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
So perhaps we could simplify things a bit more: don't pick a standard
normalization form. Just assume that both strings being compared are in
the same normalization form. Comparison will work, searching for a
substring in the way specified above will work, and other functions can
document which normalization form they accept. Problem solved...
somewhat.
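A short Python sketch of the trade-off (stdlib `unicodedata` only): forcing NFC silently rewrites singleton code points such as U+212B ANGSTROM SIGN, which is one of the normalization pitfalls linked above, while merely *assuming* a shared form keeps comparison cheap without touching the data:

```python
import unicodedata

# Forcing normalization is lossy for singletons:
angstrom = "\u212b"  # ANGSTROM SIGN
print(unicodedata.normalize("NFC", angstrom) == "\u00c5")  # True: rewritten to Å

# Assuming a shared form, equality reduces to a bitwise comparison:
a = unicodedata.normalize("NFC", "fiance\u0301")  # caller's responsibility
b = "fianc\u00e9"                                 # already NFC
print(a == b)  # True: both NFC, so code-point comparison suffices
```

Under the "same form" convention, the normalization cost moves to whoever produced the string, and the library never has to rewrite anyone's data.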
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
More information about the Digitalmars-d mailing list