Why the hell doesn't foreach decode strings
Steven Schveighoffer
schveiguy at yahoo.com
Wed Oct 26 04:50:32 PDT 2011
On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin
<michel.fortin at michelf.com> wrote:
> On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"
> <schveiguy at yahoo.com> said:
>
>> What if the source character is encoded differently than the search
>> string? This is basic unicode stuff. See my example with fiancé.
>
> The more I think about it, the more I think it should work like this:
> just like we assume they contain well-formed UTF sequences, char[],
> wchar[], and dchar[] should also be assumed to contain **normalized**
> Unicode strings. Which normalization form to use? No idea. Just pick a
> sensible one of the four.
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
>
> Once we know all strings are normalized in the same way, we can then
> compare two strings bitwise to check if they're the same. And we can
> check for a substring in the same manner except we need to insert a
> simple check after the match to verify that it isn't surrounded by
> combining marks applying to the first or last character. And it'll also
> deeply simplify any proper Unicode string handling code we'll add in the
> future.
>
> - - -
>
> That said, I fear that forcing a specific normalization might be
> problematic. You don't always want to have to normalize everything...
> <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
>
> So perhaps we could simplify things a bit more: don't pick a standard
> normalization form. Just assume that both strings being used are in the
> same normalization form. Comparison will work, searching for substring
> in the way specified above will work, and other functions could document
> which normalization form they accept. Problem solved... somewhat.
It's even easier than this:
a) If you want a proper string comparison without knowing which
normalization form the Unicode strings are in, use the full-fledged
decode-when-needed string type and its associated str.find method.
b) If you know both strings are in the same normalized form and want to
optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
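
A rough sketch of the difference between (a) and (b), using the fiancé
example. Assumptions: the decode-when-needed string type and asArray are
only proposals, so std.string.representation stands in for asArray here,
and normalizing both sides with std.uni.normalize stands in for whatever
such a type would do internally. rawContains and normalizedContains are
hypothetical helper names, not Phobos functions.

```d
import std.algorithm.searching : find;
import std.string : representation;
import std.uni : normalize, NFC;

// Case (b): search raw code units, no decoding. Only correct when both
// inputs are already in the same normalization form.
bool rawContains(string haystack, string needle)
{
    return find(haystack.representation, needle.representation).length != 0;
}

// Stand-in for case (a) (an assumption, not the proposed string type):
// bring both sides to NFC first, then do the raw search.
bool normalizedContains(string haystack, string needle)
{
    return rawContains(normalize!NFC(haystack), normalize!NFC(needle));
}

void main()
{
    string haystack = "my fianc\u00E9";  // precomposed é (U+00E9)
    string needle   = "fiance\u0301";    // 'e' + combining acute (U+0301)

    // Same text to a reader, different code units: the raw search misses it.
    assert(!rawContains(haystack, needle));

    // Once both sides share a normalization form, the raw search finds it.
    assert(normalizedContains(haystack, needle));
}
```

Note that neither helper performs the check for a combining mark right
after the match, so a bare "fiance" would still match inside decomposed
"fiancé".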
-Steve