Why the hell doesn't foreach decode strings
Michel Fortin
michel.fortin at michelf.com
Wed Oct 26 07:04:08 PDT 2011
On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer"
<schveiguy at yahoo.com> said:
> It's even easier than this:
>
> a) you want to do a proper string comparison not knowing what state the
> unicode strings are in, use the full-fledged decode-when-needed string
> type, and its associated str.find method.
> b) you know they are both the same normalized form and want to optimize,
> use std.algorithm.find(haystack.asArray, needle.asArray).
Well, treating the string as an array of dchar doesn't work in the
general case: even with strings normalized the same way, your fiancé
example can break. So I should never treat strings as plain arrays
unless I'm sure they contain no combining marks.
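To make the combining-mark problem concrete, here is a small illustration in Python rather than D (using the standard unicodedata module), showing that the same visible text "fiancé" can be two different code-point sequences, and that a naive array-level substring search can even match half of a grapheme:

```python
import unicodedata

# "fiancé" in two normalization forms: NFC stores é as a single code
# point (U+00E9); NFD stores it as 'e' followed by a combining acute
# accent (U+0301). Both render identically.
nfc = unicodedata.normalize("NFC", "fiance\u0301")
nfd = unicodedata.normalize("NFD", "fiancé")

# Compared as raw code-point arrays, the two strings differ.
print(nfc == nfd)  # False

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True

# Worse: a naive code-point substring search finds "fiance" (no accent)
# inside the NFD string, because the combining mark follows the base
# letter - the match cuts a grapheme in half.
print("fiance" in nfd)  # True (false positive)
print("fiance" in nfc)  # False
```

This is why normalizing once is not enough: the search itself has to respect grapheme boundaries, not just code points.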
I'm not opposed to having a new string type being developed, but I'm
skeptical about its inclusion in the language. We already have three
string types which can be assumed to contain valid UTF sequences. I
think the first thing to do is not to develop a new string type, but to
develop the normalization and grapheme splitting algorithms, plus one
to find a substring, using the existing char[], wchar[] and dchar[]
types. Then write a program with proper handling of Unicode using
those and hand-optimize it. If that proves to be a pain (it might well
be), write a new string type, rewrite the program using it, do some
benchmarks and then we'll know if it's a good idea, and will be able to
quantify the drawbacks.
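As a rough sketch of the algorithms proposed above, again in Python rather than D, here is a deliberately simplified grapheme splitter (it only attaches combining marks to the preceding base character; the full Unicode UAX #29 rules also cover Hangul jamo, ZWJ emoji sequences, and more) and a substring search built on top of it that normalizes each cluster before comparing:

```python
import unicodedata

def graphemes(s):
    """Split s into clusters, attaching combining marks (general
    category M*) to the preceding base character. Simplified sketch;
    not full UAX #29 segmentation."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def find_graphemes(haystack, needle):
    """Return the grapheme index of needle in haystack, or -1.
    Each cluster is NFC-normalized before comparison, so composed and
    decomposed forms of the same character match."""
    h = [unicodedata.normalize("NFC", g) for g in graphemes(haystack)]
    n = [unicodedata.normalize("NFC", g) for g in graphemes(needle)]
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1

# Decomposed "fiancé": the accent is a separate code point.
decomposed = "fiance\u0301"
print(find_graphemes(decomposed, "fiancé"))  # 0: forms match after NFC
print(find_graphemes(decomposed, "fiance"))  # -1: no half-grapheme match
```

Note how the search rejects "fiance" against decomposed "fiancé", exactly the false positive a code-point-level find produces; this is the kind of baseline one could hand-optimize and benchmark before deciding whether a dedicated string type is warranted.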
But right now, all this arguing for or against a new string type is
stacking one hypothesis against another; it won't lead anywhere.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/