Why the hell doesn't foreach decode strings

Michel Fortin michel.fortin at michelf.com
Wed Oct 26 07:04:08 PDT 2011


On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" 
<schveiguy at yahoo.com> said:

> It's even easier than this:
> 
> a) you want to do a proper string comparison not knowing what state the
> unicode strings are in, use the full-fledged decode-when-needed string
> type, and its associated str.find method.
> b) you know they are both the same normalized form and want to optimize,
> use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the 
general case; even with strings normalized the same way, your fiancé 
example can break. So I should never treat them as plain arrays unless 
I'm sure the string contains no combining marks.
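The failure mode is easy to show concretely. Here is a sketch in Python (used only because it makes the code-point-level issue easy to demonstrate; the string literals are illustrative): both strings are in the same normalization form, yet a code-point search, which is what scanning an array of dchar amounts to, reports a match that splits a grapheme.

```python
import unicodedata

# Both strings are in NFD (decomposed) form, i.e. "normalized the same
# way": the acute accent is a separate combining code point.
haystack = unicodedata.normalize("NFD", "fiancé")  # f i a n c e U+0301
needle = "fiance"                                  # no accent at all

# A code-point-level (dchar-style) search finds a match...
assert needle in haystack

# ...but at the grapheme level there is no match: the needle's final
# "e" splits the cluster "e" + COMBINING ACUTE ACCENT in the haystack.
follows = haystack[haystack.index(needle) + len(needle)]
assert unicodedata.combining(follows) != 0  # match ends mid-grapheme
```

The same false positive occurs for any needle whose last code point is followed by a combining mark in the haystack, which is why normalization alone is not enough.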

I'm not opposed to having a new string type developed, but I'm 
skeptical about its inclusion in the language. We already have three 
string types which can be assumed to contain valid UTF sequences. I 
think the first thing to do is not to develop a new string type, but to 
implement the normalization and grapheme-splitting algorithms, plus 
algorithms to find a substring, on top of the existing char[], wchar[] 
and dchar[] types.  Then write a program with proper handling of 
Unicode using those and hand-optimize it. If that proves to be a pain 
(it might well be), write a new string type, rewrite the program using 
it, run some benchmarks, and then we'll know whether it's a good idea 
and will be able to quantify the drawbacks.
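Such a grapheme-aware substring search over plain arrays might look like the following sketch (again in Python for testability; `grapheme_find` is a hypothetical helper, not an existing library routine, and it assumes both inputs already share the same normalization form):

```python
import unicodedata

def grapheme_find(haystack: str, needle: str) -> int:
    """Find needle in haystack, rejecting any match whose boundaries
    fall inside a grapheme cluster (base char + combining marks).
    Returns the match index, or -1. Assumes both strings are in the
    same normalization form."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return -1
        end = i + len(needle)
        # A match is valid only if neither boundary splits a cluster:
        # the code point at each boundary must not be a combining mark.
        starts_ok = i == 0 or unicodedata.combining(haystack[i]) == 0
        ends_ok = (end == len(haystack)
                   or unicodedata.combining(haystack[end]) == 0)
        if starts_ok and ends_ok:
            return i
        start = i + 1

decomposed = unicodedata.normalize("NFD", "fiancé")
assert grapheme_find(decomposed, "fiance") == -1  # no grapheme-level match
assert grapheme_find(decomposed, "fianc") == 0    # boundary on a base char
```

This only checks combining classes at the match boundaries; a full implementation would follow the Unicode grapheme-cluster boundary rules, but the point stands that it can be layered over the existing array types without a new string type.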

But right now, all this arguing for or against a new string type is 
stacking one hypothesis against another; it won't lead anywhere.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
