Why the hell doesn't foreach decode strings

Steven Schveighoffer schveiguy at yahoo.com
Wed Oct 26 07:20:54 PDT 2011


On Wed, 26 Oct 2011 10:04:08 -0400, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>> It's even easier than this:
>> a) If you want to do a proper string comparison without knowing what
>> state the Unicode strings are in, use the full-fledged
>> decode-when-needed string type and its associated str.find method.
>> b) If you know they are both in the same normalized form and want to
>> optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
>
> Well, treating the string as an array of dchar doesn't work in the
> general case: even with strings normalized the same way, your fiancé
> example can break. So I should never treat them as plain arrays unless
> I'm sure I have no combining marks in the string.

If both strings have the same normalization, an array search would at
least find all the valid instances.  You'd then have to eliminate false
positives by decoding the next dchar after each match to see whether
it's a combining character.  This would be a simple function to write.

All I'm saying is, doing a search on the array should be available.
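
As a rough sketch of that function, and not a definitive implementation:
it assumes a current Phobos (std.uni.isMark, std.utf.decode,
std.string.representation), valid UTF-8 on both sides, and the same
normalization form for both strings.  The name findWhole is made up for
illustration; it is not an existing library function.

```d
import std.algorithm.searching : find;
import std.string : representation;
import std.uni : isMark;
import std.utf : decode;

/// Hypothetical helper (not part of Phobos): returns the code-unit index
/// of the first match of `needle` in `haystack` that is not immediately
/// followed by a combining mark, or -1 if there is no such match.
/// Assumes valid UTF-8 and identical normalization for both strings.
ptrdiff_t findWhole(string haystack, string needle)
{
    if (needle.length == 0)
        return 0;

    size_t start = 0;
    while (start <= haystack.length)
    {
        // Plain array search over the raw code units, no decoding.
        auto rest = haystack[start .. $].representation
                        .find(needle.representation);
        if (rest.length == 0)
            return -1;

        immutable matchPos = haystack.length - rest.length;
        immutable next = matchPos + needle.length;

        // A match at the very end has nothing after it to combine with.
        if (next == haystack.length)
            return matchPos;

        // Decode only the single code point that follows the match.
        size_t idx = next;
        immutable dchar following = decode(haystack, idx);
        if (!isMark(following))
            return matchPos;

        // False positive: the match is the prefix of a longer grapheme.
        // Resume the search one code unit further on.
        start = matchPos + 1;
    }
    return -1;
}

void main()
{
    // "fiance" followed by U+0301 (combining acute accent) is "fiancé"
    // in decomposed form, so the first bare "fiance" hit must be skipped.
    string text = "my fiance\u0301 and my fiance";
    assert(findWhole(text, "fiance") == 19);
    assert(findWhole("fiance\u0301", "fiance") == -1);
}
```

This only looks at the one code point after the match and treats the
Unicode mark categories as "combining"; anything that needs full
grapheme segmentation is outside what the sketch covers.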

>
> I'm not opposed to having a new string type being developed, but I'm  
> skeptical about its inclusion in the language. We already have three  
> string types which can be assumed to contain valid UTF sequences. I  
> think the first thing to do is not to develop a new string type, but to  
> develop the normalization and grapheme splitting algorithms, and those  
> to find a substring, using the existing char[], wchar[] and dchar[]  
> types.  Then write a program with proper handling of Unicode using those  
> and hand-optimize it. If that proves to be a pain (it might well be),  
> write a new string type, rewrite the program using it, do some  
> benchmarks and then we'll know if it's a good idea, and will be able to  
> quantify the drawbacks.
>
> But right now, all this arguing for or against a new string type is
> stacking hypotheses on top of other hypotheses; it won't lead anywhere.
>

I have a half-implemented string type which does just this.  I need to  
finish it.  There are some language barriers that need to be sorted out  
(such as how to deal with tail-const for arbitrary types).

I think a user who no longer posts here (spir) had a full
implementation of Unicode strings using classes, though I don't expect
that to be comparable in performance.

-Steve

