Why the hell doesn't foreach decode strings

Fri Oct 21 02:51:11 PDT 2011

On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright  
<newshound2 at digitalmars.com> wrote:

> On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
>> And why would you iterate over a string with foreach without decoding it
>> unless you specifically need to operate on code units (which I would
>> expect to be _very_ rare)? Sure, copying doesn't require decoding, but
>> searching sure does
>
> No, it doesn't. If I'm searching for a dchar, I'll be searching for a  
> substring in the UTF-8 string. It's far, FAR more efficient to search as  
> a substring rather than decoding while searching.
>
> Even more, 99.9999% of searches involve an ascii search string. It is  
> simply not necessary to decode the searched string, as encoded chars  
> cannot be ascii. For example:
>
>     foreach (c; somestring)
>         if (c == '+')
>             { /* found it! */ }
>
> gains absolutely nothing by decoding somestring.
>
>
>> (unless you're specifically looking for a code unit rather than a code
>> point, which would not be normal). Most anything which needs to operate
>> on the characters of a string needs to decode them. And iterating over
>> them to do much of anything would require decoding, since otherwise
>> you're operating on code units, and how often does anyone do that unless
>> they're specifically messing around with character encodings?
>
> What you write sounds intuitively correct, but in my experience writing  
> Unicode processing code, it simply isn't true. One rarely needs to  
> decode.
>
>
>> However, in most cases, that is _not_ what the programmer actually wants.
>> They want to iterate over characters, not pieces of characters. So, the
>> default at this point is _wrong_ in the common case.
>
> This is simply not my experience when working with Unicode. Performance  
> takes a big hit when one structures an algorithm to require  
> decoding/encoding. Doing the algorithm using substrings is a huge win.
>
> Take a look at dmd's lexer, it handles Unicode correctly and avoids  
> doing decoding as much as possible.
>
You have a good point here. I would have immediately thrown out the loop
AFTER profiling.
What hits me here is that I had an incorrect program despite using built-in
Unicode-aware strings.
This is counterintuitive given the correct Unicode handling throughout the
std library, and even more so given the complementary operation of appending
any char type to strings.
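
A minimal sketch of the asymmetry, assuming a UTF-8 string holding one
non-ASCII character. The helper names (countUnits, countPoints, containsUnit)
are made up for illustration and are not from Phobos:

```d
import std.stdio : writeln;

// Hypothetical helper: counts iterations when foreach infers the
// element type, which is char (a UTF-8 code unit) for string.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)          // c is char: one code unit per iteration
        ++n;
    return n;
}

// Hypothetical helper: counts iterations when dchar is requested.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)    // foreach decodes one code point per iteration
        ++n;
    return n;
}

// Hypothetical helper: compares each undecoded code unit with the needle.
bool containsUnit(string s, dchar needle)
{
    foreach (c; s)
        if (c == needle)
            return true;
    return false;
}

void main()
{
    string s = "a+\u00E4";              // U+00E4 takes two UTF-8 code units

    writeln(countUnits(s));             // number of code units
    writeln(countPoints(s));            // number of code points
    writeln(containsUnit(s, '+'));      // ASCII needle, no decoding needed
    writeln(containsUnit(s, '\u00E4')); // non-ASCII needle vs. single code units

    string t;
    dchar d = '\u00E4';
    t ~= d;                             // appending a dchar encodes it to UTF-8
    writeln(t.length);                  // length in code units after the append
}
```

The ASCII search works on code units, as Walter says. The non-ASCII search
does not match in this input, because no single code unit equals the code
point; with other input it could even match a stray lead byte. Appending,
on the other hand, encodes the dchar for me.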

martin