Why the hell doesn't foreach decode strings

Martin Nowak dawg at dawgfoto.de
Fri Oct 21 02:51:11 PDT 2011


On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright  
<newshound2 at digitalmars.com> wrote:

> On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
>> And why would you iterate over a string with foreach without decoding it
>> unless you specifically need to operate on code units (which I would
>> expect to be _very_ rare)? Sure, copying doesn't require decoding, but
>> searching sure does
>
> No, it doesn't. If I'm searching for a dchar, I'll be searching for a  
> substring in the UTF-8 string. It's far, FAR more efficient to search as  
> a substring rather than decoding while searching.
>
> Even more, 99.9999% of searches involve an ASCII search string. It is
> simply not necessary to decode the searched string, as encoded chars
> cannot be ASCII. For example:
>
>     foreach (c; somestring)
>         if (c == '+')
>             { /* found it! */ }
>
> gains absolutely nothing by decoding somestring.
>
>
>> (unless you're specifically looking for a code unit rather than a code
>> point, which would not be normal). Most anything which needs to operate
>> on the characters of a string needs to decode them. And iterating over
>> them to do much of anything would require decoding, since otherwise
>> you're operating on code units, and how often does anyone do that unless
>> they're specifically messing around with character encodings?
>
> What you write sounds intuitively correct, but in my experience writing  
> Unicode processing code, it simply isn't true. One rarely needs to  
> decode.
>
>
>> However, in most cases, that is _not_ what the programmer actually
>> wants. They want to iterate over characters, not pieces of characters.
>> So, the default at this point is _wrong_ in the common case.
>
> This is simply not my experience when working with Unicode. Performance  
> takes a big hit when one structures an algorithm to require  
> decoding/encoding. Doing the algorithm using substrings is a huge win.
>
> Take a look at dmd's lexer, it handles Unicode correctly and avoids  
> doing decoding as much as possible.
>
You have a good point here. I would have thrown out the decoding loop
immediately, but only AFTER profiling.
What bothers me is that I ended up with an incorrect program despite the
built-in, Unicode-aware strings.
This is inconsistent with the correct Unicode handling throughout the
standard library, and even more so with the complementary operation of
appending any char type to strings.
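
To make the asymmetry concrete, here is a minimal sketch in D. The helper
names (countUnits, countPoints, findAscii, appendPoint) are hypothetical and
exist only for illustration. It assumes nothing beyond the language behavior
discussed in this thread: a plain foreach over a string yields char code
units, an explicit dchar loop variable makes foreach decode, and ~= with a
dchar encodes to UTF-8.

```d
/// Counts UTF-8 code units: a plain foreach infers `char` and does not decode.
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

/// Counts code points: an explicit `dchar` loop variable makes foreach decode.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Finds an ASCII character by comparing code units, without decoding.
/// Returns the code-unit index, or -1 if the character is absent.
ptrdiff_t findAscii(string s, char needle)
{
    // Assumption: needle is ASCII. Code units of multi-byte sequences are
    // all >= 0x80, so an ASCII needle can never match inside one.
    assert(needle < 0x80);
    foreach (i, c; s)
        if (c == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

/// Appends a code point; the language encodes it to UTF-8 on append.
string appendPoint(string s, dchar c)
{
    s ~= c;
    return s;
}

unittest
{
    string s = "a\u00E9+";                  // U+00E9 takes two code units
    assert(countUnits(s) == 4);             // default foreach: code units
    assert(countPoints(s) == 3);            // foreach (dchar ...): code points
    assert(findAscii(s, '+') == 3);         // ASCII search needs no decoding
    assert(appendPoint("a", '\u00E9').length == 3); // append encodes
}
```

So appending is code-point aware by default, while iterating is not unless
the loop variable is declared as dchar. Compiling with -unittest runs the
checks above.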

martin

