Why the hell doesn't foreach decode strings

Thu Oct 20 21:39:56 PDT 2011

On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
> And why would you iterate over a string with foreach without decoding it
> unless you specifically need to operate on code units (which I would expect to
> be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
> does

No, it doesn't. If I'm searching for a dchar, I encode it to UTF-8 once and then
search for that substring in the UTF-8 string. That is far, FAR more efficient
than decoding the string while searching.
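A minimal sketch of that approach, assuming a reasonably recent Phobos. The
function name findCodePoint is made up for illustration; std.utf.encode,
std.string.representation and std.algorithm.searching.countUntil are the
library calls used. The needle is encoded once and the haystack is only ever
compared as raw code units:

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : encode;

/// Returns the code-unit index of the first occurrence of `needle` in
/// `haystack`, or -1 if it is absent. The haystack is never decoded.
/// (Hypothetical helper, for illustration only.)
ptrdiff_t findCodePoint(string haystack, dchar needle)
{
    char[4] buf;                          // a code point needs at most 4 UTF-8 code units
    immutable len = encode(buf, needle);  // encode the needle once, up front

    // Compare raw bytes: `representation` views the chars as ubytes, so
    // countUntil does a plain element-wise substring search.
    return haystack.representation.countUntil(buf[0 .. len].representation);
}

unittest
{
    // "naïve café", written with escapes so the bytes are unambiguous.
    string s = "na\u00EFve caf\u00E9";
    assert(findCodePoint(s, '\u00E9') == 10); // index is in code units, not characters
    assert(findCodePoint(s, '\u00EF') == 2);
    assert(findCodePoint(s, 'z') == -1);
}
```

Note that the returned index counts code units, which is what slicing the
string needs anyway.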

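Even more, 99.9999% of searches involve an ASCII search string. It is simply not
necessary to decode the string being searched, because the code units of a
multibyte UTF-8 sequence all have the high bit set and so can never compare
equal to an ASCII character. For example: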

    foreach (c; somestring)     // c is a char (UTF-8 code unit); no decoding
        if (c == '+')
            break;              // found it!

gains absolutely nothing by decoding somestring.
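To make the comparison concrete, here is a small sketch of the same search
written both ways. The function names are invented for illustration; the only
language feature relied on is that foreach over a string yields code units
when the loop variable is char and decoded code points when it is dchar:

```d
/// Scans the UTF-8 code units for an ASCII character; nothing is decoded.
/// (Hypothetical helper, for illustration only.)
bool containsAscii(string s, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    foreach (char c; s)         // char: iterate code units as stored
        if (c == needle)
            return true;
    return false;
}

/// Same answer, but every code point is decoded from UTF-8 first.
bool containsAsciiDecoding(string s, char needle)
{
    foreach (dchar c; s)        // dchar: foreach decodes each code point
        if (c == needle)
            return true;
    return false;
}

unittest
{
    // U+20AC (euro sign) is E2 82 AC in UTF-8: every code unit has the high
    // bit set, so none of them can be mistaken for '+' (0x2B).
    assert(containsAscii("5\u20AC + 3\u20AC", '+'));
    assert(!containsAscii("5\u20AC 3\u20AC", '+'));
    assert(containsAsciiDecoding("5\u20AC + 3\u20AC", '+'));
    assert(!containsAsciiDecoding("5\u20AC 3\u20AC", '+'));
}
```

Both versions give the same result for any ASCII needle; the second one just
does extra work per character to get there.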

> (unless you're specifically looking for a code unit rather than a code
> point, which would not be normal). Most anything which needs to operate on the
> characters of a string needs to decode them. And iterating over them to do
> much of anything would require decoding, since otherwise you're operating on
> code units, and how often does anyone do that unless they're specifically
> messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode 
processing code, it simply isn't true. One rarely needs to decode.

> However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters. So, the default at this point is _wrong_ in the common
> case.

This is simply not my experience when working with Unicode. Performance takes a
big hit when one structures an algorithm to require decoding/encoding.
Structuring the algorithm around substrings of code units instead is a huge win.

Take a look at dmd's lexer: it handles Unicode correctly and avoids decoding as
much as possible.
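The following is not dmd's lexer code, only a rough sketch of the general
pattern of decoding as little as possible: ASCII code units are handled
directly, and std.utf.decode is called only when a code unit with the high bit
set shows up. The function name identifierLength and its simplified notion of
an identifier (letters, digits and underscores, with no special rule for the
first character) are assumptions made for the example:

```d
import std.ascii : isAlphaNum;
import std.uni : isAlpha;
import std.utf : decode;

/// Returns the length, in code units, of the identifier-like run at the
/// start of `src` (0 if there is none).
/// (Hypothetical helper with simplified rules; not taken from dmd.)
size_t identifierLength(string src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // Fast path: an ASCII character is a single code unit.
            if (!(isAlphaNum(c) || c == '_'))
                break;
            ++i;
        }
        else
        {
            // Slow path: decode only when a multibyte sequence appears.
            size_t next = i;
            immutable dchar d = decode(src, next); // advances `next` past the sequence
            if (!isAlpha(d))
                break;
            i = next;
        }
    }
    return i;
}

unittest
{
    assert(identifierLength("foo+bar") == 3);
    assert(identifierLength("caf\u00E9 x") == 5);        // U+00E9 takes 2 code units
    assert(identifierLength("\u03B1\u03B2 = 1") == 4);   // two Greek letters, 2 code units each
    assert(identifierLength("+x") == 0);
}
```

For source text that is mostly ASCII, the decode call is rarely reached.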