Why the hell doesn't foreach decode strings
Walter Bright
newshound2 at digitalmars.com
Thu Oct 20 21:39:56 PDT 2011
On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
> And why would you iterate over a string with foreach without decoding it
> unless you specifically need to operate on code units (which I would expect to
> be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
> does
No, it doesn't. If I'm searching for a dchar, I'll encode it to UTF-8 once and
search for the resulting substring in the UTF-8 string. It's far, FAR more
efficient to search as a substring rather than decoding while searching.
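A minimal sketch of that substring approach, not taken from any particular
library. The name findCodePoint is made up for illustration; std.utf.encode is
the only Phobos call used. Because UTF-8 is self-synchronizing, a match on the
encoded bytes can only start at a code point boundary, so the haystack is never
decoded:

```d
import std.utf : encode;

// Hypothetical helper: returns the code-unit index of the first occurrence
// of `needle` in `haystack`, or -1 if it is absent.
ptrdiff_t findCodePoint(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // encode the needle once
    const(char)[] pattern = buf[0 .. len];

    if (haystack.length < len)
        return -1;

    // Plain code-unit comparison; no decoding of the haystack.
    foreach (i; 0 .. haystack.length - len + 1)
    {
        if (haystack[i .. i + len] == pattern)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

unittest
{
    // U+00E9 and U+00F6 each take two code units in UTF-8.
    assert(findCodePoint("h\u00E9llo w\u00F6rld", '\u00F6') == 8);
    assert(findCodePoint("abc", '\u00F6') == -1);
}
```

The returned index counts code units, which is what you want for slicing the
original string.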
Even more, 99.9999% of searches involve an ASCII search string. It is simply not
necessary to decode the searched string, because every code unit of a multi-byte
UTF-8 sequence has its high bit set and so can never compare equal to an ASCII
character. For example:
    foreach (c; somestring)     // c is a char, i.e. a UTF-8 code unit
        if (c == '+')
            break;              // found it!
gains absolutely nothing by decoding somestring.
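To make that concrete, here is a small sketch (findAscii is a hypothetical name,
not a Phobos function). It scans code units only, and the unittest checks the
property it relies on: every code unit of a multi-byte sequence is 0x80 or
above, so it cannot be mistaken for an ASCII character.

```d
// Hypothetical helper: returns the code-unit index of the first occurrence
// of the ASCII character `needle` in `s`, or -1 if it is absent.
ptrdiff_t findAscii(const(char)[] s, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    foreach (i, char c; s)      // `char` asks for code units, not code points
    {
        if (c == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

unittest
{
    // U+00E9 is two code units and U+20AC is three; all are >= 0x80.
    foreach (char c; "\u00E9\u20AC")
        assert(c >= 0x80);

    // So '+' is found at code-unit index 2 + 3 = 5 without any decoding.
    assert(findAscii("\u00E9\u20AC+1", '+') == 5);
}
```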
> (unless you're specifically looking for a code unit rather than a code
> point, which would not be normal). Most anything which needs to operate on the
> characters of a string needs to decode them. And iterating over them to do
> much of anything would require decoding, since otherwise you're operating on
> code units, and how often does anyone do that unless they're specifically
> messing around with character encodings?
What you write sounds intuitively correct, but in my experience writing Unicode
processing code, it simply isn't true. One rarely needs to decode.
> However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters. So, the default at this point is _wrong_ in the common
> case.
This is simply not my experience when working with Unicode. Performance takes a
big hit when one structures an algorithm to require decoding/encoding. Doing the
algorithm using substrings is a huge win.
Take a look at dmd's lexer: it handles Unicode correctly and avoids decoding
as much as possible.