Why the hell doesn't foreach decode strings

Jonathan M Davis jmdavisProg at gmx.com
Thu Oct 20 20:58:50 PDT 2011


On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:
> On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
> > True, but if the default were dchar, then the common case would have
> > fewer bugs
> 
> Is that really the common case? It's certainly the *slow* case. Common
> string operations like searching, copying, etc., do not require decoding.

And why would you iterate over a string with foreach without decoding it 
unless you specifically need to operate on code units (which I would expect to 
be _very_ rare)? Sure, copying doesn't require decoding, but searching sure 
does (unless you're specifically looking for a code unit rather than a code 
point, which would not be normal). Most anything which needs to operate on the 
characters of a string needs to decode them. And iterating over them to do 
much of anything would require decoding, since otherwise you're operating on 
code units, and how often does anyone do that unless they're specifically 
messing around with character encodings?
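
To illustrate the searching point, here is a minimal sketch. The helper names 
containsByUnit and containsByPoint are made up for this example and are not 
Phobos functions. Comparing individual code units against a non-ASCII code 
point never matches, while decoding finds it:

```d
import std.stdio : writeln;

// Hypothetical helper: compares each UTF-8 code unit against the target.
bool containsByUnit(string s, dchar target)
{
    foreach (c; s)              // c is deduced as char, i.e. one code unit
        if (c == target)
            return true;
    return false;
}

// Hypothetical helper: asks foreach to decode, then compares code points.
bool containsByPoint(string s, dchar target)
{
    foreach (dchar c; s)        // explicit dchar makes foreach decode
        if (c == target)
            return true;
    return false;
}

void main()
{
    string s = "caf\u00E9";     // U+00E9 is stored as two UTF-8 code units
    writeln(containsByUnit(s, '\u00E9'));   // false
    writeln(containsByPoint(s, '\u00E9'));  // true
}
```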

Sure, if you _know_ that you're dealing with a string with only ASCII, it's 
faster to just iterate over chars, and in that case you can explicitly give 
the type of the foreach variable as char. But normally what people care about 
is iterating over characters, not pieces of characters. So, I would expect the 
case where people _want_ to iterate over chars to be rare. In most cases, 
iterating over a string as chars is a bug - one which in many cases won't be 
quickly caught, because the programmer is English speaking and uses almost 
exclusively ASCII for whatever testing that they do.
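
As a rough sketch of why ASCII-only testing hides the problem (countUnits and 
countPoints are names invented for this example): the two loops agree on ASCII 
input and only diverge once a non-ASCII character shows up.

```d
import std.stdio : writeln;

// Hypothetical helper: one iteration per code unit (the foreach default).
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)          // c is deduced as char
        ++n;
    return n;
}

// Hypothetical helper: one iteration per decoded code point.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)    // explicit dchar makes foreach decode
        ++n;
    return n;
}

void main()
{
    // ASCII only: both loops run 5 times, so the bug stays invisible.
    writeln(countUnits("hello"), " ", countPoints("hello"));            // 5 5

    // U+00E9 takes two UTF-8 code units, so the counts differ.
    writeln(countUnits("h\u00E9llo"), " ", countPoints("h\u00E9llo"));  // 6 5
}
```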

The default for string handling really should be to treat them as ranges of 
dchar but still make it easy for them to be treated as arrays of code units 
when necessary. There's plenty of code in Phobos which special-cases strings 
to operate on them more efficiently, either because it's not necessary to 
treat them as ranges of dchar or because the string can be stepped through 
explicitly with functions such as stride. But the default is still to operate on them as 
ranges of dchar, because that is what is normally correct. Defaulting to the 
guaranteed correct handling of characters and special casing when it's 
possible to write code more efficiently than that is definitely the way to go 
about it, and it's how Phobos generally does it. The fact that foreach doesn't 
is incongruous with how strings are handled in most other cases.
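
A small sketch of that contrast, assuming a Phobos in which narrow strings are 
ranges of dchar (pointsViaRange and pointsViaStride are names invented here). 
The range view decodes, while stride lets code step over whole code points by 
code unit count without producing the decoded value:

```d
import std.range : ElementType, walkLength;
import std.stdio : writeln;
import std.utf : stride;

// As a range, string has dchar elements even though it is an array of
// immutable char.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: counts code points through the range interface.
size_t pointsViaRange(string s)
{
    return s.walkLength;
}

// Hypothetical helper: counts code points by skipping over each encoded
// sequence with stride, which returns its length in code units.
size_t pointsViaStride(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        i += stride(s, i);
        ++n;
    }
    return n;
}

void main()
{
    string s = "h\u00E9llo";
    writeln(s.length);            // 6: array length in code units
    writeln(pointsViaRange(s));   // 5
    writeln(pointsViaStride(s));  // 5
}
```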

> > (still allowing you to explicitly use char or wchar when you want to).
> > At minimum, I think that it would be a good idea to implement
> > http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning
> > not to explicitly give the type with foreach for arrays of char or
> > wchar. It would catch bugs without changing the behavior of any
> > existing code, and it still allows you to iterate over either code
> > units or code points.
> 
> I like the type deduction feature of foreach, and don't think it should be
> removed for strings. Currently, it's consistent - T[] gets an element type
> of T.

Sure, the type deduction of foreach is great, and it's completely consistent 
that iterating over an array of chars would iterate over chars rather than 
dchars when you don't give the type. However, in most cases, that is _not_ 
what the programmer actually wants. They want to iterate over characters, not 
pieces of characters. So, the default at this point is _wrong_ in the common 
case. As such, I'm very leery of any code which uses foreach over a string 
without specifying the iteration type. And in fact, unless the code is clearly 
intended to operate on code units, I would expect a comment indicating that 
the use of char instead of dchar was intentional, or I'd still consider it 
likely that it's a bug and a mistake on the programmer's part (likely due to a 
misunderstanding of Unicode and how D deals with it).

> I want to reiterate that there's no way to program strings in D without
> being cognizant of them being a multibyte representation. D is both a high
> level and a low level language, and you can pick which to use, but you
> still gotta pick.

I fully agree that programmers need to properly understand Unicode to use 
strings in D properly. However, the problem is that the default handling of 
strings with foreach is _not_ what programmers are going to normally want, so 
the default will cause bugs. If strings defaulted to iterating as ranges of 
dchar, or if programmers had to say what type they wanted to iterate over when 
dealing with strings (or at least got a warning if they didn't), then there 
would be fewer bugs. Pretty much every time that the use of strings with 
foreach comes up on this list, most everyone agrees that it's a wart in the 
language that the default is to iterate over chars rather than dchars. Not 
everyone agrees on the best way to fix the problem, but most everyone agrees 
that it _is_ a problem.

- Jonathan M Davis

