Why foreach(c; someString) must yield dchar

Jonathan M Davis jmdavisprog at gmail.com
Thu Aug 19 11:39:49 PDT 2010


On Thursday, August 19, 2010 07:13:25 Kagamin wrote:
> Jonathan Davis Wrote:
> > bytes and shorts are legitimate values on their own, so it wouldn't
> > make sense to give the type to foreach as long.
> 
> Having wider integer always has sense.
> 
> > byte or short on its own just fine.
> 
> Yes, but odds are that it's a bug. You can easily hit an overflow.

No, it doesn't hurt to have the iteration type larger than the actual type, but 
you're not going to have overflow. The value is in the array already. Sure, you 
could have had overflow putting it in, but when you're taking it out, you know 
that it fits because it was already in there. You could have overflow issues with 
math or whatnot inside the body of your loop if you're assigning to the foreach 
variable, but that has nothing to do with what you're getting out of the loop. 
With string and wstring, you're almost certainly getting individual code units, 
which are inappropriate to process on their own.
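
To make the contrast concrete, here is a minimal sketch. The helper names 
sumBytes and inferredElements are made up for illustration and are not part of 
any library. Reading a byte back out of an array can't overflow; only the 
arithmetic done with it can. With a string, what comes out when no type is 
given is a code unit rather than a character.

```d
// Reading elements out of a byte array cannot overflow: each one already
// fits in byte. Only the arithmetic can, which is why the total is an int.
int sumBytes(const byte[] values)
{
    int total = 0;
    foreach (v; values)                         // v is inferred as a byte
    {
        static assert(typeof(v).sizeof == 1);
        total += v;
    }
    return total;
}

// Collects what foreach yields from a string when no type is given.
ubyte[] inferredElements(string s)
{
    ubyte[] units;
    foreach (c; s)                              // c is inferred as a char
    {
        static assert(typeof(c).sizeof == 1);   // one UTF-8 code unit, not a dchar
        units ~= cast(ubyte) c;
    }
    return units;
}

void main()
{
    byte[] values = [100, 100, 27];
    assert(sumBytes(values) == 227);            // the sum exceeds byte.max, the elements don't

    // U+00E9 (e with acute accent) is one character but two UTF-8 code units.
    ubyte[] expected = [0xC3, 0xA9];
    assert(inferredElements("\u00E9") == expected);
}
```

Neither 0xC3 nor 0xA9 means anything as a character by itself, which is the 
difference from the byte case.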

> 
> > So, it's almost a guarantee that the correct type for iterating over a
> > string or wstring is dchar, not char or wchar. String types are just
> > weird that way due to how multibyte unicode encodings work.
> 
> If you don't like narrow strings, don't use them. Use dstring. You are free
> to write what you want.

It's fine with me to use narrow strings. Much as I'd love to avoid a lot of these 
issues, dstrings take up too much memory if you're going to be doing a lot of 
string processing. I'm aware of the issues and can program around them. The 
problem is that the default behavior is the abnormal (and therefore almost 
certainly buggy) behavior. Generally, D tries to make the default behavior the 
one that is less likely to cause bugs. Obviously, it doesn't always succeed, and 
this is one of the cases where it fails. Very few people actually want to deal 
with individual code units. They want characters. The result is that it becomes 
very easy to make mistakes if you ever try to manipulate strings character by 
character.

> 
> > So, since it makes so little sense to iterate over chars or wchars by
> > default, it would make sense to make the default dchar.
> 
> It's an iteration over array items. This makes perfect sense.

It makes perfect sense for general arrays. It makes perfect sense if you don't 
really care about the contents of the array for your algorithm (that is, whether 
they're code units or characters or just bytes in memory doesn't matter for 
what you're doing). However, if you're actually processing characters, it makes 
no sense at all. This mess with foreach and strings is one of the big reasons 
why foreach tends to be avoided in std.algorithm.

The reality of the matter is that what the container conceptually contains 
(characters) and what it actually contains (code units) aren't the same. That 
causes problems all over the place. Some reasonable workarounds have been found 
(for instance, strings are special-cased so that they're not random access 
ranges), but you have to special-case strings all over the place. The only way 
to avoid it 
completely is to just use dstring everywhere, but that doesn't necessarily scale 
well, and given the fact that the string module deals almost exclusively with 
string rather than wstring or dstring, it really doesn't make sense to use 
dstrings in the general case. Not to mention, the Linux I/O stuff uses UTF-8, and 
the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing with 
I/O.
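
The special-casing mentioned above can be checked directly. This is a sketch 
that assumes a Phobos version in which std.range treats narrow strings as 
ranges of dchar; all of the traits used are from std.range.

```d
import std.range;

// Narrow strings are treated as ranges of dchar and are not random access,
// even though the underlying arrays can be indexed by code unit.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);

// dstring needs no special casing: one element is one code point.
static assert(isRandomAccessRange!dstring);

void main()
{
    string s = "h\u00E9";          // 'h' plus U+00E9, which takes two UTF-8 code units
    assert(s.length == 3);         // array length counts code units
    assert(s.walkLength == 2);     // walking the range counts code points
    assert(s.front == 'h');        // front decodes a whole code point
}
```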

Even just making it an error - or at least a warning - to omit the type in a 
foreach over UTF-8 and UTF-16 string types would help a lot in fixing 
string-related coding errors (programmers could still choose char, wchar, or 
dchar, but they couldn't forget to put in the type and get shot in the foot 
because what they almost certainly wanted was dchar). However, there's a lot of 
generic code which runs into trouble because of this as well. The result is that 
you generally have to avoid foreach in generic code.
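
For reference, this is what giving the type explicitly looks like. The helper 
countAs is a made-up name used only for this sketch. When the declared type 
differs from the string's own code unit type, foreach transcodes as it goes.

```d
// Counts how many elements foreach yields when the loop variable has type T.
size_t countAs(T)(string s)
{
    size_t n = 0;
    foreach (T c; s)    // foreach converts from UTF-8 to T's encoding as needed
        ++n;
    return n;
}

void main()
{
    // 'a' followed by U+1F600, a code point outside the Basic Multilingual Plane.
    string s = "a\U0001F600";

    assert(countAs!char(s)  == 5);  // UTF-8 code units:  1 + 4
    assert(countAs!wchar(s) == 3);  // UTF-16 code units: 1 + 2 (surrogate pair)
    assert(countAs!dchar(s) == 2);  // code points:       1 + 1
}
```

Only the dchar loop yields one element per code point. Leaving the type off is 
the same as writing char here.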

Perhaps what we need is some way to distinguish between the exact element type 
of an array and the conceptual element type. So, for most arrays, they'd both be 
whatever the element type of the array is, but for strings the exact element 
type would be char, wchar, or dchar while the conceptual type would be dchar. 
That way, algorithms that don't care what the actual contents mean can use the 
exact element type, and the algorithms that actually care about processing the 
contents can use the conceptual element type.
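
A rough sketch of what such a pair of traits could look like follows. The names 
ExactElementType and ConceptualElementType are hypothetical and exist only for 
this illustration; isSomeChar is from std.traits. Phobos's std.range exposes a 
comparable pair in ElementEncodingType and ElementType, though this sketch does 
not claim to match their implementation.

```d
import std.traits : isSomeChar;

// Hypothetical trait: the type actually stored in the array.
alias ExactElementType(A) = typeof(A.init[0]);

// Hypothetical trait: the type an algorithm should process. It is dchar for
// any character array and the stored element type for everything else.
template ConceptualElementType(A)
{
    static if (isSomeChar!(ExactElementType!A))
        alias ConceptualElementType = dchar;
    else
        alias ConceptualElementType = ExactElementType!A;
}

// Strings: the exact type is the code unit, the conceptual type is dchar.
static assert(is(ExactElementType!string == immutable(char)));
static assert(is(ExactElementType!wstring == immutable(wchar)));
static assert(is(ConceptualElementType!string == dchar));
static assert(is(ConceptualElementType!wstring == dchar));
static assert(is(ConceptualElementType!dstring == dchar));

// Other arrays: both traits agree.
static assert(is(ExactElementType!(int[]) == int));
static assert(is(ConceptualElementType!(int[]) == int));

void main() {}
```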

- Jonathan M Davis