Why foreach(c; someString) must yield dchar
Jonathan M Davis
jmdavisprog at gmail.com
Thu Aug 19 11:39:49 PDT 2010
On Thursday, August 19, 2010 07:13:25 Kagamin wrote:
> Jonathan Davis Wrote:
> > bytes and shorts are legitimate values on their own, so it wouldn't
> > make sense to give the type to foreach as long.
>
> Having wider integer always has sense.
>
> > byte or short on its own just fine.
>
> Yes, but odds are that it's a bug. You can easily hit an overflow.
No, it doesn't hurt to have the iteration type larger than the actual type, but
you're not going to have overflow. The value is in the array already. Sure, you
could have had overflow putting it in, but when you're taking it out, you know
that it fits because it was already in there. You could have overflow issues with
math or whatnot inside the body of your loop if you're assigning to the foreach
variable, but that has nothing to do with what you're getting out of the loop.
With string and wstring, you're almost certainly getting a type that is
inappropriate to process by itself.
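To make the difference concrete, here is a minimal sketch of what foreach hands you for a narrow string. The helper names (inferredElements, decodedElements) are made up for illustration and are not part of any library.

```d
// Collects what `foreach (c; s)` yields when the type is left out:
// the raw UTF-8 code units stored in the array.
ubyte[] inferredElements(string s)
{
    ubyte[] result;
    foreach (c; s)                      // c is inferred as a one-byte char
    {
        static assert(c.sizeof == 1);
        result ~= cast(ubyte) c;
    }
    return result;
}

// Collects what `foreach (dchar c; s)` yields: decoded code points.
dchar[] decodedElements(string s)
{
    dchar[] result;
    foreach (dchar c; s)                // foreach decodes the UTF-8 for us
        result ~= c;
    return result;
}

void main()
{
    string s = "\u00E9";                // one code point, two UTF-8 code units
    ubyte[] expectedUnits = [0xC3, 0xA9];

    assert(inferredElements(s) == expectedUnits);
    assert(decodedElements(s) == "\u00E9"d);
}
```

Neither half of the two-byte sequence means anything on its own, which is the sense in which the inferred type is inappropriate to process by itself.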
>
> > So, it's almost a guarantee that the correct type for iterating over a
> > string or wstring is dchar, not char or wchar. String types are just
> > weird that way due to how multibyte unicode encodings work.
>
> If you don't like narrow strings, don't use them. Use dstring. You are free
> to write what you want.
It's fine with me to use narrow strings. Much as I'd love to avoid a lot of these
issues, dstrings take up too much memory if you're going to be doing a lot of
string processing. I'm aware of the issues and can program around them. The
problem is that the default behavior is the abnormal (and therefore almost
certainly buggy) behavior. Generally D tries to make the normal behavior the
behavior that is less likely to cause bugs. Obviously, it doesn't always
succeed, and this case is one of them. Very few people are actually going to
want to deal with code units. They want characters. The result is that it
becomes very easy to make mistakes with strings if you ever try to manipulate
them character by character.
>
> > So, since it makes so little sense to iterate over chars or wchars by
> > default, it would make sense to make the default dchar.
>
> It's an iteration over array items. This makes perfect sense.
It makes perfect sense for general arrays. It makes perfect sense if you don't
really care about the contents of the array for your algorithm (that is, whether
they're code units or characters or just bytes in memory doesn't matter for
what you're doing). However, if you're actually processing characters, it makes
no sense at all. This mess with foreach and strings is one of the big reasons
why foreach tends to be avoided in std.algorithm.
The reality of the matter is that what the container conceptually contains
(characters) and what it actually contains aren't the same. That causes problems
all over the place. Some reasonable workarounds have been found (for instance,
strings are special-cased so that they're not random access ranges), but you
have to special case string all over the place. The only way to avoid it
completely is to just use dstring everywhere, but that doesn't necessarily scale
well, and given the fact that the std.string module deals almost exclusively with
string rather than wstring or dstring, it really doesn't make sense to use
dstrings in the general case. Not to mention, the Linux I/O stuff uses UTF-8, and
the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing with
I/O.
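The mismatch between what a string stores and what the range view exposes can be sketched as below. This assumes a Phobos version in which the range primitives decode narrow strings; the symbols come from std.range.primitives and std.traits.

```d
import std.range.primitives : ElementType, front, isRandomAccessRange;
import std.traits : Unqual;

// Indexing sees the storage: UTF-8 code units.
static assert(is(Unqual!(typeof(string.init[0])) == char));

// The range view sees the conceptual contents: code points.
static assert(is(ElementType!string == dchar));

// Narrow strings are special-cased so that they are not random access
// ranges, while dstring and ordinary arrays are.
static assert(!isRandomAccessRange!string);
static assert(isRandomAccessRange!dstring);
static assert(isRandomAccessRange!(int[]));

void main()
{
    string s = "\u00E9t\u00E9";   // three code points, five UTF-8 code units

    assert(s.length == 5);        // length counts code units
    assert(s[0] == 0xC3);         // first code unit of the first character
    assert(s.front == '\u00E9');  // front decodes a whole code point
}
```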
Even just making it an error - or at least a warning - to omit the element type
in foreach when iterating over UTF-8 and UTF-16 string types would help a lot in
fixing string-related coding errors. Programmers could still choose char, wchar,
or dchar, but they couldn't forget the type and get shot in the foot when what
they almost certainly wanted was dchar. However, there's a lot of generic code
which runs into trouble because of this as well. The result is that you
generally have to avoid foreach in generic code.
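A rough sketch of both points follows: the element type you name in foreach decides what you iterate over, and a generic function that lets foreach infer the type quietly counts something different for strings than one written against the range primitives. The function names are hypothetical, and the second function assumes a Phobos in which popFront decodes narrow strings.

```d
import std.range.primitives : empty, popFront;

// Generic code that relies on foreach's inferred element type.
size_t countWithForeach(R)(R r)
{
    size_t n;
    foreach (e; r)                      // for string, e is a code unit
        ++n;
    return n;
}

// The same count written against the range primitives.
size_t countWithPrimitives(R)(R r)
{
    size_t n;
    for (; !r.empty; r.popFront())      // for string, this steps by code point
        ++n;
    return n;
}

// Counts the elements of a string for an explicitly chosen element type.
size_t countAs(C)(string s)
{
    size_t n;
    foreach (C c; s)
        ++n;
    return n;
}

void main()
{
    int[] ints = [1, 2, 3];
    assert(countWithForeach(ints) == countWithPrimitives(ints));

    string s = "a\U0001D11E";           // 'a' plus one code point outside the BMP
    assert(countWithForeach(s) == 5);   // UTF-8 code units
    assert(countWithPrimitives(s) == 2);

    assert(countAs!char(s) == 5);       // UTF-8 code units
    assert(countAs!wchar(s) == 3);      // UTF-16 code units
    assert(countAs!dchar(s) == 2);      // code points
}
```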
Perhaps what we need is some way to distinguish between the exact element type
on an array and the conceptual element type. So, for most arrays, they'd both be
whatever the element type of the array is, but for strings the exact element
type would be char, wchar, or dchar while the conceptual type would be dchar.
That way, algorithms that don't care what the actual contents mean can use the
exact element type, and the algorithms that actually care about processing the
contents can use the conceptual element type.
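One way this split could look, purely as a sketch: ExactElementType and ConceptualElementType are names invented here for illustration, not existing library templates, and they only handle dynamic arrays.

```d
import std.traits : Unqual;

// Hypothetical: the type actually stored in the array.
template ExactElementType(T : E[], E)
{
    alias ExactElementType = E;
}

// Hypothetical: the type the contents conceptually represent.
// Narrow strings map to dchar; everything else is left unchanged.
template ConceptualElementType(T : E[], E)
{
    static if (is(Unqual!E == char) || is(Unqual!E == wchar))
        alias ConceptualElementType = dchar;
    else
        alias ConceptualElementType = E;
}

static assert(is(ExactElementType!string == immutable(char)));
static assert(is(ExactElementType!wstring == immutable(wchar)));
static assert(is(ConceptualElementType!string == dchar));
static assert(is(ConceptualElementType!(char[]) == dchar));
static assert(is(ConceptualElementType!(int[]) == int));

// An algorithm that cares about the contents iterates by the conceptual type.
size_t countConceptual(T)(T arr)
{
    size_t n;
    foreach (ConceptualElementType!T e; arr)
        ++n;
    return n;
}

void main()
{
    assert(countConceptual("na\u00EFve") == 5);   // 6 code units, 5 code points
    assert(countConceptual([1, 2, 3]) == 3);      // ordinary arrays are unaffected
}
```

An algorithm that only shuffles storage around would use ExactElementType instead and never pay for decoding.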
- Jonathan M Davis