Why foreach(c; someString) must yield dchar

Jonathan Davis jmdavisprog at gmail.com
Thu Aug 19 02:56:44 PDT 2010


On 8/19/10, Kagamin <spam at here.lot> wrote:
> Jonathan M Davis Wrote:
>
>> Considering that in all likelihood 99+% of the cases where someone is
>> iterating over char, they really want dchar
>
> And when someone is iterating over byte[] or short[], they want long, right?
> Yeah, why not?
>

The problem is that chars are not characters. They are UTF-8 code
units. If all you're using is ASCII, you can get away with treating
them like one-byte characters, but that breaks as soon as the string
contains anything outside of ASCII. A dchar, on the other hand, always
holds a complete code point, which is as close to a character as the
built-in types get. So, the correct way to iterate over a string or
wstring if you want to treat the elements as characters is to give the
type as dchar.

foreach(dchar c; mystring)
{
    //...
}

If you use char or wchar, you're going to iterate over code units,
which is completely different. That is rarely the correct thing to do,
so if someone does it in their code, odds are that it's a bug.
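
To make the difference concrete, here is a minimal sketch that counts
the elements of the same string both ways. The helper names
countCodeUnits and countCodePoints are made up for illustration; they
are not part of Phobos.

```d
import std.stdio : writeln;

// Hypothetical helper: iterating as char visits every UTF-8 code unit.
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

// Hypothetical helper: iterating as dchar decodes the UTF-8 and visits
// every code point.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+00E9 (e with acute accent) takes two code units in UTF-8.
    string s = "h\u00E9llo";

    assert(s.length == 6);           // .length counts code units as well
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);

    writeln(countCodeUnits(s), " code units, ",
            countCodePoints(s), " code points");
}
```

For a pure ASCII string the two counts agree, which is why this kind of
bug tends to go unnoticed until non-ASCII input shows up.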

Bytes and shorts are legitimate values on their own, so it wouldn't
make sense to give the type to foreach as long. You can deal with each
byte or short on its own just fine. You can't safely do that with code
units unless you actually want to operate on code units (which is
unlikely) or you don't care about the contents of the string at all
(some algorithms, such as copying or concatenation, never look at the
elements of the arrays/ranges that they're dealing with).
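
As a rough illustration of that contrast, the sketch below sums a
byte[] element by element, concatenates a string without looking at
its contents, and then slices a string in the middle of a multi-unit
sequence. isValidUtf is a hypothetical wrapper written for this
example; std.utf.validate and UTFException are the Phobos pieces it
relies on.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Each byte is a meaningful value on its own.
    byte[] values = [1, 2, 3];
    int sum = 0;
    foreach (b; values)
        sum += b;
    assert(sum == 6);

    string s = "h\u00E9llo"; // U+00E9 occupies code units 1 and 2

    // Concatenation never inspects the elements, so it is safe.
    assert(isValidUtf(s ~ s));

    // Slicing on a code point boundary is fine...
    assert(isValidUtf(s[0 .. 3]));

    // ...but treating one code unit as a character cuts U+00E9 in half.
    assert(!isValidUtf(s[0 .. 2]));
}
```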

So, the correct type for iterating over a string or wstring is almost
always dchar, not char or wchar. String types are just weird that way
because UTF-8 and UTF-16 are variable-width encodings. Since it makes
so little sense to iterate over chars or wchars by default, it would
make sense to make the default dchar.
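
The same thing happens with wstring. In this sketch (the function
names are again invented for the example), leaving the type off gives
you the array's element type, so a code point outside the Basic
Multilingual Plane shows up as two separate wchars unless you ask for
dchar explicitly.

```d
// Hypothetical helper: with no type given, foreach infers the array's
// element type, so this walks UTF-16 code units.
size_t countInferred(wstring w)
{
    size_t n = 0;
    foreach (c; w)
    {
        static assert(c.sizeof == 2); // c is a wchar, not a dchar
        ++n;
    }
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode UTF-16.
size_t countDecoded(wstring w)
{
    size_t n = 0;
    foreach (dchar c; w)
        ++n;
    return n;
}

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) needs a surrogate pair in UTF-16.
    wstring w = "\U0001D11E"w;

    assert(w.length == 2);
    assert(countInferred(w) == 2); // two surrogates, neither a character
    assert(countDecoded(w) == 1);  // one code point

    foreach (dchar c; w)
        assert(c == 0x1D11E);
}
```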

- Jonathan M Davis

