Why the hell doesn't foreach decode strings
Martin Nowak
dawg at dawgfoto.de
Thu Oct 20 16:57:06 PDT 2011
On Thu, 20 Oct 2011 21:58:20 +0200, Jonathan M Davis <jmdavisProg at gmx.com>
wrote:
> On Thursday, October 20, 2011 21:37:56 Martin Nowak wrote:
>> It just took me over one hour to find out the unthinkable.
>> foreach(c; str) will deduce c to immutable(char) and doesn't care about
>> Unicode.
>> Now there is so much Unicode transcoding happening in the language that it
>> starts to get annoying,
>> but the most basic string iteration doesn't support it by default?
>
> Walter won't change it, because it would silently change too much code. Now,
> I'm willing to bet that in 99.9999999% of cases, it would _fix_ the code rather
> than break it, but still, he won't do it. However, the behavior _is_
> completely consistent with the rest of the language, since it's the range-based
> stuff which decodes arrays of chars or wchars as characters. And it
> _would_ be inconsistent with all other uses of foreach for arrays of char or
> wchar to be iterated over as ranges of dchar. But still, it's a bug waiting to
> happen which doesn't really benefit anyone.
>
> I've suggested that there should be a warning when code uses a foreach over an
> array of char or wchar without specifying the iteration type
> ( http://d.puremagic.com/issues/show_bug.cgi?id=4483 ). That way, you can
> specify char or wchar if you really want it, but anyone who forgets to
> explicitly use dchar (or doesn't realize that they should) is warned. But that
> hasn't been implemented as of yet, and I don't believe that Walter has voiced
> his opinion on it.
>
> - Jonathan M Davis
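
For reference, a minimal sketch of the behaviour under discussion, assuming a current D compiler with Phobos; the sample string is made up for illustration. Without an explicit element type, foreach over a string steps through UTF-8 code units, while asking for dchar makes the loop decode code points.

```d
import std.stdio : writeln;

void main()
{
    string str = "a∞b"; // hypothetical sample: '∞' occupies three UTF-8 code units

    size_t units;
    foreach (c; str)        // c is deduced as immutable(char): one code unit per step
        ++units;

    size_t points;
    foreach (dchar c; str)  // explicit dchar: the loop decodes whole code points
        ++points;

    assert(units == 5);
    assert(points == 3);
    writeln(units, " code units, ", points, " code points");
}
```
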
At least it was your ∞ that revealed my bug.
Incidentally, this has given me a nice idea.
You need to combine the foreach loop 'bug' with the ability to alter the index
variable (http://d.puremagic.com/issues/show_bug.cgi?id=6652).
Then you can construct a terrifically fast, yet still correct, UTF-8 decoder.
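    foreach(i, c; s)
    {
        if (c < 0x80)
            outp.put(c); // ASCII fast path: the code unit is the code point
        else
            // decode advances i past the whole sequence;
            // --i compensates for the loop's own increment
            (outp.put(std.utf.decode(s, i)), --i);
    }

But you'd better write foreach(ref i, char c; s).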
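
A sketch of the same fast-path idea with an explicit index, for anyone who would rather not depend on how foreach treats a modified index variable (the subject of issue 6652). It relies on std.utf.decode advancing the index it receives by reference. The name decodeFast is hypothetical, and std.array.appender merely stands in for outp, which the loop above leaves undefined.

```d
import std.array : appender;
import std.utf : decode;

// Hypothetical helper: turn UTF-8 into code points, with an ASCII fast path.
dstring decodeFast(string s)
{
    auto outp = appender!dstring(); // stand-in for the undefined outp above
    size_t i = 0;
    while (i < s.length)
    {
        immutable char c = s[i];
        if (c < 0x80)
        {
            outp.put(cast(dchar) c); // ASCII: the code unit is the code point
            ++i;
        }
        else
        {
            outp.put(decode(s, i)); // decode moves i past the whole sequence
        }
    }
    return outp.data;
}

void main()
{
    assert(decodeFast("a∞b") == "a∞b"d);
    assert(decodeFast("") == ""d);
}
```

Since the index is advanced by hand here, no --i is needed to undo a loop increment.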