[Issue 4483] foreach over string or wstring, where element type not specified, does not support unicode

d-bugmail at puremagic.com d-bugmail at puremagic.com
Tue Jan 21 10:07:30 PST 2014


https://d.puremagic.com/issues/show_bug.cgi?id=4483


monarchdodra at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra at gmail.com


--- Comment #3 from monarchdodra at gmail.com 2014-01-21 10:07:21 PST ---
(In reply to comment #2)
> I took the liberty to remove the suggested solution from the title, since I
> think there are a couple of possible fixes here:
> 
> 1. Issue a warning (original suggestion)
> 2. Issue an error, always require a value type (breaking change)
> 3. Infer the value type as "dchar" in all cases (breaking change)
> 4. Throw an exception at runtime when >char, >wchar unicode is encountered
> (breaking change)
> 
> I think this issue is serious enough to warrant a breaking change. I taught a D
> workshop in China, and everybody expected foreach to "just work", and
> rightfully so.
> 
> foreach(c; "你好") {}
> 
> This should just work! And it's hard to explain to people why it doesn't,
> without getting into Unicode encoding issues, which no user wants to care
> about.
> 
> I'm going to argue for fix 3. and I'd say it's worth taking a breaking change
> for this issue.
> 
> The breaking change is compile time only, and limited to foreach over char[] or
> wchar[], with a non-ref, inferred value type, and where the scope cares about
> the value type being char or wchar. 
> 
> That last part is important: In all of druntime and phobos there were only 2
> places where that was the case. All others, including all tests(!), compiled
> (and ran) successfully without changes. The two places were fixed by adding the
> appropriate type, in both cases "char". A nice side effect of this change is
> that it makes it immediately obvious that the foreach does NOT handle the full
> Unicode character set. It's self-documenting, in a way.

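For reference, a minimal sketch of the behavior under discussion, assuming the
current code-unit default (the counts follow from "你好" being two code points
of three UTF-8 bytes each):

```d
void main()
{
    // With no element type given, foreach over a string walks UTF-8 code units.
    size_t units;
    foreach (c; "你好")        // c is inferred as immutable(char)
        ++units;
    assert(units == 6);        // two code points, three code units each

    // Asking for dchar makes the compiler decode code points on the fly.
    size_t points;
    foreach (dchar c; "你好")
        ++points;
    assert(points == 2);
}
```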
The major issue though is that while it still compiles, it becomes a *silent*
breaking change, which is the worst kind of change.

I can think of a few places where the foreach is written "foreach(c; s)"
*because* we specifically want to iterate over the raw code units (not the
decoded characters).

Just because the code still compiles doesn't mean it's still correct. These
kinds of changes can introduce some really insidious bugs, or performance
issues.

At this point, the only "solution" I really see would be to simply deprecate
and ban implicit char inference. Code would have to make an explicit choice.
After two or three years of this, we'd be able to safely assume that there is
no more implicit char inference out there, and then we'd be able to set the
default to dchar. Anything else (IMO) is just a disaster waiting to happen.

So I'd say I'd opt for "2)", and *once* we've had "2)" for a while, we can opt
to add "3)".

The question, though, is whether it's all worth it... I don't know.

> Note that we might still choose a runtime exception. It's hardly useful to get
> a char with value 0xE8 out of a char[]. But throwing a sudden exception is a
> breaking change that might be too risky to take on.

Except if you are explicitly iterating the code units, because you know your
needle is ASCII. Besides,
1) You'd *kill* performance.
2) You'd make a fair amount of nothrow functions throw.
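On point 2, a sketch (allAscii is a made-up name for illustration): explicit
code-unit iteration today involves no decoding, so nothing can throw and the
function can be nothrow and @nogc. If decoding could raise a runtime exception
on bad sequences, functions like this could no longer carry those attributes.

```d
// Hypothetical nothrow function: explicit char means no decoding, so nothing
// can throw. A runtime exception on invalid UTF would break these attributes.
bool allAscii(string s) nothrow @nogc @safe
{
    foreach (char c; s)
        if (c >= 0x80)
            return false;
    return true;
}

void main()
{
    assert(allAscii("hello"));
    assert(!allAscii("héllo"));   // é encodes as two bytes >= 0x80
}
```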

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------


More information about the Digitalmars-d-bugs mailing list