[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing

Wed Apr 29 03:56:40 PDT 2015

https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #15 from Walter Bright <bugzilla at digitalmars.com> ---
(In reply to Vladimir Panteleev from comment #14)
> If I understand correctly, throwing Error instead of Exception will also
> solve the performance issues

It still allocates memory. But it's worth thinking about. Maybe assert()?

> Ditto, but the @nogc aspect can also be solved with the refcounted
> exceptions spec, which will fix the problem in general.

We'll see. That's still a ways off.

> > 2. Same thing. (Running normalization on passwords? What the hell?)
> 
> I did not mean Unicode normalization - it was a joke (std.algorithm will
> "normalize" invalid UTF characters to the replacement character). But since
> .front on strings autodecodes, feeding a string to any generic range
> function in std.algorithm will cause auto-decoding (and thus, character
> substitution).

That can be fixed as I suggested.

> > The replacement char thing was not invented by me, it is commonplace as
> > users don't like their documents being wholly rejected for one or two bad
> > encodings.
> I know, I agree it's useful, but it needs to be opt-in.

Global opt-in for foreach is not feasible. However, one can add an algorithm
"validate" which throws on invalid UTF, and put that at the start of a
pipeline, as in:

    text.validate.A.B.C.D;
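
A minimal sketch of what such a pipeline head could look like, assuming a
hypothetical helper named `validated` (not an existing Phobos function) that
wraps the existing `std.utf.validate` and returns its argument so it can be
chained. The filter step stands in for A.B.C.D:

```d
import std.algorithm : filter;
import std.array : array;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: checks the whole string once and returns it unchanged,
// so it can sit at the head of a UFCS pipeline.
// Throws UTFException (via std.utf.validate) on invalid UTF.
string validated(string text) @safe pure
{
    validate(text);
    return text;
}

void main()
{
    // Valid input passes through to the rest of the pipeline.
    auto ok = "héllo wörld".validated.filter!(c => c != ' ').array;
    assert(ok == "héllowörld"d);

    // A lone 0xFF byte is never valid UTF-8, so validation rejects it
    // before any later stage runs.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto bad = cast(string) raw;
    assertThrown!UTFException(bad.validated);
}
```

With this arrangement the throwing behaviour is opt-in: only pipelines that
start with the validating step can throw on bad input.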

> > I know that many programs try to guess the encoding of random text they get.
> > Doing this by only reading a few characters, and assuming the rest, is a
> > strange method if one cares about the integrity of the data.
> 
> I don't see how this is relevant, sorry.

You brought up guessing the encoding of XML text by reading the start of it:
"what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

> > Having to constantly re-sanitize data, at every step in the pipeline, is
> > going to make D programs uncompetitive speed-wise.
> 
> I don't understand what you mean by this. You could say that any way to
> handle invalid UTF can be seen as a way of sanitizing data: there will
> always be a code path for what to do when invalid UTF is encountered. I
> would interpret "no sanitization" as not handling invalid UTF in any way
> (i.e. treating it in an undefined way).

If you have a pipeline A.B.C.D and A throws on invalid UTF, then B.C.D are
never executed. But if A does not throw, then B.C.D are guaranteed to be
getting valid UTF, yet they still pay the penalty of the compiler thinking
they can allocate memory and throw.
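
To illustrate the penalty, here is a rough sketch assuming `std.utf.byDchar`,
which substitutes U+FFFD for invalid sequences instead of throwing. The
function name `countReplaced` is made up for this example. Because the
decoding cannot throw, the stage can be marked nothrow @nogc, which, as far
as I know, a plain `foreach (dchar c; s)` over a string does not allow:

```d
import std.utf : byDchar;

// Hypothetical pipeline stage: counts how many code points were replaced
// by U+FFFD during decoding. The attributes are checked by the compiler;
// they hold because byDchar replaces invalid UTF rather than throwing.
size_t countReplaced(string s) @safe pure nothrow @nogc
{
    size_t n;
    foreach (dchar c; s.byDchar)
    {
        if (c == '\uFFFD') // the Unicode replacement character
            ++n;
    }
    return n;
}

void main()
{
    assert(countReplaced("plain ascii") == 0);

    // A lone 0xFF byte is invalid UTF-8 and is decoded as U+FFFD.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    assert(countReplaced(cast(string) raw) > 0);
}
```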

--