[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Wed Apr 29 03:35:06 PDT 2015


https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #14 from Vladimir Panteleev <thecybershadow at gmail.com> ---
(In reply to Walter Bright from comment #13)
> Vladimir, you bring up good points. I'll try to address them. First off, why
> do this?
> 
> 1. much faster

If I understand correctly, throwing an Error instead of an Exception would also
solve the performance issues.
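
To illustrate, here is a minimal sketch of what I have in mind (InvalidUtfError
and decodeAt are hypothetical names, not Phobos code): decoding can stay nothrow
because an Error, unlike an Exception, is allowed to escape a nothrow function.

    import std.utf : decode, UTFException;

    // Hypothetical Error type; not part of Phobos.
    class InvalidUtfError : Error
    {
        this(string msg) @safe pure nothrow { super(msg); }
    }

    // Decoding stays nothrow: only an Error escapes on bad input, so the
    // rest of the string-processing pipeline can be nothrow as well.
    dchar decodeAt(string s, ref size_t i) nothrow
    {
        try
            return decode(s, i);              // throws UTFException today
        catch (UTFException e)
            throw new InvalidUtfError(e.msg); // Errors may leave nothrow code
    }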

> 2. string processing can be @nogc and nothrow. If you follow external
> discussions on the merits of D, the "D is no good because Phobos requires
> the GC" ALWAYS comes up, and sucks all the energy out of the conversation.

Ditto, but the @nogc aspect can also be solved with the refcounted exceptions
spec, which will fix the problem in general.

> So, on to your points:
> 
> 1. Replacement only happens when doing a UTF decoding. S+R doesn't have to
> do conversion, and that's one of the things I want to fix in std.algorithm.
> The string fixes I've done in std.string avoid decoding as much as possible.

In practice, it is still very easy to accidentally use something that
auto-decodes. There is no way to statically make sure that you don't (short of
using a non-string type for text, which is impractical), and with this proposed
change there would be no run-time way to handle it either.
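
For example (a small sketch; the literal and the choice of algorithm are just
illustrative), an innocent-looking std.algorithm call decodes every code point,
and today the run-time UTFException is the only remaining safety net:

    import std.algorithm : count;
    import std.utf : UTFException;

    void main()
    {
        string s = "abc\xFFdef";            // \xFF is not valid UTF-8
        try
        {
            // Generic range code: .front decodes a dchar for every element.
            auto n = s.count!(c => c == 'a');
        }
        catch (UTFException e)
        {
            // Today the invalid byte is at least detectable at run time.
            // Under the proposed change it would silently decode as U+FFFD
            // and the count would proceed as if the data were fine.
        }
    }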

> 2. Same thing. (Running normalization on passwords? What the hell?)

I did not mean Unicode normalization - it was a joke (std.algorithm will
"normalize" invalid UTF characters to the replacement character). But since
.front on strings auto-decodes, feeding a string to any generic range function
in std.algorithm triggers decoding (and thus, under this proposal, character
substitution).

> The replacement char thing was not invented by me, it is commonplace as
> users don't like their documents being wholly rejected for one or two bad
> encodings.

I know, and I agree it's useful, but it needs to be opt-in.
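
As a sketch of what opt-in could look like (decodeReplacing is a hypothetical
helper, not an existing API), substitution would be something the caller
explicitly asks for:

    import std.utf : decode, UTFException;

    // Hypothetical opt-in decoder: substitutes U+FFFD instead of throwing.
    dchar decodeReplacing(string s, ref size_t i) nothrow
    {
        immutable start = i;
        try
            return decode(s, i);
        catch (UTFException e)
        {
            i = start + 1;   // skip one code unit: a simple recovery strategy
            return '\uFFFD'; // the Unicode replacement character
        }
    }

Whether the right spelling is a wrapper range, a template flag on decode, or
something else is a library-design question; the point is that replacement is
requested, not the silent default for all string processing.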

> I know that many programs try to guess the encoding of random text they get.
> Doing this by only reading a few characters, and assuming the rest, is a
> strange method if one cares about the integrity of the data.

I don't see how this is relevant, sorry.

> Having to constantly re-sanitize data, at every step in the pipeline, is
> going to make D programs uncompetitive speed-wise.

I don't understand what you mean by this. You could say that any way of
handling invalid UTF is a form of sanitizing data: there will always be a code
path deciding what to do when invalid UTF is encountered. I would interpret "no
sanitization" as not handling invalid UTF in any way (i.e. leaving its handling
undefined).
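
For what it's worth, here is what I would call explicit, one-time sanitization,
assuming std.encoding.sanitize (which replaces malformed sequences and returns
valid input unchanged); after that pass, later stages do not need to re-check:

    import std.encoding : sanitize;
    import std.utf : validate;

    void main()
    {
        string untrusted = "abc\xFFdef";   // may contain invalid UTF-8
        string clean = untrusted.sanitize; // one explicit substitution pass
        validate(clean);                   // decoding can no longer fail here,
                                           // so no re-sanitizing down the line
    }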

--

