[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing

Wed Apr 29 01:05:51 PDT 2015

https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #9 from Jonathan M Davis <issues.dlang at jmdavisProg.com> ---
Most string-based functions work perfectly well with invalid Unicode. Does find
care? Does startsWith? Does filter? The replacement character simply won't
match what you're looking for. The functions themselves don't care. The
replacement character is just another character. They need a way to deal with
invalid Unicode, but the replacement character deals with that beautifully.
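
A minimal sketch of that point, assuming std.utf.byDchar is used to get
decoding that substitutes the replacement character (the helper name
sampleWithBadByte is made up for illustration; it is not a Phobos function):

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.algorithm.searching : canFind, startsWith;
import std.utf : byDchar, replacementDchar;

// Hypothetical helper: "ab", one stray UTF-8 continuation byte (0x80), "cd".
// A lone continuation byte is never valid on its own.
string sampleWithBadByte()
{
    immutable(ubyte)[] raw = [0x61, 0x62, 0x80, 0x63, 0x64];
    return cast(string) raw;
}

void main()
{
    // byDchar decodes lazily and yields replacementDchar for the bad byte.
    auto chars = sampleWithBadByte().byDchar;

    // The algorithms neither know nor care that the input was invalid.
    assert(chars.startsWith("ab"));
    assert(chars.canFind('c'));
    assert(!chars.canFind('z'));

    // The bad byte shows up as just another character...
    assert(chars.canFind(replacementDchar));

    // ...which can be filtered out like any other if that is what you want.
    assert(chars.filter!(c => c != replacementDchar).equal("abcd"));
}
```

Nothing here needs a try/catch; the invalid byte simply never matches what the
caller is looking for.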

The concern is whether program input is valid - whether the user manages to
type in invalid Unicode due to bad terminal settings, or whether you get junk
off a socket, or whether a file has been corrupted. Anything that cares should
be checking that when the data enters the program so that the error can be
reported to whoever or wherever the data is coming from. Having it done via
exceptions later on disconnects the reporting of the error from the point when
it can actually be handled. What do you do if you read in an XML file and
process half of it before you hit invalid Unicode? If the whole file was read
into memory, then you may not even have any idea where that string came
from, and it's likely far too late to report to the user that they're opening a
corrupted file. That validation really needs to be done when the string enters
the program - not at some arbitrary point later in the program when the invalid
portion happens to be decoded. So, if you insist that all strings be validated,
then maybe throwing an Error makes sense, but an Exception sure doesn't. And
throwing an Error assumes that you always need to validate the Unicode in
strings, which definitely isn't the case when the replacement character is
used. Throwing an Error therefore forces everyone to validate the Unicode in
their strings whether they care or not. Using the replacement character works
for programs that don't care, and programs that do care about validating their
strings should be doing the validation up front anyway.
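
As a rough sketch of validating up front, assuming the program funnels
external data through one entry point: acceptInput and its sourceName
parameter are hypothetical names, while std.utf.validate is the existing
Phobos function that throws UTFException on malformed input.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical boundary function: takes raw bytes from a file or socket and
// either returns them as a validated string or fails immediately, while we
// still know where the data came from.
string acceptInput(immutable(ubyte)[] raw, string sourceName)
{
    auto text = cast(string) raw;
    try
    {
        validate(text); // throws UTFException if text is not well-formed
    }
    catch (UTFException e)
    {
        // Report against the source, at the point where it can be handled.
        throw new Exception("invalid UTF-8 in " ~ sourceName ~ ": " ~ e.msg);
    }
    return text;
}

void main()
{
    immutable(ubyte)[] good = [0x6F, 0x6B];       // "ok"
    immutable(ubyte)[] bad  = [0x6F, 0x80, 0x6B]; // stray continuation byte

    assert(acceptInput(good, "good.xml") == "ok");
    assertThrown!Exception(acceptInput(bad, "bad.xml"));
}
```

Code past this boundary can then decode freely; it either knows the string is
valid or has chosen not to care.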

So, given that code that cares about validation needs to validate up front
(and therefore doesn't care whether the replacement character is used later),
and that programs that don't care about validating their Unicode input will
work just fine with the replacement character, it seems to me that it makes
perfect sense to just use the replacement character rather than throwing.

--