[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing

Wed Apr 29 01:30:57 PDT 2015

https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #10 from Vladimir Panteleev <thecybershadow at gmail.com> ---
OK, I see from your post that you don't see many of the problems with the
replacement character. Let me show you some example problematic situations:

1.

Bob wants to update his company's documents to use the new name for his
product. He writes a program that does a recursive pattern search & replace in
a directory. After testing the program on a few sample files, he is satisfied
with the results, and runs the program on his company's document store.

Six months later, long after the documents went out of backup rotation, Sue
finds that some important historical documents have been irreversibly corrupted
and are now full of Unicode replacement characters encoded as UTF-8. Why?
Because these old documents did not use UTF-8, and Bob used D.
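
To make the failure mode concrete, here is a minimal sketch of what Bob's
program effectively does. It is written in Python purely for illustration (the
effect does not depend on the language), and the function name, file contents
and product names are made up for the example. It assumes the old document was
stored as Latin-1 and that invalid UTF-8 is silently decoded to U+FFFD.

```python
def lossy_search_replace(data: bytes, old: str, new: str) -> bytes:
    """Hypothetical search & replace that never fails on bad input.

    Invalid UTF-8 sequences are decoded to U+FFFD instead of raising.
    """
    text = data.decode("utf-8", errors="replace")
    return text.replace(old, new).encode("utf-8")


# An old document stored as Latin-1, not UTF-8 (assumed for this example).
original = "Café menu for OldName".encode("latin-1")

result = lossy_search_replace(original, "OldName", "NewName")

# The Latin-1 byte 0xE9 ("é") is gone; three bytes of U+FFFD took its place.
print(original)
print(result)
```

Once the result is written back, nothing in the file records which byte the
replacement character stood for, so the damage cannot be undone.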

2.

Bob is writing a secure server-side software package (let's say, a confidential
document store). He is using a std.algorithm-based hashing algorithm to store
the passwords securely. At some point, Mary signs up and creates a secure
password, which contains entirely Cyrillic letters (let's say, "ЭтоМойПароль").

Not long after, Eve successfully logs into Mary's account with the password
"ЯЯЯЯЯЯЯЯЯЯЯЯ". Why? Because the passwords just happened to be sent in some
non-UTF-8 encoding, and, since Bob used D, when "normalized" through
std.algorithm's replacement character substitution, all passwords of the same
length that consist entirely of non-ASCII characters have the same hash.
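
A minimal sketch of this collision, again in Python for illustration only. It
assumes the client sent the passwords as Windows-1251 and that the server
decodes them as UTF-8 with replacement before hashing; naive_hash is a
hypothetical stand-in for Bob's hashing code, not any real library function.

```python
import hashlib


def naive_hash(raw_password: bytes) -> str:
    """Hypothetical hashing step that decodes with replacement first."""
    text = raw_password.decode("utf-8", errors="replace")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Both passwords arrive as Windows-1251 bytes (assumed for this example).
mary = "ЭтоМойПароль".encode("cp1251")
eve = "ЯЯЯЯЯЯЯЯЯЯЯЯ".encode("cp1251")

# The raw bytes differ, but every byte is an invalid UTF-8 sequence on its
# own, so both inputs decode to the same run of twelve U+FFFD characters.
decoded_mary = mary.decode("utf-8", errors="replace")
decoded_eve = eve.decode("utf-8", errors="replace")

print(mary != eve)
print(decoded_mary == decoded_eve)
print(naive_hash(mary) == naive_hash(eve))
```

Hashing the raw bytes, or rejecting the invalid input outright, would keep the
two passwords distinct.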

Automatic use of the replacement character will come as a surprise to many
people who come from other languages. For example, in Delphi, strings are also
the de facto ubyte[] / void[] type - you can safely read a binary file into a
string, perform search and replace, and write it back, knowing that the result
will be exactly what you expected.

Furthermore, from your message it appears to me that you've missed the point of
my argument:

> What do you do if you read in an XML file and process half of it before you hit invalid Unicode?

You abort! This should not happen. Either the XML file is in an incorrect
encoding (which calls into question the integrity of all the data parsed so far -
what if it was some 8-bit encoding that only LOOKED like valid UTF-8?) or the
program should've sanitized the input first if it really didn't care about data
correctness. But this is an XML file, meaning it's very likely to be machine
generated - if it contains errors, it might indicate a problem somewhere else
in the system, which is why it's all the more important to abort and get the
user to figure out the true source of the problem. Ignoring the error here
reminds me of how PHP never stops on errors by default, or Basic's "ON ERROR
RESUME NEXT".
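
As a sketch of the "abort" behaviour argued for here (Python for illustration
only; decode_or_abort is a hypothetical helper name), a strict decode stops at
the first invalid sequence and reports where it was found, instead of carrying
on with substituted data:

```python
def decode_or_abort(data: bytes) -> str:
    """Decode UTF-8 strictly; refuse to continue on invalid input."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        # Surface the problem to the caller rather than hiding it.
        raise ValueError(
            f"input is not valid UTF-8 (first bad byte at offset {err.start})"
        ) from err


good = "<a>café</a>".encode("utf-8")
bad = "<a>café</a>".encode("latin-1")  # same text, wrong encoding

print(decode_or_abort(good))

try:
    decode_or_abort(bad)
    aborted = False
except ValueError as exc:
    aborted = True
    print("aborted:", exc)
```

A program that genuinely does not care about the invalid bytes can still opt in
to lossy decoding explicitly; the point is that the lossy path is a deliberate
choice rather than the default.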

> So, throwing an Error is forcing everyone to validate the Unicode in their strings whether they care or not, and using the replacement character will work, whereas the programs that do care about validating their strings should be doing the validation up front anyway.

Yes, but then there is no way to make sure you're not accidentally corrupting
data! Whereas currently we have only a runtime check against invalid UTF-8,
with this change we will have no check at all. With no automatic mechanism to
ensure that all text is sanitized before it gets into std.algorithm, it becomes
impossible to be sure that you're not accidentally corrupting data along the
way.

--