dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Mon Nov 8 08:18:51 UTC 2021

On Monday, 8 November 2021 at 08:11:12 UTC, FeepingCreature wrote:
> On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
>> It's much better than 0.0. 0.0 is indistinguishable from valid 
>> data, and is a very common valid value.
>>
>> NaN and ReplacementChar are not valid and are easily 
>> distinguished.
>
> No, that's exactly the problem. ReplacementChar is not easily 
> distinguished, because it's a valid Unicode character - that's 
> the whole point of it. So just like nan, it can propagate 
> arbitrarily far through your processing pipeline before some 
> downstream process decides that it actually doesn't like it.

Sorry, let me expand on this because I think it's the very core 
of the disagreement.

I feel you have two options with NaN/ReplacementChar. You can 
either just accept that this is what you get, and let it 
propagate throughout your entire pipeline. In that case it's no 
better than 0.0 - actually, NaN would be *worse*, because your 
process would be completely broken with no way to fix it, whereas 
at least with 0.0 you can maybe get some reasonably-usable data 
out.
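
To make that concrete, here is a minimal sketch. The `mean` helper and 
the readings are made up for illustration; the only assumption is that 
one bad input is encoded either as NaN or as 0.0.

```d
import std.math : isNaN;

// Hypothetical helper: plain average over some readings.
double mean(const double[] xs)
{
    double sum = 0.0;
    foreach (x; xs)
        sum += x;
    return sum / xs.length;
}

void main()
{
    // A single NaN poisons the whole result, far from where it came from.
    assert(isNaN(mean([1.0, double.nan, 3.0])));

    // With 0.0 the result is skewed, but it is still a usable number.
    assert(!isNaN(mean([1.0, 0.0, 3.0])));
}
```

Neither result tells you which line produced the bad reading.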

Or you can say that "we don't want to be generating 
NaN/ReplacementChar." Then where do you draw the line? At the 
process input/output boundary? But then the process needs to be 
fixed if it generates NaNs/U+FFFDs. So you want to move your 
signaling as close to the production site as possible. 
Preferably, you want to fail at the exact line where the 
problematic data was produced. So we're back at exceptions in 
foreach. (Actually, an exception in cast(string) would be the 
best.)
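
Roughly what I have in mind, as a sketch rather than a proposal for 
the actual language change: `checkedString` is a hypothetical helper 
standing in for a validating cast(string), built on `std.utf.validate`. 
The second half shows the alternative, where `std.utf.byDchar` 
substitutes `replacementDchar` and the error just travels on.

```d
import std.algorithm.searching : canFind;
import std.exception : assertThrown;
import std.utf : byDchar, replacementDchar, UTFException, validate;

// Hypothetical helper: a checked stand-in for cast(string).
string checkedString(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw; // the cast itself does no validation
    validate(s);               // throws UTFException on invalid UTF-8
    return s;
}

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];

    // Fail at the site of origin: the bad data never enters the pipeline.
    assertThrown!UTFException(checkedString(raw));

    // Let it propagate: the bad byte becomes an ordinary-looking U+FFFD.
    assert((cast(string) raw).byDchar.canFind(replacementDchar));
}
```

In the second case every downstream consumer has to know to look for 
U+FFFD, and none of them can tell where it came from.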

And that's why I think ReplacementChar/NaN are no better than 
0.0. You either embrace them fully as "valid" data, or you handle 
them at the site of origin; any compromise just makes you worse 
off than either extreme.