dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Mon Nov 8 08:11:12 UTC 2021

On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
> It's much better than 0.0. 0.0 is indistinguishable from valid 
> data, and is a very common valid value.
>
> NaN and ReplacementChar are not valid and are easily 
> distinguished.

No, that's exactly the problem. ReplacementChar is not easily 
distinguished, because it's a valid Unicode character - that's 
the whole point of it. So just like NaN, it can propagate 
arbitrarily far through your processing pipeline before some 
downstream process decides that it actually doesn't like it. And 
at that point you generally have no chance to recover the source 
of the issue - you know that something maybe has gone wrong, but 
you don't even know if it was in your process or in the input 
data. After all, if you were screening your input data for 
ReplacementChar, you could as easily have been screening it for 
invalid UTF-8 to begin with. So while yes it's marginally better 
than 0.0, because at least you know that *something* is wrong, it 
does as little as possible to help you locate the problem while 
technically informing you. And all the workarounds for that take 
the form of "throw everywhere a ReplacementChar could be 
generated." So IMO just do the equivalent of turning on 
FE_INVALID, and do that to begin with. There's no point in 
getting rid of throw sites when you just force the user to re-add 
them manually because they fulfill a genuine need.
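
To make the provenance problem concrete, here's a rough sketch. It's in Python rather than D purely because it's short; the byte strings and the helper name `decode_lossy` are made up for illustration. Once invalid bytes have been replaced, the result is identical to text that legitimately contained U+FFFD, whereas a strict decode fails at the source and reports the offset of the bad byte.

```python
def decode_lossy(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")


# Hypothetical inputs: one is corrupt, the other is valid UTF-8 that
# really does contain U+FFFD (encoded as EF BF BD).
corrupt = b"a\xffb"
genuine = "a\ufffdb".encode("utf-8")

# After lossy decoding the two are the same string, so a downstream
# consumer cannot tell which one came from bad input.
assert decode_lossy(corrupt) == decode_lossy(genuine)

# Strict decoding fails where the problem actually is, and says where.
try:
    corrupt.decode("utf-8")
except UnicodeDecodeError as err:
    bad_offset = err.start  # byte index of the first invalid byte
```

Screening for U+FFFD after the fact would flag `genuine` as well, which is the sense in which the marker tells you something is off but not what or where.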

IMO if you want to get rid of the exception overhead, I'd go the 
other way and make invalid Unicode an abort(). Check your input 
data, people.
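
Roughly what I mean by checking your input, again sketched in Python for brevity. The function name `check_input` is hypothetical, and `sys.exit` only stands in for a real abort() here: validate once at the boundary, fail hard with the byte offset, and nothing downstream ever has to decode defensively.

```python
import sys


def check_input(data: bytes) -> str:
    """Validate UTF-8 once, at the point where data enters the program."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Stand-in for abort(): stop immediately and name the offending byte.
        sys.exit(f"invalid UTF-8 at byte {err.start}")
```

Everything past this call can then assume well-formed text and needs neither throw sites nor replacement characters.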