[Issue 17861] UTF Decode fails with exception

Tue Oct 3 23:05:27 UTC 2017

https://issues.dlang.org/show_bug.cgi?id=17861

--- Comment #15 from Jonathan M Davis <issues.dlang at jmdavisProg.com> ---
(In reply to Jon Degenhardt from comment #14)
> Changing the default behavior for the individual functions would cause
> backward compatibility issues. Any thoughts on having run-time selectable
> behavior that would override the defaults? The default behavior could be
> left unchanged.
> 
> The two issues that come to mind:
> - Functions currently nothrow could lose that status if throw is an option.
> - Performance: Compile-time choices are faster than run-time.
> 
> The advantage of a run-time selectable behavior is that it would support the
> need many programs have for an application specific behavior. There is no
> single default appropriate for all cases.

In general, Walter is against having flags that determine the behavior of the
language, and that's essentially what you're suggesting, even if it's set at 
runtime rather than at compile time. The reality of the matter is that as much
as the current behavior sucks, it's trivial to work around it by calling decode
yourself. So, I really don't see any reason to make it configurable. That would
just make it so that you don't know what the code is designed to do when you
look at it.
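
As a rough sketch of that workaround (not code from this thread), the loop
below calls std.utf.decode by hand and asks for U+FFFD in place of an
exception. The helper name decodeLossy and the sample input are hypothetical,
and the sketch assumes a Phobos version whose decode accepts the
Yes.useReplacementDchar flag.

```d
import std.stdio : writeln;
import std.typecons : Yes;
import std.utf : decode, replacementDchar;

/// Hypothetical helper: decodes `s` one code point at a time, substituting
/// U+FFFD (replacementDchar) for invalid sequences instead of throwing.
dstring decodeLossy(const(char)[] s)
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
    {
        // decode advances `i` past the code units it consumed.
        result ~= decode!(Yes.useReplacementDchar)(s, i);
    }
    return result;
}

void main()
{
    // 'a', a stray continuation byte (invalid on its own), then 'b'.
    const(char)[] input = ['a', cast(char) 0x80, 'b'];
    writeln(decodeLossy(input));
}
```

Because the choice is spelled out at the call site, a reader can see which
policy the code uses without consulting a global setting.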

I think that it's far better to just be clear on how UTF decoding works in D
than to try and make anything at the language level configurable. The standard
library already provides the tools necessary to allow the programmer to choose
how they want to handle invalid UTF, even if the defaults aren't exactly ideal.
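
For illustration, here are two of those tools side by side: std.utf.validate
for a strict policy and std.utf.byDchar for a lenient one. This is only a
sketch; the helper names isValidUtf8 and toDcharsLossy are made up for the
example.

```d
import std.array : array;
import std.stdio : writeln;
import std.utf : byDchar, replacementDchar, UTFException, validate;

/// Hypothetical helper, strict policy: report whether `s` is valid UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on invalid input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

/// Hypothetical helper, lenient policy: decode lazily with byDchar, which
/// yields replacementDchar (U+FFFD) for invalid sequences.
dchar[] toDcharsLossy(const(char)[] s)
{
    return s.byDchar.array;
}

void main()
{
    // 'a', a stray continuation byte (invalid on its own), then 'b'.
    const(char)[] input = ['a', cast(char) 0x80, 'b'];
    writeln(isValidUtf8(input));
    writeln(toDcharsLossy(input));
}
```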

(In reply to Etienne from comment #13)
> You have to choose whether it's a bug or a feature. I think everyone is
> ready to live with that, but if you live up to it and consider it a feature
> it'll have to be documented. Just a 1 liner somewhere saying "Foreach
> (string) can throw unicode errors!"
> 
> That'll be a good solution to this issue, because right now everyone is
> forced to learn it the hard way. 
> 
> This being said, I don't see Google Chrome crashing every time it sees an
> invalid code point. I'm not sure anyone would think about catching that on
> the first try if they were to do an Ajax call. I'm also pretty sure they'd
> be happy with the code path where it doesn't throw when the invalid code
> point comes up. If you know of anyone doing software specifically for
> unicode validation, maybe they'd need to be warned but that's about it for me.
> 
> So yeah, just wave it as a feature or squash the bug, but don't stay in
> between forever.

If the spec isn't clear about the fact that decoding invalid UTF with foreach
will throw an exception, then the spec needs to be updated accordingly, but the
current behavior is very much as designed and not a bug. I have no idea if the
spec says anything about invalid UTF or not. I'd have to comb through it to
know for sure. But the spec is often missing details that it should have, and
when it does say something, it's sometimes terse enough that it's easily
missed. It wouldn't surprise me if it's stated somewhere in there and you just
missed it, and it wouldn't surprise me if it's not there at all. Regardless, I
completely agree that the spec should be clear on the matter.
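
The behavior under discussion can be reproduced in a few lines. In this
sketch, countCodePoints and its inputs are hypothetical. The exception is
caught as a plain Exception because the exact type comes from the runtime
rather than from std.utf (it is believed to be core.exception.UnicodeException,
but treat that as an assumption).

```d
import std.stdio : writeln;

/// Hypothetical helper: counts code points with a decoding foreach.
/// Returns -1 if the implicit decoding throws on invalid UTF-8.
long countCodePoints(const(char)[] s)
{
    long n = 0;
    try
    {
        // Asking for dchar makes foreach decode the UTF-8 code units.
        foreach (dchar c; s)
            ++n;
    }
    catch (Exception e)
    {
        return -1;
    }
    return n;
}

void main()
{
    // 'a', a stray continuation byte (invalid on its own), then 'b'.
    const(char)[] bad = ['a', cast(char) 0x80, 'b'];
    writeln(countCodePoints("abc"));
    writeln(countCodePoints(bad));
}
```

Iterating with foreach (char c; s) instead walks the raw code units and does
no decoding, so it does not throw on the same input.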
