[Issue 17861] UTF Decode fails with exception

Tue Oct 3 19:23:08 UTC 2017

https://issues.dlang.org/show_bug.cgi?id=17861

Jonathan M Davis <issues.dlang at jmdavisProg.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |issues.dlang at jmdavisProg.co
                   |                            |m
           Hardware|x86                         |All
                 OS|Windows                     |All

--- Comment #8 from Jonathan M Davis <issues.dlang at jmdavisProg.com> ---
This has been discussed before. There's a strong argument for making it so that
decode uses the replacement character by default (it's even what the Unicode
standard says you should do), and all string-based stuff then follows suit, at
which point anyone wanting exceptions would need to call decode manually with
the template argument indicating that that's what they wanted - which is the
opposite of what we have now. And Walter is actually in favor of using the
replacement character instead of exceptions and possibly even making the change
in spite of the issues, but there have been some folks who have been strongly
opposed to that. The problem is twofold:

1. Making the change risks silently breaking a ton of code.

2. Others (Vladimir in particular IIRC) have argued that it's harmful to have
the contents of strings silently changed, since there are cases where that
would be highly detrimental.

And on some level, all of this gets wrapped into the auto-decoding debate,
because that's the main reason that this is out of the control of the user.
front and popFront on strings call decode for you and call it in the way that
results in exceptions on invalid UTF instead of using the replacement
character. Anyone making the calls manually has the choice.
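To make the difference concrete, here is a minimal sketch of both paths. The
helper name firstCodePointLossy is made up for illustration; decode,
replacementDchar, UTFException and front are the existing Phobos symbols.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.typecons : Yes;
import std.utf : decode, replacementDchar, UTFException;

// Hypothetical helper: decode the first code point of s, getting U+FFFD
// back instead of an exception if the leading sequence is invalid.
dchar firstCodePointLossy(string s)
{
    size_t index = 0;
    return decode!(Yes.useReplacementDchar)(s, index);
}

void main()
{
    // 0xFF can never appear in valid UTF-8.
    string bad = "\xFF";

    // Auto-decoding path: front calls decode for you and throws.
    assertThrown!UTFException(bad.front);

    // Manual path: the caller chooses the replacement character.
    assert(firstCodePointLossy(bad) == replacementDchar);

    // Valid input decodes the same way on either path.
    assert(firstCodePointLossy("abc") == 'a');
}
```

The asserts only check the decoded value. How far the index advances past an
invalid sequence is left out on purpose, since this sketch does not rely on it.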

So, I think that the chances are very high that we would go with the
replacement character by default rather than exceptions (maybe not even have
the exceptions at all) if we were starting from scratch - just like we wouldn't
have auto-decoding if we were starting from scratch. But it's highly
questionable that we can get away with making the change now due to the
ramifications that it will have on existing code.

At this point, decoding code points without exceptions is in pretty much the
same boat as using strings with range-based code without auto-decoding: you
have to use wrappers like byCodeUnit and/or special-case your code on strings.
To avoid the exceptions on bad Unicode, you either have to not decode code
points at all, or you have to decode them yourself with std.utf.decode, passing
the template argument that selects the replacement character. No, that's not
ideal, but no one has been able to come up with a workable way to change the
status quo through any kind of reasonable deprecation process.
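As a rough sketch of those two workarounds: byCodeUnit avoids decoding
entirely, and a manual loop over std.utf.decode substitutes the replacement
character. The helper name decodeLossy is hypothetical, not something in
Phobos, and the invalid byte is placed at the end of the input to keep the
example simple.

```d
import std.typecons : Yes;
import std.utf : byCodeUnit, decode, replacementDchar;

// Hypothetical helper: decode a whole string by hand, substituting U+FFFD
// for invalid sequences instead of throwing.
dstring decodeLossy(string s)
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode!(Yes.useReplacementDchar)(s, i); // advances i
    return result;
}

void main()
{
    // 'a' followed by 0xFF, which is never valid in UTF-8.
    string bad = "a\xFF";

    // Workaround 1: iterate code units; nothing is decoded, nothing throws.
    auto units = bad.byCodeUnit;
    assert(units.length == 2);
    assert(units[1] == 0xFF);

    // Workaround 2: decode manually and ask for the replacement character.
    assert(decodeLossy(bad) == "a\uFFFD"d);
}
```

Neither approach changes what front and popFront do on a plain string; they
only sidestep it at the call site.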

--