The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Fri May 13 04:00:19 PDT 2016


On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> Ideally, algorithms would be Unicode aware as appropriate, but 
> the default would be to operate on code units with wrappers to 
> handle decoding by code point or grapheme. Then it's easy to 
> write fast code while still allowing for full correctness. 
> Granted, it's not necessarily easy to get correct code that 
> way, but anyone who wants full correctness without caring 
> about efficiency can just use ranges of graphemes. Ranges of 
> code points are rare regardless.

char[], wchar[] etc. can simply be made non-ranges, so that the 
user has to choose between .byCodePoint, .byCodeUnit (or 
.representation, as it already exists), .byGrapheme, or even 
higher-level units like .byLine or .byWord. Ranges of char and 
wchar, however, stay as they are today. That way it's harder to 
accidentally get it wrong.
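
For illustration, a minimal sketch of what that explicit choice 
looks like with the ranges Phobos already provides (byCodeUnit 
and byGrapheme exist in std.utf/std.uni today; a string-level 
.byCodePoint doesn't, so a plain foreach over dchar stands in 
for it here):

    import std.stdio : writeln;
    import std.utf : byCodeUnit;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "noël";

        // By code unit: no decoding, iterates raw UTF-8 units.
        foreach (c; s.byCodeUnit)
            writeln(cast(ubyte) c);

        // By code point: explicit decoding to dchar.
        foreach (dchar d; s)
            writeln(d);

        // By grapheme: user-perceived characters, so "e" plus a
        // combining diaeresis stays a single element.
        foreach (g; s.byGrapheme)
            writeln(g[]);
    }

Each loop states its unit of iteration up front; none of them 
relies on the implicit dchar decoding that the range primitives 
(front/popFront) perform on char[] today.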

>
> Based on what I've seen in previous conversations on 
> auto-decoding over the past few years (be it in the newsgroup, 
> on github, or at dconf), most of the core devs think that 
> auto-decoding was a major blunder that we continue to pay for. 
> But unfortunately, even if we all agree that it was a huge 
> mistake and want to fix it, the question remains of how to do 
> that without breaking tons of code - though since AFAIK, Andrei 
> is still in favor of auto-decoding, we'd have a hard time going 
> forward with plans to get rid of it even if we had come up with 
> a good way of doing so. But I would love it if we could get rid 
> of auto-decoding and clean up string handling in D.

There is a simple deprecation path that has already been 
suggested: `isInputRange` and friends can emit a helpful 
deprecation warning when they're instantiated with a type that 
currently triggers auto-decoding.
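
For concreteness, here is one way such a warning could be wired 
up at compile time. This is only a sketch under a hypothetical 
name (isInputRangeDeprecating), not a change to Phobos's actual 
isInputRange:

    import std.range.primitives : isInputRange;
    import std.traits : isNarrowString;

    template isInputRangeDeprecating(R)
    {
        // Emit a compile-time note whenever a char[]/wchar[]/
        // string slips through as a range, i.e. would auto-decode.
        static if (isNarrowString!R)
            pragma(msg, "warning: " ~ R.stringof ~ " used as a " ~
                "range auto-decodes; use .byCodeUnit, " ~
                ".byCodePoint or .byGrapheme explicitly");

        enum isInputRangeDeprecating = isInputRange!R;
    }

    // Instantiating it with a narrow string prints the message:
    static assert(isInputRangeDeprecating!string);

A real deprecation would go through the usual deprecation cycle, 
but the mechanism (a message emitted per offending instantiation) 
would be the same.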

