The Case Against Autodecode

Vladimir Panteleev via Digitalmars-d digitalmars-d at puremagic.com
Thu May 26 21:31:49 PDT 2016


On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
>> 4. Autodecoding is slow and has no place in high speed string 
>> processing.
>
> I would agree only with the amendment "...if used naively", 
> which is important. Knowledge of how autodecoding works is a 
> prerequisite for writing fast string code in D.

It is completely wasted mental effort.
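
For illustration, the "knowledge" in question boils down to
remembering to opt out, e.g. with std.utf.byCodeUnit (a sketch):

    unittest
    {
        import std.algorithm : count;
        import std.utf : byCodeUnit;

        string s = "hello, world";
        // Default: range primitives decode the string into dchar
        // elements, one code point at a time.
        auto decoded = s.count(',');
        // The fast spelling: one has to know to opt out explicitly.
        auto direct = s.byCodeUnit.count(',');
        assert(decoded == 1 && direct == 1);
    }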

>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the 
> right thing instead of having the user wonder separately for 
> each case. These uses don't need decoding, and the standard 
> library correctly doesn't involve it (or if it currently does 
> it has a bug):
>
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

As far as I can see, the language currently does not provide the 
facilities to implement the above without autodecoding.
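
With the library as it stands, avoiding decoding here means
rewriting the expression by hand, roughly as in the sketch
below, and it only stays correct because the needle is pure
ASCII:

    unittest
    {
        import std.algorithm : canFind, count;
        import std.utf : byCodeUnit;

        string s = "Hello, world!";
        auto n = s.byCodeUnit
                  .count!(c => "!()-;:,.?".byCodeUnit.canFind(c));
        assert(n == 2); // ',' and '!'
    }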

> However the following do require autodecoding:
>
> s.walkLength

Usage of the result of this expression will be incorrect in many 
foreseeable cases: it counts code points, which match neither 
the storage size nor what a user perceives as characters.
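
For example (a sketch; the accent is spelled as a combining
mark):

    unittest
    {
        import std.range : walkLength;
        import std.uni : byGrapheme;

        string s = "cafe\u0301";               // "café"
        assert(s.length == 6);                 // UTF-8 code units
        assert(s.walkLength == 5);             // code points (the autodecoded count)
        assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
    }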

> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation

Ditto.

> s.count!(c => c >= 32) // non-control characters

Ditto, with a big red flag. If you are dealing with control 
characters, the code is likely low-level enough that you need to 
be explicit about what you are counting; a count of decoded code 
points is likely not what actually needs to be counted. Such 
confusion can lead to security risks.
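
A sketch of the discrepancy; which count is "right" depends
entirely on what the code is actually doing:

    unittest
    {
        import std.algorithm : count;
        import std.utf : byCodeUnit;

        // 'ä' is two UTF-8 code units; '\n' is a control character.
        string s = "ä\n";
        assert(s.count!(c => c >= 32) == 1);            // decoded code points
        assert(s.byCodeUnit.count!(c => c >= 32) == 2); // raw code units
    }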

> Currently the standard library operates at code point level 
> even though inside it may choose to use code units when 
> admissible. Leaving such a decision to the library seems like a 
> wise thing to do.

It should be explicit.
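
That is, the caller should name the level it wants, along the
lines of this sketch:

    unittest
    {
        import std.range : walkLength;
        import std.uni : byGrapheme;
        import std.utf : byCodeUnit, byDchar;

        string s = "weiß";
        assert(s.byCodeUnit.walkLength == 5);  // code units
        assert(s.byDchar.walkLength == 4);     // code points
        assert(s.byGrapheme.walkLength == 4);  // graphemes
    }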

>> 7. Autodecode cannot be used with unicode path/filenames, 
>> because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames. 
>> It turns
>> out in the wild that pure Unicode is not universal - there's 
>> lots of
>> dirty Unicode that should remain unmolested, and autodecode does 
>> not play
>> with that.
>
> If paths are not UTF-8, then they shouldn't have string type 
> (instead use ubyte[] etc). More on that below.

This is not practical. Do you really see changing std.file and 
std.path to accept ubyte[] for all path arguments?
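
The underlying problem, as a sketch (the cast stands in for
bytes received from the operating system):

    unittest
    {
        import std.exception : assertThrown;
        import std.range.primitives : front;
        import std.string : representation;
        import std.utf : UTFException;

        // A filename-like value holding bytes that are not valid UTF-8.
        ubyte[] raw = [0xFF, 'f', 'o', 'o'];
        auto dirty = cast(string) raw;

        assertThrown!UTFException(dirty.front);      // autodecoding throws
        assert(dirty.representation.front == 0xFF);  // raw code units are untouched
    }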

>> 8. In my work with UTF-8 streams, dealing with autodecode has 
>> caused me
>> considerably extra work every time. A convenient timesaver it 
>> ain't.
>
> Objection. Vague.

I can confirm this vague subjective observation. For example, 
DustMite reimplements some std.string functions in order to be 
able to handle D files with invalid UTF-8 characters.

>> 9. Autodecode cannot be turned off, i.e. it isn't practical to 
>> avoid
>> importing std.array one way or another, and then autodecode is 
>> there.
>
> Turning off autodecoding is as easy as inserting 
> .representation after any string. (Not to mention using 
> indexing directly.)

This is neither easy nor practical, and it makes writing 
reliable string-handling code in D a chore. Because it is 
difficult to find every place where this must be done, it cannot 
be applied on a program-wide scale; bugs are discovered only 
when some component fails because it was never tested with 
Unicode strings.
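
Concretely, .representation changes the static type, so every
call site has to be adjusted by hand, and any spot that is
missed silently falls back to decoding (a sketch):

    unittest
    {
        import std.string : representation;

        string s = "résumé";
        auto u = s.representation;  // immutable(ubyte)[], not a string any more
        static assert(!is(typeof(u) == string));
    }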

>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
>> key
>> benefit of being arrays in the first place.
>
> First off, you always have the option with .representation. 
> That's a great name because it gives you the type used to 
> represent the string - i.e. an array of integers of a specific 
> width.
>
> Second, it's as it should be. The entire scaffolding rests on the 
> notion that char[] is distinguished from ubyte[] by having UTF8 
> code units, not arbitrary bytes. It seems that many arguments 
> against autodecoding are in fact arguments in favor of 
> eliminating virtually all distinctions between char[] and 
> ubyte[]. Then the natural question is, what _is_ the difference 
> between char[] and ubyte[] and why do we need char as a 
> separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous 
> answer.

Why?

> What is the purpose of char, wchar, and dchar? My current 
> understanding is that they're justified as pretty much 
> indistinguishable in primitives and behavior from ubyte, 
> ushort, and uint respectively, but they reflect a loose 
> subjective intent from the programmer that they hold actual UTF 
> code units. The core language does not enforce such, except it 
> does special things in random places like for loops (any other)?
>
> If char is to be distinct from ubyte, and char[] is to be 
> distinct from ubyte[], then autodecoding does the right thing: 
> it makes sure they are distinguished in behavior and embodies 
> the assumption that char is, in fact, a UTF8 code point.

I don't follow this line of reasoning at all.

>> 11. Indexing an array produces different results than 
>> autodecoding,
>> another glaring special case.
>
> This is a direct consequence of the fact that string is 
> immutable(char)[] and not a specific type. That error predates 
> autodecoding.

There is no convincing argument why indexing and slicing should 
not simply operate on code units.
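
For reference, the current special case as a sketch:

    unittest
    {
        import std.range.primitives : front;

        string s = "éclair";
        static assert(is(typeof(s[0]) == immutable(char))); // indexing: a code unit
        static assert(is(typeof(s.front) == dchar));        // iteration: a code point
        assert(s[0] != s.front);   // 0xC3, the first byte of 'é', vs U+00E9
        assert(s[0 .. 2] == "é");  // slicing operates on code units as well
    }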

> Overall, I think the one way to make real steps forward in 
> improving string processing in the D language is to give a 
> clear answer of what char, wchar, and dchar mean.

I don't follow. Though, making char implicitly convertible to 
wchar and dchar has clearly been a mistake.
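
A sketch of why that conversion is a trap: a char holds a UTF-8
code unit, and reinterpreting its numeric value as a code point
silently changes what it means.

    unittest
    {
        string s = "é";    // two UTF-8 code units: 0xC3, 0xA9
        char c = s[0];     // a lone code unit, not a character
        dchar d = c;       // accepted implicitly today
        assert(d == 0xC3); // now means 'Ã' (U+00C3), not 'é' (U+00E9)
    }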


