The Case Against Autodecode

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Thu May 26 09:00:54 PDT 2016


This might be a good time to discuss this a tad further. I'd appreciate 
if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was 
disastrous and probably one of the largest weaknesses of D1. The 
decision in D2 to use immutable(char)[] for strings is a vast 
improvement but still has a number of issues. The approach to 
autodecoding in Phobos is an improvement on that decision. The insistent 
shunning of a user-defined type to represent strings is not good and we 
need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>  > I am as unclear about the problems of autodecoding as I am about the
> necessity
>  > to remove curl. Whenever I ask I hear some arguments that work well
> emotionally
>  > but are scant on reason and engineering. Maybe it's time to rehash
> them? I just
>  > did so about curl, no solid argument seemed to come together. I'd be
> curious of
>  > a crisp list of grievances about autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do.
> This is a glaring inconsistency.

Agreed. At the point of that decision, the party line was "arrays of 
characters are strings, nothing else is or should be". Now it is 
apparent that shouldn't have been the case.

> 2. Every time one wants an algorithm to work with both strings and
> ranges, you wind up special casing the strings to defeat the
> autodecoding, or to decode the ranges. Having to constantly special case
> it makes for more special cases when plugging together components. These
> issues often escape detection when unittesting because it is convenient
> to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.

> 3. Wrapping an array in a struct with an alias this to an array turns
> off autodecoding, another special case.

This is also a consequence of 1.

> 4. Autodecoding is slow and has no place in high speed string processing.

I would agree only with the amendment "...if used naively", which is 
important. Knowledge of how autodecoding works is a prerequisite for 
writing fast string code in D. Also, little code should deal with one 
code unit or code point at a time; instead, it should use standard 
library algorithms for searching, matching etc. When needed, iterating 
every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed 
tactically. The test c < 0x80 is highly predictable (in ASCII-heavy 
text) and therefore easily speculated. We can and should arrange code 
to minimize its impact.
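To illustrate, here is a minimal sketch of that tactical arrangement (not Phobos code; countNonAscii is a hypothetical name): single-byte code units take the fast, easily predicted branch, and only multi-byte sequences pay for decoding.

```d
import std.utf : decode;

// Count code points outside ASCII, touching the decoder only when needed.
size_t countNonAscii(string s)
{
    size_t result = 0;
    for (size_t i = 0; i < s.length; )
    {
        if (s[i] < 0x80)
        {
            ++i;            // single-byte (ASCII) code unit: fast path
        }
        else
        {
            decode(s, i);   // decode advances i past the whole sequence
            ++result;
        }
    }
    return result;
}

unittest
{
    assert(countNonAscii("hello") == 0);
    assert(countNonAscii("héllo") == 1);
}
```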

> 5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing 
instead of having the user wonder separately for each case. These uses 
don't need decoding, and the standard library correctly doesn't involve 
it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level, even though 
internally it may choose to use code units when admissible. Leaving such 
a decision to the library seems like a wise thing to do.
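The code unit/code point split behind these two lists can be seen directly; a minimal illustration, assuming an ordinary Phobos build:

```d
import std.range : walkLength;

void main()
{
    string s = "café";         // 'é' (U+00E9) encodes as two UTF-8 code units
    assert(s.length == 5);     // .length counts code units
    assert(s.walkLength == 4); // range iteration autodecodes to code points
}
```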

> 6. Autodecoding has two choices when encountering invalid code units -
> throw or produce an error dchar. Currently, it throws, meaning no
> algorithms using autodecode can be made nothrow.

Agreed. This is probably the most glaring mistake. I think we should 
open a discussion on fixing this everywhere in the stdlib, even at the 
cost of breaking code.

> 7. Autodecode cannot be used with unicode path/filenames, because it is
> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
> out in the wild that pure Unicode is not universal - there's lots of
> dirty Unicode that should remain unmolested, and autodecode does not play
> with that.

If paths are not UTF-8, then they shouldn't have string type (instead 
use ubyte[] etc). More on that below.

> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
> considerably extra work every time. A convenient timesaver it ain't.

Objection. Vague.

> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecode is there.

Turning off autodecoding is as easy as inserting .representation after 
any string. (Not to mention using indexing directly.)
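Concretely, a small sketch of what .representation gives you (the byte values below assume UTF-8, where 'é' encodes as 0xC3 0xA9):

```d
import std.string : representation;

void main()
{
    string s = "café";
    auto bytes = s.representation;  // immutable(ubyte)[]: no autodecoding
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 5);      // raw code units
    assert(bytes[4] == 0xA9);       // second byte of 'é'
}
```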

> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
> benefit of being arrays in the first place.

First off, you always have the option with .representation. That's a 
great name because it gives you the type used to represent the string - 
i.e. an array of integers of a specific width.

Second, it's as it should be. The entire scaffolding rests on the notion 
that char[] is distinguished from ubyte[] by having UTF8 code units, not 
arbitrary bytes. It seems that many arguments against autodecoding are 
in fact arguments in favor of eliminating virtually all distinctions 
between char[] and ubyte[]. Then the natural question is, what _is_ the 
difference between char[] and ubyte[] and why do we need char as a 
separate type from ubyte?

This is a fundamental question for which we need a rigorous answer. What 
is the purpose of char, wchar, and dchar? My current understanding is 
that they're justified as pretty much indistinguishable in primitives 
and behavior from ubyte, ushort, and uint respectively, but they reflect 
a loose subjective intent from the programmer that they hold actual UTF 
code units. The core language does not enforce such, except that it does 
special things in random places, like for loops (are there others?).
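The for-loop special case looks like this in practice: the element type named in a foreach over a string decides whether the language decodes on the fly (a minimal illustration):

```d
void main()
{
    string s = "é";                 // one code point, two UTF-8 code units
    int units = 0, points = 0;
    foreach (char c; s)  ++units;   // element type char: raw code units
    foreach (dchar d; s) ++points;  // element type dchar: language decodes
    assert(units == 2);
    assert(points == 1);
}
```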

If char is to be distinct from ubyte, and char[] is to be distinct from 
ubyte[], then autodecoding does the right thing: it makes sure they are 
distinguished in behavior and embodies the assumption that char is, in 
fact, a UTF-8 code unit.

> 11. Indexing an array produces different results than autodecoding,
> another glaring special case.

This is a direct consequence of the fact that string is 
immutable(char)[] and not a specific type. That error predates autodecoding.

Overall, I think the one way to make real progress in improving string 
processing in the D language is to give a clear answer to what char, 
wchar, and dchar mean.


Andrei


