The Case Against Autodecode

Walter Bright via Digitalmars-d <digitalmars-d at puremagic.com>
Fri May 27 10:11:17 PDT 2016


On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
> My thesis: the D1 design decision to represent strings as char[] was disastrous
> and probably one of the largest weaknesses of D1. The decision in D2 to use
> immutable(char)[] for strings is a vast improvement but still has a number of
> issues.

The mutable vs. immutable decision has nothing to do with autodecoding.


> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>> 2. Every time one wants an algorithm to work with both strings and
>> ranges, you wind up special casing the strings to defeat the
>> autodecoding, or to decode the ranges. Having to constantly special case
>> it makes for more special cases when plugging together components. These
>> issues often escape detection when unittesting because it is convenient
>> to unittest only with arrays.
>
> This is a consequence of 1. It is at least partially fixable.

It's a consequence of autodecoding, not arrays.
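
To make the special-casing concrete, here is a minimal sketch (mine, not from 
the thread; countUnits is a made-up name) of a generic algorithm that has to 
peel strings off to defeat autodecoding:

    import std.traits : isSomeString;
    import std.utf : byCodeUnit;

    size_t countUnits(R)(R r)
    {
        static if (isSomeString!R)
            return countUnits(r.byCodeUnit); // special case: defeat autodecoding
        else
        {
            size_t n;
            foreach (e; r) ++n; // generic path, works on any input range
            return n;
        }
    }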


>> 4. Autodecoding is slow and has no place in high speed string processing.
> I would agree only with the amendment "...if used naively", which is important.
> Knowledge of how autodecoding works is a prerequisite for writing fast string
> code in D.

Having written high-speed string processing code in D that also deals with 
Unicode (i.e. Warp), the only knowledge of autodecoding I needed was how to 
keep it from happening. Autodecoding made the code slower than necessary in 
every case it was used. I found no place in Warp where autodecoding was 
desirable.


> Also, little code should deal with one code unit or code point at a
> time; instead, it should use standard library algorithms for searching, matching
> etc.

That doesn't work so well. There always seems to be a need for custom string 
processing. Worse, when pipelining strings, autodecoding changes the element 
type to dchar, which then needs to be re-encoded into the result.
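
A minimal sketch of that type change, using only Phobos: an identity map over 
a string already comes out decoded, so materializing it yields dchar[]:

    import std.algorithm : map;
    import std.array : array;
    import std.range : ElementType;

    void main()
    {
        string s = "hello";
        auto r = s.map!(c => c); // identity map, but elements get decoded
        static assert(is(ElementType!(typeof(r)) == dchar));
        auto a = r.array;        // comes out as dchar[], not char[]
        static assert(is(typeof(a) == dchar[]));
    }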

The std.string algorithms I wrote all work much better (i.e. faster) without 
autodecoding, while maintaining proper Unicode support. That is, autodecoding 
did not benefit those algorithms at all; and if the user is to rely on standard 
algorithms instead of custom ones, then autodecoding is not necessary.


> When needed, iterating every code unit is trivially done through indexing.

This implies replacing pipelining with loops, and also falls apart if indexing 
is redone to index by code points.
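
A sketch of the contrast (countSpacesLoop and countSpacesPipe are made-up 
helpers): indexing forces an explicit loop, while a code-unit range keeps the 
pipeline style without decoding:

    import std.algorithm : count;
    import std.utf : byCodeUnit;

    size_t countSpacesLoop(string s)
    {
        size_t n;
        foreach (i; 0 .. s.length)
            if (s[i] == ' ') ++n; // indexing yields code units, loop style
        return n;
    }

    size_t countSpacesPipe(string s)
    {
        return s.byCodeUnit.count(' '); // pipeline style, no decoding
    }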


> Also allow me to point out that much of the slowdown can be addressed tactically.
> The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
> easily speculated. We can and we should arrange code to minimize impact.

I.e. special case the code to avoid autodecoding.

The trouble is that the low level code cannot avoid autodecoding, as it happens 
before the low level code gets it. This is conceptually backwards, and winds up 
requiring every algorithm to special case strings, even when completely 
unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.)

When teaching people how to write algorithms, having to write every one twice, 
once generically for ranges and arrays, and again as a specialization for 
strings even when decoding is never necessary (such as for 'copy'), is 
embarrassing.
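
For instance, here is a sketch of a decode-free copy (not how Phobos actually 
specializes it): copying a string unit-for-unit preserves it exactly, so 
decoding buys nothing:

    import std.algorithm : copy;
    import std.utf : byCodeUnit;

    void main()
    {
        string src = "héllo";
        auto dst = new char[src.length];
        src.byCodeUnit.copy(dst); // unit-for-unit; no decode, no re-encode
        assert(dst == src);
    }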


>> 5. Very few algorithms require decoding.
> The key here is leaving it to the standard library to do the right thing instead
> of having the user wonder separately for each case. These uses don't need
> decoding, and the standard library correctly doesn't involve it (or if it
> currently does it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level even though inside
> it may choose to use code units when admissible. Leaving such a decision to the
> library seems like a wise thing to do.

Running my char[] through a pipeline and having it come out sometimes as char[] 
and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.
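
For reference, a small sketch of the unit/point split behind the quoted list: 
.length counts code units, while walkLength decodes and counts code points:

    import std.range : walkLength;

    void main()
    {
        string s = "héllo";        // 'é' is two UTF-8 code units
        assert(s.length == 6);     // code units
        assert(s.walkLength == 5); // code points, via decoding
    }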


>> 6. Autodecoding has two choices when encountering invalid code units -
>> throw or produce an error dchar. Currently, it throws, meaning no
>> algorithms using autodecode can be made nothrow.
> Agreed. This is probably the most glaring mistake. I think we should open a
> discussion on fixing this everywhere in the stdlib, even at the cost of breaking
> code.

A third option is to pass the invalid code units through unmolested, which won't 
work if autodecoding is used.
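
A sketch of the two behaviors that do exist today ("\xFF" is just an arbitrary 
invalid byte): the decoding path throws, while a code-unit view passes the bad 
unit through untouched:

    import std.exception : assertThrown;
    import std.utf : byCodeUnit, decode, UTFException;

    void main()
    {
        string s = "\xFF"; // an invalid UTF-8 sequence
        size_t i = 0;
        assertThrown!UTFException(s.decode(i)); // autodecoding path: throws
        foreach (c; s.byCodeUnit) {}            // code-unit path: unmolested
    }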


>> 7. Autodecode cannot be used with unicode path/filenames, because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
>> out in the wild that pure Unicode is not universal - there's lots of
> dirty Unicode that should remain unmolested, and autodecode does not play well
>> with that.
> If paths are not UTF-8, then they shouldn't have string type (instead use
> ubyte[] etc). More on that below.

Requiring code units to be 100% valid is not workable, nor is recasting them 
as ubyte[]. More on that below.


>> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
>> considerably extra work every time. A convenient timesaver it ain't.
> Objection. Vague.

Sorry I didn't log the time I spent on it.


>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
>> importing std.array one way or another, and then autodecode is there.
> Turning off autodecoding is as easy as inserting .representation after any
> string.

.representation changes the type to ubyte[]. All knowledge that this is a 
Unicode string then gets lost for the rest of the pipeline.
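
A minimal illustration of that type change:

    import std.string : representation;

    void main()
    {
        string s = "hello";
        auto r = s.representation; // the "this is text" information is gone
        static assert(is(typeof(r) == immutable(ubyte)[]));
    }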


> (Not to mention using indexing directly.)

Doesn't work if you're pipelining.


>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's a great name
> because it gives you the type used to represent the string - i.e. an array of
> integers of a specific width.

I found .representation to be unworkable because it changed the type.


>> 11. Indexing an array produces different results than autodecoding,
>> another glaring special case.
> This is a direct consequence of the fact that string is immutable(char)[] and
> not a specific type. That error predates autodecoding.

Even if it is made a special type, the problem of what an index means will 
remain. Of course, indexing by code point is an O(n) operation, which I submit 
is surprising and shouldn't be supported as [i] even by a special type (for the 
same reason that indexing of linked lists is frowned upon). Giving up indexing 
means giving up efficient slicing, which would be a major downgrade for D.
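
A sketch of why code-point indexing is O(n) (nthCodePoint is a hypothetical 
helper, not a Phobos function): reaching the n-th code point means decoding 
everything before it:

    import std.range : dropExactly;
    import std.utf : byDchar;

    // What a code-point index would have to do internally.
    dchar nthCodePoint(string s, size_t n)
    {
        return s.byDchar.dropExactly(n).front; // walks n code points: O(n)
    }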


> Overall, I think the one way to make real steps forward in improving string
> processing in the D language is to give a clear answer of what char, wchar, and
> dchar mean.

They mean code units. This is not ambiguous. Here is how a code unit differs 
from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've 
programmed in C forever. In C, char is used for both small integers and 
characters. It's always been a source of confusion, and sometimes bugs, to 
conflate the two:

      struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. 
It's a definite improvement in D that they are distinguished, and I feel that 
improvement every time I have to deal with C/C++ code and see 'char' used as a 
small integer instead of a character.

B. Overloading treats them differently, and that's good. For example, 
writeln(T[]) produces different results for char[] and ubyte[], and this is 
unsurprising and expected. It "just works".

C. More overloading:

       writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type 
dchar? (The trouble with that is type inference when building up more complex 
types, as you'll wind up with hidden dchar[] if not careful. My experience with 
dchar[] is that it is almost never desirable, as it is too memory hungry.)
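
A sketch of the current behavior and the inference concern:

    void main()
    {
        static assert(is(typeof('a') == char));   // today: one code unit
        auto arr = ['a', 'b'];
        static assert(is(typeof(arr) == char[])); // inference stays compact
        // If 'a' were typed dchar, arr would silently become dchar[],
        // four bytes per element.
    }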


