The Case Against Autodecode

Sun May 15 16:10:38 PDT 2016

On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about
> > the necessity to remove curl. Whenever I ask I hear some arguments
> > that work well emotionally but are scant on reason and engineering.
> > Maybe it's time to rehash them? I just did so about curl, no solid
> > argument seemed to come together. I'd be curious of a crisp list of
> > grievances about autodecoding. -- Andrei
>

Given the importance of performance in the auto-decoding topic,
it seems reasonable to quantify it. I took a stab at this. It
would of course be prudent to have others conduct a similar
analysis rather than rely on my numbers alone.

Measurements were done using an artificial scenario: counting
lower-case ASCII letters. This has the effect of calling
front/popFront many times on a long block of text. Each run was
done twice, once treating the text as char[] and once as ubyte[],
and the run times were compared. (char[] performs auto-decoding,
ubyte[] does not.)
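
As a rough illustration of the kind of comparison described above (this is not the linked test program; the function name and sample text are made up for the example), the same counting routine can be run over a string and over its raw bytes:

```d
import std.algorithm.searching : count;
import std.stdio : writeln;
import std.string : representation;

/// Counts lower-case ASCII letters in any input range of char-like elements.
/// (Hypothetical helper, not taken from the original test program.)
size_t countLowerAscii(R)(R text)
{
    return text.count!(c => c >= 'a' && c <= 'z');
}

void main()
{
    string sample = "Grüße, world"; // stand-in text, not the original data set

    // A string is a narrow string: front/popFront decode UTF-8 into dchar.
    auto viaChars = countLowerAscii(sample);

    // representation() views the same memory as immutable(ubyte)[]: no decoding.
    auto viaBytes = countLowerAscii(sample.representation);

    writeln(viaChars, " ", viaBytes); // both counts agree
}
```

The two counts agree because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII letter. Only the iteration cost differs. The sketch uses string (immutable(char)[]) rather than char[]; both are auto-decoded the same way.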

Timings were done with DMD and LDC, and on two different data
sets. One data set was a mix of Latin-script languages (e.g. German,
English, Finnish, etc.), the other non-Latin languages (e.g.
Japanese, Chinese, Greek, etc.). The goal was to distinguish
between scenarios with high and low ASCII character content.

The result: For DMD, auto-decoding made the runs 1.6x to 2.6x
slower (char[] time relative to ubyte[] time). For LDC, 12.2x to
12.9x slower.

Details:
- Test program: https://dpaste.dzfl.pl/67c7be11301f
- DMD 2.071.0. Options: -release -O -boundscheck=off -inline
- LDC 1.0.0-beta1 (based on DMD v2.070.2). Options: -release -O -boundscheck=off
- Machine: MacBook Pro (2.8 GHz Intel i7, 16 GB RAM)

Runs for each combination were done five times and the median 
times used. The median times and the char[] to ubyte[] ratio are 
below:
| Compiler | Text type | char[] time (ms) | ubyte[] time (ms) | ratio |
|----------+-----------+------------------+-------------------+-------|
| DMD      | Latin     |             7261 |              4513 |   1.6 |
| DMD      | Non-Latin |            10240 |              3928 |   2.6 |
| LDC      | Latin     |            11773 |               913 |  12.9 |
| LDC      | Non-Latin |            10756 |               883 |  12.2 |

Note: The numbers above don't provide enough info to derive a
front/popFront rate. The program artificially repeats its loops
to lengthen the run times. (For these runs, the program's
repeat count was set to 20.)
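
For anyone repeating the measurement, the "five runs, take the median" procedure could be sketched roughly as follows. The helper names are hypothetical, and std.datetime.stopwatch is assumed to be available, which requires a newer Phobos than the compiler versions listed above:

```d
import std.algorithm.sorting : sort;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

/// Median of an odd number of samples (sorts a copy, input is left untouched).
long median(const long[] samples)
{
    assert(samples.length % 2 == 1, "expects an odd number of samples");
    auto sorted = samples.dup;
    sort(sorted);
    return sorted[$ / 2];
}

/// Runs `work` `runs` times and returns the median wall-clock time in ms.
long medianMsecs(scope void delegate() work, size_t runs = 5)
{
    auto times = new long[runs];
    foreach (ref t; times)
    {
        auto sw = StopWatch(AutoStart.yes);
        work();
        t = sw.peek().total!"msecs";
    }
    return median(times);
}

void main()
{
    // Made-up sample timings, only to show the median selection.
    writeln(median([17L, 3, 9, 4, 12])); // prints 9
}
```

The median is less sensitive than the mean to a single slow run caused by unrelated system activity.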

Characteristics of the two data sets:
| Text type |   Bytes |  DChars | ASCII chars | Bytes per DChar | Pct ASCII |
|-----------+---------+---------+-------------+-----------------+-----------|
| Latin     | 4156697 | 4059016 |     3965585 |           1.024 |     97.7% |
| Non-Latin | 4061554 | 1949290 |      348164 |           2.084 |     17.9% |
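
The two derived columns follow from the three counts: bytes per DChar is Bytes / DChars, and the ASCII percentage in the table is consistent with ASCII chars / DChars. A small sketch of how such statistics could be gathered for a piece of text (the struct and function names are invented for the example and are not from the test program):

```d
import std.algorithm.searching : count;
import std.range.primitives : walkLength;
import std.stdio : writefln;
import std.string : representation;

/// Hypothetical container for the per-data-set characteristics.
struct TextStats
{
    size_t bytes;      // UTF-8 code units
    size_t dchars;     // decoded code points
    size_t asciiChars; // code points below 0x80

    double bytesPerDChar() const { return cast(double) bytes / dchars; }
    double pctAscii() const { return 100.0 * asciiChars / dchars; }
}

TextStats textStats(string text)
{
    return TextStats(
        text.length,                               // byte count, no decoding
        text.walkLength,                           // walks the string, decoding each dchar
        text.representation.count!(b => b < 0x80)  // an ASCII char is one byte below 0x80
    );
}

void main()
{
    // Counts copied from the Latin row of the table above.
    auto latin = TextStats(4156697, 4059016, 3965585);
    writefln("%.3f %.1f%%", latin.bytesPerDChar, latin.pctAscii); // prints 1.024 97.7%
}
```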

Run-to-run variability - The run times recorded were quite 
stable. The largest delta between minimum and median time for any 
group was 17 milliseconds.