The Case Against Autodecode

Fri May 27 04:19:33 PDT 2016

On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
[snip]
>
> I would agree only with the amendment "...if used naively", 
> which is important. Knowledge of how autodecoding works is a 
> prerequisite for writing fast string code in D. Also, little 
> code should deal with one code unit or code point at a time; 
> instead, it should use standard library algorithms for 
> searching, matching etc. When needed, iterating every code unit 
> is trivially done through indexing.

I disagree. "If used naively" shouldn't be the default. A user 
(naively) expects string algorithms to work as efficiently as 
possible, without hidden overhead. To tell the user later that 
s/he shouldn't _naively_ have used a certain algorithm provided 
by the library is a bit cynical. Having to redesign a code base 
because of hidden behavior is a big turn-off, and having to go 
through Phobos to find the hidden pitfalls is not the user's job.
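
To make "hidden behavior" concrete, here is a minimal sketch of 
the mismatch a user runs into. The helper names codePointCount 
and codeUnitCount are hypothetical and only serve the 
illustration; the sketch assumes a Phobos that provides 
std.utf.byCodeUnit.

```d
import std.range : ElementType, walkLength;
import std.utf : byCodeUnit;

// Indexing a string yields UTF-8 code units ...
static assert(is(typeof(string.init[0]) == immutable(char)));
// ... but the range primitives yield decoded code points.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: walks the string as a range, so it decodes.
size_t codePointCount(string s) { return s.walkLength; }

// Hypothetical helper: opts out of decoding explicitly.
size_t codeUnitCount(string s) { return s.byCodeUnit.walkLength; }

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);          // array length counts code units
    assert(codePointCount(s) == 5); // range length counts code points
    assert(codeUnitCount(s) == 6);
}
```

Nothing at the call site of walkLength tells the user that 
decoding happens; only the element type gives it away.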

> Also allow me to point that much of the slowdown can be 
> addressed tactically. The test c < 0x80 is highly predictable 
> (in ASCII-heavy text) and therefore easily speculated. We can 
> and we should arrange code to minimize impact.

And what if you deal with non-ASCII-heavy text? Does the user 
have to guess and micro-optimize for simple use cases?
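
For reference, the test under discussion has roughly the shape 
below. This is a sketch under my own assumptions, not the actual 
Phobos implementation; frontCodePoint is a hypothetical name, and 
only std.utf.decode is a real library call.

```d
import std.utf : decode;

// Hypothetical stand-in for what "front" has to do on a narrow string.
dchar frontCodePoint(const(char)[] s)
{
    assert(s.length > 0, "input must not be empty");

    immutable char c = s[0];
    if (c < 0x80)
        return c;         // fast path: a single ASCII code unit

    size_t i = 0;
    return decode(s, i);  // slow path: multi-byte sequence
}

void main()
{
    assert(frontCodePoint("abc") == 'a');            // fast path
    assert(frontCodePoint("\u00E9t\u00E9") == 0xE9); // slow path
}
```

The branch is cheap to predict only as long as the input is 
mostly ASCII. Text that starts most elements with a multi-byte 
sequence goes through the slow path for nearly every element.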

>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the 
> right thing instead of having the user wonder separately for 
> each case. These uses don't need decoding, and the standard 
> library correctly doesn't involve it (or if it currently does 
> it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level 
> even though inside it may choose to use code units when 
> admissible. Leaving such a decision to the library seems like a 
> wise thing to do.

But how is the user supposed to know without being a core 
contributor to Phobos? If using a library method that works well 
in one case can slow down your code in a slightly different case, 
something is wrong with the language/library design. For simple 
cases the burden shouldn't be on the user, or, if it is, s/he 
should be informed about it in order to be able to make 
well-informed decisions. Personally I wouldn't mind having to 
decide in each case what I want (provided I have a best practices 
cheat sheet :)), so I can get the best out of it. But having to 
keep guessing, testing and benchmarking each string-handling 
library function is not good at all.
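
As a sketch of what "deciding in each case" could look like, the 
two counting examples quoted above can be written with the 
iteration level spelled out at the call site. The function names 
countPunctuation and countNonControl are hypothetical; byCodeUnit 
and byDchar are assumed to be available from std.utf.

```d
import std.algorithm.searching : canFind, count;
import std.utf : byCodeUnit, byDchar;

// The punctuation set is pure ASCII, so code units are enough.
size_t countPunctuation(string s)
{
    return s.byCodeUnit.count!(c => "!()-;:,.?".canFind(c));
}

// Counting non-control characters needs whole code points.
size_t countNonControl(string s)
{
    return s.byDchar.count!(c => c >= 32);
}

void main()
{
    string s = "na\u00EFve, isn't it?\n";
    assert(s.length == 18);            // UTF-8 code units
    assert(countPunctuation(s) == 2);  // ',' and '?'
    assert(countNonControl(s) == 16);  // everything except '\n'
}
```

Written this way, the cost is visible where the call is made, 
and nobody has to read the Phobos source to find out which level 
an algorithm works on.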

[snip]