The Case Against Autodecode
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Fri May 27 04:19:33 PDT 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
wrote:
[snip]
>
> I would agree only with the amendment "...if used naively",
> which is important. Knowledge of how autodecoding works is a
> prerequisite for writing fast string code in D. Also, little
> code should deal with one code unit or code point at a time;
> instead, it should use standard library algorithms for
> searching, matching etc. When needed, iterating every code unit
> is trivially done through indexing.
I disagree. "If used naively" shouldn't be the default. A user
(naively) expects string algorithms to work as efficiently as
possible, without hidden overhead. To tell the user later that s/he
shouldn't _naively_ have used a certain algorithm provided by the
library is a bit cynical. Having to redesign a code base because
of hidden behavior is a big turn-off, and having to go through Phobos
to determine where the hidden pitfalls are is not the user's job.
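
To make the hidden behavior concrete, here is a minimal sketch (the
sample string is my own, nothing here is taken from a real code base).
It shows the same string answering differently depending on whether it
is treated as an array or as a range:

```d
import std.range : front, walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units

    // Array view: indexing and .length work on code units, no decoding.
    assert(s.length == 6);
    static assert(is(typeof(s[0]) == immutable(char)));

    // Range view: front decodes, so range algorithms see code points.
    static assert(is(typeof(s.front) == dchar));
    assert(s.walkLength == 5);

    // Explicit opt-out: a range over the raw code units.
    assert(s.byCodeUnit.walkLength == 6);
}
```

Nothing at the call site of walkLength hints that a decode happens
for every element.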
> Also allow me to point that much of the slowdown can be
> addressed tactically. The test c < 0x80 is highly predictable
> (in ASCII-heavy text) and therefore easily speculated. We can
> and we should arrange code to minimize impact.
And what if you deal with non-ASCII heavy text? Does the user
have to guess and micro-optimize for simple use cases?
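
A rough sketch of the kind of fast path being discussed, assuming a
hand-written loop (countCodePoints is a hypothetical helper of mine,
not how Phobos implements anything). The c < 0x80 branch only pays off
when most code units are ASCII:

```d
import std.utf : decode;

// Hypothetical helper: counts code points, skipping the decoder for ASCII.
size_t countCodePoints(string s)
{
    size_t n = 0;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
            ++i;          // ASCII: a single code unit, no decoding needed
        else
            decode(s, i); // multi-unit sequence: decode advances i past it
        ++n;
    }
    return n;
}

void main()
{
    // ASCII text: the fast branch is taken for every character.
    assert(countCodePoints("hello") == 5);

    // Cyrillic text: 12 code units, 6 code points, and every
    // character goes through the slow branch.
    assert("привет".length == 12);
    assert(countCodePoints("привет") == 6);
}
```

For text like the second string the "highly predictable" test predicts
the slow path every time, so the speculation buys nothing.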
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the
> right thing instead of having the user wonder separately for
> each case. These uses don't need decoding, and the standard
> library correctly doesn't involve it (or if it currently does
> it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level
> even though inside it may choose to use code units when
> admissible. Leaving such a decision to the library seems like a
> wise thing to do.
But how is the user supposed to know without being a core
contributor to Phobos? If using a library method that works well
in one case can slow down your code in a slightly different case,
something is wrong with the language/library design. For simple
cases the burden shouldn't be on the user, or, if it is, s/he
should be informed about it in order to be able to make
well-informed decisions. Personally I wouldn't mind having to
decide in each case what I want (provided I have a best practices
cheat sheet :)), so I can get the best out of it. But to keep
guessing, testing and benchmarking each string handling library
function is not good at all.
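
As an illustration of the kind of explicit choice I mean, here is a
minimal sketch built on the punctuation example quoted above (isPunct
and the sample string are my own, purely illustrative). Counting
punctuation is safe at code unit level because every character in the
set is ASCII; counting non-punctuation is not:

```d
import std.algorithm.searching : count;
import std.utf : byCodeUnit;

// Illustrative predicate covering the punctuation set from the quoted example.
bool isPunct(dchar c)
{
    switch (c)
    {
        case '!', '(', ')', '-', ';', ':', ',', '.', '?':
            return true;
        default:
            return false;
    }
}

void main()
{
    string s = "Grüße, Welt!"; // 12 code points, 14 code units

    // Punctuation: code units and code points give the same answer.
    assert(s.byCodeUnit.count!(c => isPunct(c)) == 2);
    assert(s.count!(c => isPunct(c)) == 2);

    // Non-punctuation: only the decoded (code point) count is correct.
    assert(s.count!(c => !isPunct(c)) == 10);
    // Code units overcount, because 'ü' and 'ß' take two units each.
    assert(s.byCodeUnit.count!(c => !isPunct(c)) == 12);
}
```

With the choice spelled out at the call site, it is at least visible
which version decodes and which does not.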
[snip]