The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Thu May 26 16:23:16 PDT 2016


On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
[...]
> On 05/12/2016 04:15 PM, Walter Bright wrote:
[...]
> > 4. Autodecoding is slow and has no place in high speed string processing.
> 
> I would agree only with the amendment "...if used naively", which is
> important. Knowledge of how autodecoding works is a prerequisite for
> writing fast string code in D. Also, little code should deal with one
> code unit or code point at a time; instead, it should use standard
> library algorithms for searching, matching etc. When needed, iterating
> every code unit is trivially done through indexing.
> 
> Also allow me to point out that much of the slowdown can be addressed
> tactically. The test c < 0x80 is highly predictable (in ASCII-heavy
> text) and therefore easily speculated. We can and we should arrange
> code to minimize impact.

General Unicode strings have a lot of non-ASCII characters. Why are we
only optimizing for the ASCII case?


> > 5. Very few algorithms require decoding.
> 
> The key here is leaving it to the standard library to do the right
> thing instead of having the user wonder separately for each case.
> These uses don't need decoding, and the standard library correctly
> doesn't involve it (or if it currently does it has a bug):
> 
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
> 
> However the following do require autodecoding:
> 
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters

Question: what should count return, given a string containing (1)
combining diacritics, (2) Korean text (where multiple conjoining jamo
may form a single syllable block), or (3) zero-width spaces?


> Currently the standard library operates at code point level even
> though inside it may choose to use code units when admissible. Leaving
> such a decision to the library seems like a wise thing to do.

The problem is that often such decisions can only be made by the user,
because they depend on what the user wants to accomplish.  What should
count return, given some Unicode string?  If the user wants to
determine the size of a buffer (e.g., to store a string minus some
characters to be stripped), then count should return the byte count.
If the user wants to count the number of matching visual characters,
then count should return the number of graphemes. If the user wants to
determine the visual width of the (filtered) string, then count should
not be used at all; that calls for a font-metrics algorithm.  (I can't
think of a practical use case where you'd actually need to count code
points(!).)

Having the library arbitrarily choose one use case over the others
(especially one that seems the least applicable to practical situations)
just doesn't seem right to me at all. Rather, the user ought to specify
what exactly is to be counted, i.e., s.byCodeUnit.count(),
s.byCodePoint.count(), or s.byGrapheme.count().
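
For illustration, here is a minimal sketch of the three explicit
options, using Phobos ranges (walkLength stands in for count here;
plain s.walkLength yields the code-point count because of
autodecoding):

	import std.range : walkLength;
	import std.stdio : writeln;
	import std.uni : byGrapheme;
	import std.utf : byCodeUnit;

	void main()
	{
		// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code
		// points that render as the single grapheme "é".
		string s = "caf\u0065\u0301";

		writeln(s.byCodeUnit.walkLength); // 6 UTF-8 code units
		writeln(s.walkLength);            // 5 code points (autodecoded)
		writeln(s.byGrapheme.walkLength); // 4 graphemes: c, a, f, é
	}

Three different answers for the same string, and which one is "right"
depends entirely on what the caller is trying to accomplish.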


[...]
> > 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> > importing std.array one way or another, and then autodecode is there.
> 
> Turning off autodecoding is as easy as inserting .representation after any
> string. (Not to mention using indexing directly.)

Therefore, instead of:

	myString.splitter("abc").joiner("def").count;

we have to write:

	myString.representation
		.splitter("abc".representation)
		.joiner("def".representation)
		.count;

Great.


[...]
> Second, it's as it should be. The entire scaffolding rests on the
> notion that char[] is distinguished from ubyte[] by having UTF-8 code
> units, not arbitrary bytes. It seems that many arguments against
> autodecoding are in fact arguments in favor of eliminating virtually
> all distinctions between char[] and ubyte[].

That is a strawman. We are not arguing for eliminating the distinction
between char[] and ubyte[]. Rather, the complaint is that autodecoding
imposes a constant overhead on string processing that is often
*unnecessary*. Many string operations don't *need* to autodecode, and
even those that seem to need it are often better implemented
differently.

For example, filtering a string by a non-ASCII character can actually
be done via substring search: expand the non-ASCII character into its
1 to 4 code units, then do the equivalent of C's strstr().  Thanks to
the way UTF-8 is designed, this has no false positives -- the encoding
of one character never appears inside the encoding of another -- and
it eliminates the overhead of decoding every single character.  In
implementation terms, you could, for instance, linearly scan the
string for the character's first byte without decoding anything, which
is a lot faster than decoding every character and comparing it against
the target.  Only when the first byte matches do you need the slightly
more expensive substring comparison.
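
Here is a minimal sketch of that first-byte scan. (findUtf8 is a
hypothetical helper written out for illustration; in practice,
std.algorithm.find over .representation does the same job.)

	// Byte offset of needle in haystack, or -1. Operates purely on
	// UTF-8 code units; nothing is ever decoded.
	ptrdiff_t findUtf8(const(ubyte)[] haystack, const(ubyte)[] needle)
	{
		if (needle.length == 0)
			return 0;
		if (needle.length > haystack.length)
			return -1;
		foreach (i; 0 .. haystack.length - needle.length + 1)
		{
			// Cheap first-byte test; most positions fail right here.
			if (haystack[i] != needle[0])
				continue;
			// Only a first-byte match pays for the full comparison.
			if (haystack[i .. i + needle.length] == needle)
				return i;
		}
		return -1;
	}

	unittest
	{
		import std.string : representation;
		// 'à' begins at byte offset 4 in "déjà vu" (d=1, é=2, j=1).
		assert(findUtf8("déjà vu".representation, "à".representation) == 4);
	}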

Similarly, splitter does not need to operate on code points at all; it
is unnecessarily slow that way. Most uses of splitter have lots of
data between delimiters, which means most of the work done by
autodecoding is wasted.  Instead, splitter should just scan for the
substring to split on -- again, the design of UTF-8 guarantees there
will be no false positives -- and spend effort only where it's
actually needed: at the delimiters, not on the data in between.
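
As a sketch of that idea: Phobos' splitter already does a pure
code-unit scan if you hand it the raw representation, so nothing
between the delimiters is ever decoded:

	import std.algorithm.iteration : splitter;
	import std.stdio : writeln;
	import std.string : representation;

	void main()
	{
		// Split on the ASCII comma at the code-unit level; the Greek
		// letters in between are sliced over, never decoded.
		foreach (piece; "α,β,γ".representation.splitter(cast(ubyte) ','))
			writeln(cast(string) piece); // each slice is still valid UTF-8
	}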

The same could be said of joiner, and many other common string
algorithms.

There aren't many algorithms that actually need to decode; decoding
should be restricted to them, rather than being an overhead applied
across the board.


[...]
> Overall, I think the one way to make real steps forward in improving
> string processing in the D language is to give a clear answer of what
> char, wchar, and dchar mean.
[...]

We already have a clear definition: char, wchar, and dchar are UTF-8,
UTF-16, and UTF-32 code units, respectively, and the last of these is
also a Unicode code point. That's all there is to it.
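
Or, stated directly in code:

	// These definitions are already pinned down by the language:
	static assert(char.sizeof == 1);  // UTF-8 code unit
	static assert(wchar.sizeof == 2); // UTF-16 code unit
	static assert(dchar.sizeof == 4); // UTF-32 code unit == a code point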

If we want Phobos to truly take advantage of the fact that char[],
wchar[], and dchar[] contain Unicode strings, we need to stop
navel-gazing at what the byte representations and bits mean, and look
at the bigger picture.  Consider char[] as a unit in itself, a
complete Unicode string -- the actual code units don't really matter;
they are just an implementation detail. What you want is for a Phobos
algorithm to be able to decide: OK, to produce output X it's faster to
do substring scanning, and to produce output Y it's better to decode
first.  In other words, whether or not to decode ought to be a
decision made at the algorithm level (or higher), depending on the
need at hand. It should not be baked into the low-level internals of
how strings are handled, such that higher-level algorithms are
straitjacketed and forced to work with the decoded stream even when
they don't actually *need* decoding to do what they want.

In the cases where Phobos is unable to make a decision (e.g., what
should count return -- which depends on what the user is trying to
accomplish), it should be left to the user. The user shouldn't have to
work against a default setting that only works for a subset of use
cases.


T

-- 
Without geometry, life would be pointless. -- VS

