The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri May 27 09:40:21 PDT 2016


On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
> On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
> > > > However the following do require autodecoding:
> > > > 
> > > > s.walkLength
> > > > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > > > s.count!(c => c >= 32) // non-control characters
> > > > 
> > > > Currently the standard library operates at code point level even
> > > > though inside it may choose to use code units when admissible.
> > > > Leaving such a decision to the library seems like a wise thing
> > > > to do.
> > > 
> > > But how is the user supposed to know without being a core
> > > contributor to Phobos?
> > 
> > Misunderstanding. All examples work properly today because of
> > autodecoding. -- Andrei
> 
> They only work "properly" if you define "properly" as "in terms of
> code points". But working in terms of code points is usually wrong. If
> you want to count "characters", you need to work with graphemes.
> 
> https://dpaste.dzfl.pl/817dec505fd2

Exactly. And we just keep getting stuck on this point. It seems the
message just isn't getting through. The unfounded assumption continues
to be made that iterating by code point is somehow "correct" by
definition and therefore beyond challenge.

String handling, especially in the standard library, ought to be (1)
efficient where possible, and (2) as correct as possible (meaning,
corresponding most closely to user expectations -- the principle of
least surprise). If we can't have both, we should at least have one,
right? However, the way autodecoding is currently implemented, we have
neither.

Firstly, it is beyond clear that autodecoding adds a significant amount
of overhead, and because it's automatic, it applies to ALL string
processing in D.  The only way around it is to fight against the
standard library and use workarounds to bypass all that
meticulously-crafted autodecoding code, which raises the question of
why we're spending the effort on said code in the first place.
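
To make "workarounds" concrete, here's a minimal sketch using
std.utf.byCodeUnit, the usual escape hatch that makes range iteration
yield raw char code units instead of autodecoded dchars (the counts
below assume ASCII-only data, where the two happen to coincide):

    import std.algorithm.searching : count;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "hello, world";

        // Default range iteration over a string autodecodes: every
        // front is a dchar, decoded from UTF-8 on the fly.
        assert(s.count!(c => c == 'l') == 3);

        // byCodeUnit yields the raw char code units, skipping the
        // decode step entirely.
        assert(s.byCodeUnit.count!(c => c == 'l') == 3);
    }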

Secondly, it violates the principle of least surprise when the user,
given a string of, say, Korean text, discovers that s.count() *doesn't*
return the correct answer.  Oh, it's "correct", all right, if your
definition of correct is "number of Unicode code points", but to a
Korean user, such an answer is completely meaningless because it has
little correspondence with what he would perceive as the number of
"characters" in the string. It might as well be a random number and it
would be just as meaningful.  It is just as wrong as s.count() returning
the number of code units, except that in the current Euro-centric D
community the wrong instances are less often encountered and so are
often overlooked. But that doesn't change the fact that code that
assumes s.count() returns anything remotely meaningful to the user is
buggy. Autodecoding into code points only serves to hide the bugs.
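
For the record, here's a quick sketch of how far off the code point
count can be for Korean text written with conjoining jamo (the expected
values in the comments assume Phobos's byGrapheme follows the usual
UAX #29 Hangul clustering rules):

    import std.algorithm.searching : count;
    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "한글" spelled with conjoining jamo instead of precomposed
        // syllables: 2 user-perceived characters, 6 code points.
        string s = "\u1112\u1161\u11AB\u1100\u1173\u11AF";

        writeln(s.length);                 // 18 -- UTF-8 code units
        writeln(s.count);                  // 6  -- autodecoded code points
        writeln(s.byGrapheme.walkLength);  // 2  -- grapheme clusters
    }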

As has been said countless times already, autodecoding, as
currently implemented, is neither "correct" nor efficient. Iterating by
code point is much faster, but more prone to user mistakes; whereas
iterating by grapheme more often corresponds with user expectations but
performs quite poorly. The current implementation of autodecoding
represents the worst of both worlds: it is both inefficient *and* prone
to user mistakes, and worse yet, it serves to conceal such user mistakes
by giving the false sense of security that because we're iterating by
code points we're somehow magically "correct" by definition.

The fact of the matter is that if you're going to write Unicode string
processing code, you're gonna hafta know the dirty nitty-gritty of
Unicode strings, including the fine distinctions between code units,
code points, grapheme clusters, etc. Since this is required knowledge
anyway, why not just let the user worry about how to iterate over the
string? Let the user choose what best suits his application, whether
it's working directly with code units for speed or iterating over
grapheme clusters for correctness (in terms of visual "characters"),
instead of imposing the pessimal middle ground that's neither efficient
nor correct.
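
As a rough sketch of what that explicit choice looks like in practice
(using byCodeUnit and byGrapheme; the 'é' is deliberately written in
decomposed form so that all three levels give different answers):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // 'é' written as 'e' + U+0301 COMBINING ACUTE ACCENT.
        string s = "cafe\u0301";

        // The caller picks the level of abstraction that fits the job,
        // instead of getting code points by default.
        assert(s.byCodeUnit.walkLength == 6);  // code units: fastest, raw UTF-8
        assert(s.walkLength == 5);             // code points: the autodecoded default
        assert(s.byGrapheme.walkLength == 4);  // graphemes: user-perceived characters
    }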


T

-- 
Do not reason with the unreasonable; you lose by definition.

