The Case Against Autodecode

Fri Jun 3 00:18:42 PDT 2016

On Thu, 2 Jun 2016 18:54:21 -0400,
Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> On 06/02/2016 06:10 PM, Marco Leise wrote:
> > On Thu, 2 Jun 2016 15:05:44 -0400,
> > Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> >  
> >> On 06/02/2016 01:54 PM, Marc Schütz wrote:  
> >>> Which practical tasks are made possible (and work _correctly_) if you
> >>> decode to code points, that don't already work with code units?  
> >>
> >> Pretty much everything.
> >>
> >> s.all!(c => c == 'ö')  
> >
> > Andrei, your ignorance is really starting to grind on
> > everyone's nerves.
> 
> Indeed there seem to be serious questions about my competence, basic 
> comprehension, and now knowledge.

That's not my general impression, but something is different
with this thread.

> I understand it is tempting to assume that a disagreement is caused by 
> the other simply not understanding the matter. Even if that were true 
> it's not worth sacrificing civility over it.

Civility has had us caught in a 36-page-long, tiresome
debate with us mostly talking past each other. I was being
impolite and can't say I regret it, because I prefer this
answer over the rest of the thread. It's more informed,
elaborate and conclusive.

> > If after 350 posts you still don't see
> > why this is incorrect: s.any!(c => c == 'o'), you must be
> > actively skipping the informational content of this thread.  
> 
> Is it 'o' with an umlaut or without?
>
> At any rate, consider s of type string and x of type dchar.
> The dchar type is defined as "a Unicode code point", or at
> least my understanding is that this has been a reasonable
> definition to operate with in the D language ever since its first
> release. Also in the D language, the various string types
> char[], wchar[] etc. with their respective qualified
> versions are meant to hold Unicode strings with one of the
> UTF8, UTF16, and UTF32 encodings.
>
> Following these definitions, it stands to reason to infer that the call 
> s.find(c => c == x) means "find the code point x in string s and return 
> the balance of s positioned there". It's prima facie application of the 
> definitions of the entities involved.
> 
> Is this the only possible or recommended meaning? Most likely not, viz. 
> the subtle cases in which a given grapheme is represented via either one 
> or multiple code points by means of combining characters. Is it the best 
> possible meaning? It's even difficult to define what "best" means 
> (fastest, covering most languages, etc).
> 
> I'm not claiming that meaning is the only possible, the only 
> recommended, or the best possible. All I'm arguing is that it's not 
> retarded, and within a certain universe confined to operating at code 
> point level (which is reasonable per the definitions of the types 
> involved) it can be considered correct.
> 
> If at any point in the reasoning above some rampant ignorance comes 
> about, please point it out.

No, it's pretty close now. We can all agree that there is no
"best" way, only different use cases. Just defining Phobos to
work on code points gives the false illusion that it does the
correct thing in all use cases - after all D claims to support
Unicode. But if you want to iterate over visual letters
(graphemes) it is incorrect, and when you work on
ASCII-structured formats (JSON, XML, paths, Warp, ...) it is
needlessly slow. Then there is the burden of explaining the
different default iteration schemes of foreach vs. the range
API (no big deal, just not easily justified) and the cost of
implementation when dealing with char[]/wchar[].
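
To make the mismatch concrete, here is a minimal sketch of how one
visible letter yields three different answers depending on how it is
iterated. It assumes a Phobos version with auto-decoding still in
place; the variable names are only for illustration.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;

void main()
{
    // One visible letter: 'o' followed by U+0308 COMBINING DIAERESIS.
    string s = "o\u0308";

    // foreach over a string defaults to code units (char).
    size_t units;
    foreach (c; s) ++units;
    assert(units == 3);            // 1 byte for 'o' + 2 bytes for U+0308
    assert(s.length == 3);

    // The range API auto-decodes to code points (dchar).
    assert(s.walkLength == 2);

    // Code point comparison does not find the letter a reader sees...
    dchar oUmlaut = '\u00F6';      // precomposed form of the same letter
    assert(!s.canFind(oUmlaut));
    // ...but does find a bare 'o' that is not visible as a letter of its own.
    assert(s.canFind('o'));
}
```

Neither result is wrong at the code point level; it just is not the
answer someone thinking in visual letters expects.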

From this observation we concluded that decoding should be
opt-in and that when we need it, it should be a conscious
decision. Unicode is quite complex and learning about the
difference between code points and grapheme clusters when
segmenting strings will benefit code quality.
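
As a rough sketch of what such a conscious decision can look like
with adapters that already exist in Phobos (std.utf.byCodeUnit,
std.utf.byDchar, std.uni.byGrapheme): the three helper functions
below are hypothetical names made up for this example, not part of
any library.

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: counts '/' without decoding. This is safe in
// UTF-8 because ASCII bytes never occur inside a multi-byte sequence.
size_t countSeparators(string path)
{
    return path.byCodeUnit.count('/');
}

// Hypothetical helper: explicitly asks for code points.
size_t countCodePoints(string s)
{
    return s.byDchar.walkLength;
}

// Hypothetical helper: explicitly asks for grapheme clusters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // Non-ASCII content does not disturb the code-unit search.
    assert(countSeparators("/home/j\u00F6rg/notes.txt") == 3);

    string s = "o\u0308";          // 'o' + combining diaeresis
    assert(countCodePoints(s) == 2);
    assert(countGraphemes(s) == 1);
}
```

The point is that each call site states which level it operates on,
instead of inheriting one from the string type.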

As for the question of whether multi-code-point graphemes ever
appear in the wild: OS X is known to use NFD on its native file
system and there is a hint on Wikipedia that some symbols from
Thai or Hindi's Devanagari need them:
https://en.wikipedia.org/wiki/UTF-8#Disadvantages
Some form of Lithuanian seems to have a use for them, too:
http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf
Aside from those there is nothing generally wrong about
decomposed letters appearing in strings, even though the
use of NFC is encouraged.
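
A small sketch of why this matters for comparisons, using
std.uni.normalize: the same letter in NFC and NFD compares unequal
until both sides are brought into one form. The helper name
equalAfterNFC is an assumption made up for this example.

```d
import std.uni : NFC, NFD, normalize;

// Hypothetical helper: compares two strings after normalizing both to NFC.
bool equalAfterNFC(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00F6";      // precomposed letter (NFC)
    string decomposed = "o\u0308";   // 'o' + combining diaeresis (NFD)

    // Same visible letter, different code point sequences.
    assert(composed != decomposed);

    // Converting between the forms makes them comparable.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
    assert(equalAfterNFC(composed, decomposed));
}
```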

> > […harsh tone removed…] in the end we have to assume you
> > will make a decisive vote against any PR with the intent
> > to remove auto-decoding from Phobos.  
> 
> This seems to assume I have some vesting in the position
> that makes it independent of facts. That is not the case. I
> do what I think is right to do, and you do what you think is
> right to do.

Your vote outweighs that of many others for better or worse.
When a decision needs to be made and the community is divided,
we need you or Walter or anyone who is invested in the matter
to cast a ruling vote. However, when several dozen people
support an idea after discussion, having heard everyone's
arguments with practically no objections, and you overrule
everyone, tensions build up. I welcome the idea of delegating
some of the tasks to smaller groups. No single person is
knowledgeable in every area of CS, and both a bus factor of 1
and too big a group can hinder decision making.
It would help to know for the future whether you see your
role as one with veto powers, or whether you could accept
handing responsibility for decisions over to the community,
and if so under what conditions.

> > Your so called vocal minority is actually D's panel of Unicode
> > experts who understand that auto-decoding is a false ally and
> > should be on the deprecation track.  
> 
> They have failed to convince me. But I am more convinced than before 
> that RCStr should not offer a default mode of iteration. I think its 
> impact is lost in this discussion, because once it's understood RCStr 
> will become D's recommended string type, the entire matter becomes moot.
>
> > Remember final-by-default? You promised, that your objection
> > about breaking code means that D2 will only continue to be
> > fixed in a backwards compatible way, be it the implementation
> > of shared or whatever else. Yet months later you opened a
> > thread with the title "inout must go". So that must have been
> > an appeasement back then. People don't forget these things
> > easily and RCStr seems to be a similar distraction,
> > considering we haven't looked into borrowing/scoped enough and
> > you promise wonders from it.  
> 
> What the hell is this, digging dirt on me? Paying back debts? Please 
> stop that crap.

No, that was my actual impression. I must apologize for
generalizing it to other people though. I welcome the RCStr
project and hope it will be good. At this time though it is
not yet fleshed out and we can't tell how fast its adoption
will be. Remember that DIPs on scope and RC have tended in the
past to go into long debates with unclear outcomes. Unlike
this thread, which may be the first in D's forum history with
such a high agreement across the board.

> Andrei

-- 
Marco