The Case Against Autodecode

Thu Jun 2 12:05:44 PDT 2016

On 06/02/2016 01:54 PM, Marc Schütz wrote:
> On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
>> That's not going to work. A false impression created in this thread
>> has been that code points are useless
>
> They _are_ useless for almost anything you can do with strings. The only
> places where they should be used are std.uni and std.regex.
>
> Again: What is the justification for using code points, in your opinion?
> Which practical tasks are made possible (and work _correctly_) if you
> decode to code points, that don't already work with code units?

Pretty much everything. Consider string variables s and s1 with possibly
different encodings (UTF-8/UTF-16).

* s.all!(c => c == 'ö') works only with autodecoding. It always returns
false without it.

* s.any!(c => c == 'ö') works only with autodecoding. It always returns
false without it.

* s.balancedParens('〈', '〉') works only with autodecoding.

* s.canFind('ö') works only with autodecoding. It always returns false
without it.

* s.commonPrefix(s1) works only if they both use the same encoding; 
otherwise it still compiles but silently produces an incorrect result.

* s.count('ö') works only with autodecoding. It always returns zero
without it.

* s.countUntil(s1) is really odd - without autodecoding, whether it 
works at all, and the result it returns, depends on both encodings. With 
autodecoding it always works and returns a number independent of the 
encodings.

* s.endsWith('ö') works only with autodecoding. It always returns false
without it.

* s.endsWith(s1) works only with autodecoding. Otherwise it compiles and 
runs but produces incorrect results if s and s1 have different encodings.

* s.find('ö') works only with autodecoding. It never finds it without.

* s.findAdjacent is a very interesting one. It works with autodecoding,
but without it, it just does odd things.

* s.findAmong(s1) is also interesting. It works only with autodecoding.

* s.findSkip(s1) works only if s and s1 have the same encoding. 
Otherwise it compiles and runs but produces incorrect results.

* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only 
if s and s1 have the same encoding. Otherwise they compile and run but 
produce incorrect results.

* s.minCount, s.maxCount are unlikely to be terribly useful, but with
autodecoding they consistently return the extremum numeric code point
regardless of representation. Without it, they just return
encoding-dependent and meaningless numbers.

* s.minPos, s.maxPos follow similar semantics.

* s.skipOver(s1) only works with autodecoding. Otherwise it compiles and 
runs but produces incorrect results if s and s1 have different encodings.

* s.startsWith('ö') works only with autodecoding. It always returns false
without it.

* s.startsWith(s1) works only with autodecoding. Otherwise it compiles 
and runs but produces incorrect results if s and s1 have different 
encodings.

* s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it 
will span the entire range.
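
A minimal sketch of two of the cases above, for anyone who wants to try
them. It assumes std.utf.byCodeUnit as the stand-in for "without
autodecoding"; the helper names and the sample strings are hypothetical
and chosen only for illustration.

```d
import std.algorithm.searching : canFind, count, startsWith;
import std.utf : byCodeUnit;

// Hypothetical helper: search over decoded code points (autodecoding).
bool findsDecoded(string s, dchar c)
{
    return s.canFind(c);
}

// Hypothetical helper: search over raw UTF-8 code units, no decoding.
// byCodeUnit is an assumed stand-in for "autodecoding switched off".
bool findsRaw(string s, dchar c)
{
    return s.byCodeUnit.canFind(c);
}

// Hypothetical helper: prefix test across encodings, with decoding.
bool startsWithDecoded(string s, wstring s1)
{
    return s.startsWith(s1);
}

// Hypothetical helper: the same test on code units; char is compared
// directly against wchar, element by element.
bool startsWithRaw(string s, wstring s1)
{
    return s.byCodeUnit.startsWith(s1.byCodeUnit);
}

void main()
{
    string s = "schön";    // UTF-8: 'ö' is the two code units 0xC3 0xB6
    wstring s1 = "schö"w;  // UTF-16: 'ö' is the single code unit 0x00F6

    // Single character: found only when iterating code points.
    assert(findsDecoded(s, 'ö'));
    assert(!findsRaw(s, 'ö'));
    assert(s.count('ö') == 1);
    assert(s.byCodeUnit.count('ö') == 0);

    // Mixed encodings: both versions compile, only the first is right.
    assert(startsWithDecoded(s, s1));
    assert(!startsWithRaw(s, s1));
}
```

The code-unit versions compile without complaint because char, wchar and
dchar compare freely with one another; the wrong answers only show up at
run time.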

===

The intent of autodecoding was to make std.algorithm work meaningfully
with strings. As is easy to see, I just went through
std.algorithm.searching alphabetically and found issues with literally
every primitive in there. It's an easy exercise to continue with the
others.

Andrei