The Case Against Autodecode

Thu Jun 2 13:01:54 PDT 2016

On 02.06.2016 21:05, Andrei Alexandrescu wrote:
> On 06/02/2016 01:54 PM, Marc Schütz wrote:
>> On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
>>> That's not going to work. A false impression created in this thread
>>> has been that code points are useless
>>
>> They _are_ useless for almost anything you can do with strings. The only
>> places where they should be used are std.uni and std.regex.
>>
>> Again: What is the justification for using code points, in your opinion?
>> Which practical tasks are made possible (and work _correctly_) if you
>> decode to code points, that don't already work with code units?
>
> Pretty much everything. Consider s and s1 string variables with possibly
> different encodings (UTF8/UTF16).
>
> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
> false without.
> ...

Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)

assert("ö".all!(c => c == 'ö')); // fails when the string literal is decomposed (o + U+0308)

> * s.any!(c => c == 'ö') works only with autodecoding. It returns always
> false without.
> ...

Doesn't work. Shouldn't compile.

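assert("ö".any!(c => c == 'ö')); // fails when the string literal is decomposed (o + U+0308)
assert(!"̃ö⃖".any!(c => c == 'ö')); // fails: the code point matches, but the character is not a plain ö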

> * s.balancedParens('〈', '〉') works only with autodecoding.
> ...

Doesn't work, e.g. s="⟨⃖". Shouldn't compile.

> * s.canFind('ö') works only with autodecoding. It returns always false
> without.
> ...

Doesn't work. Shouldn't compile.

assert("ö".canFind!(c => c == 'ö')); // fails when the string literal is decomposed (o + U+0308)

> * s.commonPrefix(s1) works only if they both use the same encoding;
> otherwise it still compiles but silently produces an incorrect result.
> ...

Doesn't work. Shouldn't compile.

> * s.count('ö') works only with autodecoding. It returns always zero
> without.
> ...

Doesn't work. Shouldn't compile.

> * s.countUntil(s1) is really odd - without autodecoding, whether it
> works at all, and the result it returns, depends on both encodings.  With
> autodecoding it always works and returns a number independent of the
> encodings.
> ...

Doesn't work. Shouldn't compile.

> * s.endsWith('ö') works only with autodecoding. It returns always false
> without.
> ...

Doesn't work. Shouldn't compile.

> * s.endsWith(s1) works only with autodecoding.

Doesn't work.

> Otherwise it compiles and
> runs but produces incorrect results if s and s1 have different encodings.
> ...

Shouldn't compile.

> * s.find('ö') works only with autodecoding. It never finds it without.
> ...

Doesn't work. Shouldn't compile.

> * s.findAdjacent is a very interesting one. It works with autodecoding,
> but without it it just does odd things.
> ...

Doesn't work. Shouldn't compile.

> * s.findAmong(s1) is also interesting. It works only with autodecoding.
> ...

Doesn't work. Shouldn't compile.

> * s.findSkip(s1) works only if s and s1 have the same encoding.
> Otherwise it compiles and runs but produces incorrect results.
> ...

Doesn't work. Shouldn't compile.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only
> if s and s1 have the same encoding.

Doesn't work.

> Otherwise they compile and run but produce incorrect results.
> ...

Shouldn't compile.

> * s.minCount, s.maxCount are unlikely to be terribly useful but with
> autodecoding it consistently returns the extremum numeric code unit
> regardless of representation. Without, they just return
> encoding-dependent and meaningless numbers.
>
> * s.minPos, s.maxPos follow a similar semantics.
> ...

Hardly a point in favour of autodecoding.

> * s.skipOver(s1) only works with autodecoding.

Doesn't work. Shouldn't compile.

> Otherwise it compiles and
> runs but produces incorrect results if s and s1 have different encodings.
> ...

Shouldn't compile.

> * s.startsWith('ö') works only with autodecoding. Otherwise it compiles
> and runs but produces incorrect results if s and s1 have different
> encodings.
> ...

Doesn't work. Shouldn't compile.

> * s.startsWith(s1) works only with autodecoding. Otherwise it compiles
> and runs but produces incorrect results if s and s1 have different
> encodings.
> ...

Doesn't work. Shouldn't compile.

> * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it
> will span the entire range.
> ...

Doesn't work. Shouldn't compile.

> ===
>
> The intent of autodecoding was to make std.algorithm work meaningfully
> with strings. As it's easy to see I just went through
> std.algorithm.searching alphabetically and found issues literally with
> every primitive in there. It's an easy exercise to go forth with the
> others.
> ...

Basically none of those work with UTF-32 either (assuming your goal is
to operate on characters): a code point is not a character. You need to
normalize and possibly iterate over graphemes. Also, many of those
functions have valid uses that intentionally operate on code units.
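
To make the distinction concrete, here is a minimal sketch. It assumes a
Phobos that provides std.uni.normalize, std.uni.byGrapheme and
std.utf.byCodeUnit; graphemeCount and containsNormalized are hypothetical
helper names used only for this illustration.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni;
import std.utf : byCodeUnit;

// Hypothetical helper: number of user-perceived characters (graphemes) in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: normalize the haystack to NFC, then search for c.
bool containsNormalized(string s, dchar c)
{
    return s.normalize!NFC.canFind(c);
}

void main()
{
    dchar oUmlaut = '\u00F6';
    string precomposed = "\u00F6";  // ö as the single code point U+00F6
    string decomposed  = "o\u0308"; // o followed by combining diaeresis

    // Autodecoding compares code points, so the decomposed spelling is missed.
    assert(precomposed.canFind(oUmlaut));
    assert(!decomposed.canFind(oUmlaut));

    // Normalizing first lets both spellings match.
    assert(containsNormalized(decomposed, oUmlaut));

    // Three different "lengths" for the same visible character.
    assert(decomposed.byCodeUnit.walkLength == 3); // UTF-8 code units
    assert(decomposed.walkLength == 2);            // code points (autodecoded)
    assert(graphemeCount(decomposed) == 1);        // graphemes
}
```

Normalization only helps where a precomposed form exists; for arbitrary
combining sequences, comparing graphemes is the more general approach.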

Ideally, the "shouldn't compile" remarks would be handled at the language
level: char, wchar and dchar should be mutually incompatible types, and
char[], wchar[] and dchar[] should be treated like any other array.
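
A small sketch of the current behaviour this refers to, as I understand
it: the compiler accepts comparisons between code units of different
widths and treats them as plain integer comparisons. anyUnitEquals is a
hypothetical helper written for this illustration, not a Phobos function.

```d
import std.string : representation;

// Hypothetical helper: compares each UTF-8 code unit of s with a UTF-16
// code unit. The mixed comparison is accepted as an integer comparison.
bool anyUnitEquals(string s, wchar w)
{
    foreach (char unit; s) // iterate code units, no decoding
        if (unit == w)
            return true;
    return false;
}

void main()
{
    // The mixed-width comparison compiles today.
    static assert(__traits(compiles, char.init == wchar.init));

    // "ö" is 0xC3 0xB6 in UTF-8 but 0x00F6 in UTF-16: no unit ever matches.
    assert(!anyUnitEquals("\u00F6", '\u00F6'));

    // ASCII matches only because the two encodings coincide there.
    assert(anyUnitEquals("a", 'a'));

    // representation views the string as a plain array of ubyte code units.
    assert("\u00F6".representation.length == 2);
}
```

If the three character types were incompatible, the comparison inside
anyUnitEquals would be rejected at compile time instead of silently
returning false for every non-ASCII character.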