The Case Against Autodecode

deadalnix via Digitalmars-d digitalmars-d at puremagic.com
Thu Jun 2 12:34:43 PDT 2016


On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
> Pretty much everything. Consider s and s1 string variables with 
> possibly different encodings (UTF8/UTF16).
>
> * s.all!(c => c == 'ö') works only with autodecoding. It 
> returns always false without.
>

False. Many characters can be represented by more than one 
sequence of codepoints. For instance, ê can be a single 
precomposed codepoint (U+00EA) or the letter e followed by a 
combining circumflex (U+0302). ö is another such character.
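The two representations are easy to check in any Unicode-aware 
environment. A quick illustration in Python (the same point holds 
for D's autodecoded ranges of dchar):

```python
import unicodedata

# 'ö' as a single precomposed codepoint (U+00F6)
composed = "\u00f6"
# 'ö' as 'o' followed by a combining diaeresis (U+0308)
decomposed = "o\u0308"

# A codepoint-by-codepoint comparison sees two different sequences
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Only after normalizing (here to NFC) do they compare equal
assert unicodedata.normalize("NFC", decomposed) == composed
```

So a per-codepoint predicate like c == 'ö' misses the decomposed 
form entirely, no matter how the string is decoded.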

> * s.any!(c => c == 'ö') works only with autodecoding. It 
> returns always false without.
>

False. (While this is pretty much the same as the previous point, 
one can produce as many examples as desired by tweaking the same 
one into endless variations.)

> * s.balancedParens('〈', '〉') works only with autodecoding.
>

Not sure, so I'll say OK.

> * s.canFind('ö') works only with autodecoding. It returns 
> always false without.
>

False.

> * s.commonPrefix(s1) works only if they both use the same 
> encoding; otherwise it still compiles but silently produces an 
> incorrect result.
>

False.

> * s.count('ö') works only with autodecoding. It returns always 
> zero without.
>

False.

> * s.countUntil(s1) is really odd - without autodecoding, 
> whether it works at all, and the result it returns, depends on 
> both encodings. With autodecoding it always works and returns a 
> number independent of the encodings.
>

False.

> * s.endsWith('ö') works only with autodecoding. It returns 
> always false without.
>

False.

> * s.endsWith(s1) works only with autodecoding. Otherwise it 
> compiles and runs but produces incorrect results if s and s1 
> have different encodings.
>

False.

> * s.find('ö') works only with autodecoding. It never finds it 
> without.
>

False.

> * s.findAdjacent is a very interesting one. It works with 
> autodecoding, but without it it just does odd things.
>

Not sure, so I'll say OK, though I strongly suspect that, like 
the others, this will only work if the strings are normalized.

> * s.findAmong(s1) is also interesting. It works only with 
> autodecoding.
>

False.

> * s.findSkip(s1) works only if s and s1 have the same encoding. 
> Otherwise it compiles and runs but produces incorrect results.
>

False.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) 
> work only if s and s1 have the same encoding. Otherwise they 
> compile and run but produce incorrect results.
>

False.

> * s.minCount, s.maxCount are unlikely to be terribly useful but 
> with autodecoding it consistently returns the extremum numeric 
> code unit regardless of representation. Without, they just 
> return encoding-dependent and meaningless numbers.
>

Not sure, so I'll say OK.

> * s.minPos, s.maxPos follow a similar semantics.
>

Not sure, so I'll say OK.

> * s.skipOver(s1) only works with autodecoding. Otherwise it 
> compiles and runs but produces incorrect results if s and s1 
> have different encodings.
>

False.

> * s.startsWith('ö') works only with autodecoding. Otherwise it 
> compiles and runs but produces incorrect results if s and s1 
> have different encodings.
>

False.

> * s.startsWith(s1) works only with autodecoding. Otherwise it 
> compiles and runs but produces incorrect results if s and s1 
> have different encodings.
>

False.

> * s.until!(c => c == 'ö') works only with autodecoding. 
> Otherwise, it will span the entire range.
>

False.

> ===
>
> The intent of autodecoding was to make std.algorithm work 
> meaningfully with strings. As it's easy to see I just went 
> through std.algorithm.searching alphabetically and found issues 
> literally with every primitive in there. It's an easy exercise 
> to go forth with the others.
>
>
> Andrei

I mean, what a trainwreck. Your examples say it all, don't they? 
Almost none of them would work without normalizing the strings 
first, and that is the point you've been refusing to hear so far. 
Autodecoding doesn't pay for itself, because it is unable to do 
what it is supposed to do in the general case.
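To make the failure mode concrete, here is a minimal Python sketch 
of the same problem the canFind/find examples above run into: a 
codepoint-level search misses the decomposed form unless both 
sides are normalized to the same form first.

```python
import unicodedata

haystack = "Zo\u0308e"   # "Zöe" with a decomposed 'ö' (o + U+0308)
needle = "\u00f6"        # precomposed 'ö' (U+00F6)

# Searching over codepoints -- which is all that decoding buys you --
# fails to find the character
assert needle not in haystack

# Normalizing both sides to NFC first makes the search succeed
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(needle) in nfc(haystack)
```

Decoding changed the unit of iteration, but not the fact that one 
user-perceived character can span several of those units.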

Really, there is not much you can do with anything Unicode 
related without first going through normalization. If you want 
anything more than substring search or the like, you'll also need 
collation, which is locale dependent (for sorting, for instance).

Supporting Unicode, IMO, would mean providing facilities to 
normalize (preferably lazily, as a range), to manage collations, 
and so on. Decoding to codepoints just doesn't cut it.
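A lazily normalizing range could look roughly like the following 
Python generator. This is a simplified sketch under a big 
assumption: it groups a base character with its trailing combining 
marks and emits the NFC form of each group, which is not the full 
NFC algorithm (it would mishandle sequences, such as Hangul jamo, 
that compose between characters of combining class zero).

```python
import unicodedata

def lazy_nfc(chars):
    """Lazily yield NFC-normalized chunks from an iterable of
    characters, buffering each base character together with its
    trailing combining marks before normalizing the group."""
    buf = ""
    for ch in chars:
        # A new base character (combining class 0) closes the
        # previous combining sequence, if any
        if buf and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize("NFC", buf)
            buf = ch
        else:
            buf += ch
    if buf:
        yield unicodedata.normalize("NFC", buf)

# 'o' + combining diaeresis streams out as the single codepoint 'ö'
assert "".join(lazy_nfc("fo\u0308o")) == "f\u00f6o"
```

The point is that normalization composes naturally with lazy 
range-style iteration; nothing forces an eager pass over the 
whole string.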

As a result, any algorithm that needs to support strings must 
either fight against the language because it doesn't need 
decoding, use decoding and accept being incorrect on 
non-normalized input, or do the correct thing by itself (which 
also requires working against the language).


