The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Fri May 27 03:56:06 PDT 2016


On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
> This might be a good time to discuss this a tad further. I'd 
> appreciate if the debate stayed on point going forward. Thanks!
>
> My thesis: the D1 design decision to represent strings as 
> char[] was disastrous and probably one of the largest 
> weaknesses of D1. The decision in D2 to use immutable(char)[] 
> for strings is a vast improvement but still has a number of 
> issues. The approach to autodecoding in Phobos is an 
> improvement on that decision.

It is not, which has been shown by various posts in this thread. 
Iterating by code points is at least as wrong as iterating by 
code units; it can be argued it is worse because it sometimes 
makes the fact that it's wrong harder to detect.
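
For illustration, a minimal example using today's Phobos: a 
single user-perceived character can span several code points, so 
counting code points is no more "correct" than counting code 
units -- it is merely wrong less often.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "é" as 'e' + U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character, two code points, three UTF-8 code units.
    string s = "e\u0301";

    writeln(s.length);                // 3 (code units)
    writeln(s.walkLength);            // 2 (code points, via autodecoding)
    writeln(s.byGrapheme.walkLength); // 1 (grapheme)
}
```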

> The insistent shunning of a user-defined type to represent 
> strings is not good and we need to rid ourselves of it.

While this may be true, it has nothing to do with auto decoding. 
I assume you would want such a user-defined string type to 
auto-decode as well, right?

>
> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the 
> right thing instead of having the user wonder separately for 
> each case. These uses don't need decoding, and the standard 
> library correctly doesn't involve it (or if it currently does 
> it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')

Yes.

> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

Ideally yes, but this is a special case that cannot be detected 
by `count`.
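
To spell out why this particular predicate happens to be safe at 
code-unit level (a sketch; the needle set is pure ASCII, and no 
ASCII byte can occur inside a multi-byte UTF-8 sequence):

```d
import std.algorithm : canFind, count;
import std.utf : byCodeUnit;

void main()
{
    string s = "Hello, wörld!";

    // Counting matching code units and counting matching code points
    // agree here -- but only because the needles are all ASCII, which
    // `count` cannot deduce from an opaque predicate.
    assert(s.count!(c => "!()-;:,.?".canFind(c))
        == s.byCodeUnit.count!(c => "!()-;:,.?".canFind(c)));
}
```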

>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters

No, they do not need _auto_ decoding; they need a decision _by 
the user_ about what they should be decoded to. Code units? Code 
points? Graphemes? Words? Lines?
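
A sketch of what that explicit decision looks like for three of 
those choices, using ranges Phobos already ships (`byCodeUnit` 
and `byDchar` in std.utf, `byGrapheme` in std.uni):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "noe\u0308l"; // "noël", with a combining diaeresis

    // The caller states the unit of iteration instead of inheriting
    // the code-point default:
    assert(s.byCodeUnit.walkLength == 6); // UTF-8 code units
    assert(s.byDchar.walkLength == 5);    // code points
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```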

>
> Currently the standard library operates at code point level

Because it auto decodes.

> even though inside it may choose to use code units when 
> admissible. Leaving such a decision to the library seems like a 
> wise thing to do.

No one wants to take that second part away. For example, `find` 
can provide an overload that accepts `const(char)[]` directly, 
while `walkLength` cannot, and must instead require a decision by 
the caller.
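
A sketch of how such an overload could work internally 
(`findSubstring` is a hypothetical name for illustration; the 
point is that UTF-8 is self-synchronizing, so a valid needle can 
only match at a code-unit boundary):

```d
import std.algorithm : find;
import std.string : representation;

// Hypothetical helper: substring search directly on code units,
// no decoding involved.
const(char)[] findSubstring(const(char)[] haystack, const(char)[] needle)
{
    auto rest = haystack.representation.find(needle.representation);
    // Map the ubyte[] result back onto the original char slice.
    return haystack[$ - rest.length .. $];
}

void main()
{
    assert(findSubstring("über-abc-def", "abc") == "abc-def");
}
```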

>> 7. Autodecode cannot be used with unicode path/filenames, because 
>> it is legal (at least on Linux) to have invalid UTF-8 as 
>> filenames. It turns out in the wild that pure Unicode is not 
>> universal - there's lots of dirty Unicode that should remain 
>> unmolested, and autodecode does not play well with that.
>
> If paths are not UTF-8, then they shouldn't have string type 
> (instead use ubyte[] etc). More on that below.

I believe a library type would be more appropriate than bare 
`ubyte[]`. It should provide conversion between the OS encoding 
(which can be detected automatically) and UTF strings, for 
example. And it should be used for any "strings" that come from 
outside the program, like main's arguments, env variables...
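
A rough sketch of what such a type might look like (all names 
here -- `OsPath`, `fromBytes`, `toUTF8` -- are invented for 
illustration, assuming POSIX-style paths that are arbitrary byte 
sequences):

```d
import std.utf : validate;

struct OsPath
{
    private immutable(ubyte)[] bytes; // raw OS bytes, possibly invalid UTF-8

    // Wrap raw bytes exactly as received from the OS (argv, readdir, ...).
    static OsPath fromBytes(immutable(ubyte)[] raw)
    {
        return OsPath(raw);
    }

    // Explicit, checked conversion to a UTF-8 string: fails loudly up
    // front instead of throwing mid-iteration the way autodecoding does.
    string toUTF8() const
    {
        auto s = cast(string) bytes;
        validate(s); // throws std.utf.UTFException on invalid UTF-8
        return s;
    }
}
```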

>> 9. Autodecode cannot be turned off, i.e. it isn't practical to 
>> avoid importing std.array one way or another, and then autodecode 
>> is there.
>
> Turning off autodecoding is as easy as inserting 
> .representation after any string. (Not to mention using 
> indexing directly.)

This would no longer work if char[] and char ranges were to be 
treated identically.
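
For reference, here is how that escape hatch behaves today; note 
that it relies on the string being an actual array, which is 
exactly why it does not generalize to arbitrary char ranges:

```d
import std.range : walkLength;
import std.range.primitives : isRandomAccessRange;
import std.string : representation;

void main()
{
    string s = "größer";

    // Autodecoded view: a bidirectional range of dchar, not random access.
    static assert(!isRandomAccessRange!string);
    assert(s.walkLength == 6); // 6 code points

    // Raw view: an array of code units, with array semantics restored.
    static assert(isRandomAccessRange!(typeof(s.representation)));
    assert(s.representation.length == 8); // 8 UTF-8 code units
}
```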

>
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key 
>> benefit of being arrays in the first place.
>
> First off, you always have the option with .representation. 
> That's a great name because it gives you the type used to 
> represent the string - i.e. an array of integers of a specific 
> width.
>
> Second, it's as it should be. The entire scaffolding rests on the 
> notion that char[] is distinguished from ubyte[] by having UTF8 
> code units, not arbitrary bytes. It seems that many arguments 
> against autodecoding are in fact arguments in favor of 
> eliminating virtually all distinctions between char[] and 
> ubyte[]. Then the natural question is, what _is_ the difference 
> between char[] and ubyte[] and why do we need char as a 
> separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous 
> answer. What is the purpose of char, wchar, and dchar? My 
> current understanding is that they're justified as pretty much 
> indistinguishable in primitives and behavior from ubyte, 
> ushort, and uint respectively, but they reflect a loose 
> subjective intent from the programmer that they hold actual UTF 
> code units. The core language does not enforce such, except it 
> does special things in random places like for loops (any others?)

Agreed.
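
One concrete instance of that for-loop special-casing: the 
element type declared in the `foreach` silently selects between 
raw code units and decoded code points.

```d
void main()
{
    string s = "höi"; // 'ö' takes two UTF-8 code units

    size_t units, points;
    foreach (char c; s)  ++units;  // iterates raw code units
    foreach (dchar c; s) ++points; // decodes on the fly

    assert(units == 4 && points == 3);
}
```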

>
> If char is to be distinct from ubyte, and char[] is to be 
> distinct from ubyte[], then autodecoding does the right thing: 
> it makes sure they are distinguished in behavior and embodies 
> the assumption that char is, in fact, a UTF8 code unit.

Distinguishing them is the right thing to do, but auto decoding 
is not the way to achieve that; see above.

