The Case Against Autodecode
Vladimir Panteleev via Digitalmars-d
digitalmars-d at puremagic.com
Thu May 26 21:31:49 PDT 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
wrote:
>> 4. Autodecoding is slow and has no place in high speed string
>> processing.
>
> I would agree only with the amendment "...if used naively",
> which is important. Knowledge of how autodecoding works is a
> prerequisite for writing fast string code in D.
It is completely wasted mental effort.
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the
> right thing instead of having the user wonder separately for
> each case. These uses don't need decoding, and the standard
> library correctly doesn't involve it (or if it currently does
> it has a bug):
>
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
As far as I can see, the language currently does not provide the
facilities to implement the above without autodecoding.
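To illustrate the two options side by side, here is a minimal sketch (using Phobos's std.algorithm and std.string; for an ASCII-only needle the two counts agree, because ASCII bytes never occur inside multi-byte UTF-8 sequences):

```d
import std.algorithm : count, canFind;
import std.string : representation;

void main()
{
    string s = "Hello, world!";

    // With autodecoding: the lambda receives decoded dchar code points.
    auto decoded = s.count!(c => "!()-;:,.?".canFind(c));

    // Without autodecoding: operate on the raw UTF-8 code units.
    auto raw = s.representation
                .count!(b => "!()-;:,.?".representation.canFind(b));

    assert(decoded == 2); // ',' and '!'
    assert(decoded == raw);
}
```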
> However the following do require autodecoding:
>
> s.walkLength
Usage of the result of this expression will be incorrect in many
foreseeable cases: the code-point count it yields matches neither
the number of code units in storage nor the number of
user-perceived characters (graphemes).
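A minimal sketch of the three different "lengths" a string can have (byGrapheme is from Phobos's std.uni):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // 5 code points (autodecoded)
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}
```

Whichever of the three the caller actually means, the autodecoded middle answer is rarely it.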
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
Ditto.
> s.count!(c => c >= 32) // non-control characters
Ditto, with a big red flag. Code that deals with control
characters is likely low-level enough that it needs to be explicit
about what it is counting; a count of decoded code points is
likely not the quantity that actually matters. Such confusion can
lead to security risks.
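A small sketch of how the two interpretations diverge as soon as the input is not pure ASCII:

```d
import std.algorithm : count;
import std.string : representation;

void main()
{
    string s = "é"; // U+00E9, encoded as the two bytes 0xC3 0xA9

    // Autodecoded: one code point >= 32.
    assert(s.count!(c => c >= 32) == 1);

    // Raw code units: both bytes are >= 32, so the count differs.
    assert(s.representation.count!(b => b >= 32) == 2);
}
```

Neither answer is wrong in itself; the danger is code that silently gets one when it assumed the other.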
> Currently the standard library operates at code point level
> even though inside it may choose to use code units when
> admissible. Leaving such a decision to the library seems like a
> wise thing to do.
It should be explicit.
>> 7. Autodecode cannot be used with unicode path/filenames,
>> because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames.
>> It turns
>> out in the wild that pure Unicode is not universal - there's
>> lots of
>> dirty Unicode that should remain unmolested, and autodecode does
>> not play
>> with that.
>
> If paths are not UTF-8, then they shouldn't have string type
> (instead use ubyte[] etc). More on that below.
This is not practical. Do you really see changing std.file and
std.path to accept ubyte[] for all path arguments?
>> 8. In my work with UTF-8 streams, dealing with autodecode has
>> caused me
>> considerably extra work every time. A convenient timesaver it
>> ain't.
>
> Objection. Vague.
I can confirm this vague subjective observation. For example,
DustMite reimplements some std.string functions in order to be
able to handle D files with invalid UTF-8 characters.
>> 9. Autodecode cannot be turned off, i.e. it isn't practical to
>> avoid
>> importing std.array one way or another, and then autodecode is
>> there.
>
> Turning off autodecoding is as easy as inserting
> .representation after any string. (Not to mention using
> indexing directly.)
This is neither easy nor practical, and it makes writing reliable
string-handling code in D a chore. Because it is difficult to
find all the places where .representation must be inserted, it
cannot be done on a program-wide scale; bugs are only discovered
when this or that component fails because it was never tested with
non-ASCII Unicode strings.
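For reference, a sketch of what .representation actually changes (ElementType is from Phobos's std.range):

```d
import std.range : ElementType;
import std.string : representation;

void main()
{
    string s = "héllo";

    // Range primitives over string yield dchar due to autodecoding...
    static assert(is(ElementType!string == dchar));

    // ...while .representation exposes the raw code units.
    static assert(is(typeof(s.representation[0]) == immutable(ubyte)));

    assert(s.length == 6);                // code units ('é' takes two)
    assert(s.representation.length == 6); // same storage, no decoding
}
```

The element type flips from dchar to immutable(ubyte), which is exactly why every algorithm call site has to be audited individually.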
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a
>> key
>> benefit of being arrays in the first place.
>
> First off, you always have the option with .representation.
> That's a great name because it gives you the type used to
> represent the string - i.e. an array of integers of a specific
> width.
>
> Second, it's as it should. The entire scaffolding rests on the
> notion that char[] is distinguished from ubyte[] by having UTF8
> code units, not arbitrary bytes. It seems that many arguments
> against autodecoding are in fact arguments in favor of
> eliminating virtually all distinctions between char[] and
> ubyte[]. Then the natural question is, what _is_ the difference
> between char[] and ubyte[] and why do we need char as a
> separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous
> answer.
Why?
> What is the purpose of char, wchar, and dchar? My current
> understanding is that they're justified as pretty much
> indistinguishable in primitives and behavior from ubyte,
> ushort, and uint respectively, but they reflect a loose
> subjective intent from the programmer that they hold actual UTF
> code units. The core language does not enforce such, except it
> does special things in random places like for loops (any other)?
>
> If char is to be distinct from ubyte, and char[] is to be
> distinct from ubyte[], then autodecoding does the right thing:
> it makes sure they are distinguished in behavior and embodies
> the assumption that char is, in fact, a UTF8 code unit.
I don't follow this line of reasoning at all.
>> 11. Indexing an array produces different results than
>> autodecoding,
>> another glaring special case.
>
> This is a direct consequence of the fact that string is
> immutable(char)[] and not a specific type. That error predates
> autodecoding.
There is no convincing argument why indexing and slicing should
not simply operate on code units.
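The status quo already mixes the two levels in one type, as this sketch shows:

```d
import std.range : front;

void main()
{
    string s = "é"; // two UTF-8 code units: 0xC3 0xA9

    assert(s.length == 2);
    assert(s[0] == 0xC3);   // indexing yields a raw code unit (char)
    assert(s.front == 'é'); // range primitives autodecode to dchar

    // Slicing also operates on code units: s[0 .. 1] is not valid UTF-8.
}
```

Indexing and slicing already see code units; only the range primitives decode, which is the special case being objected to.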
> Overall, I think the one way to make real steps forward in
> improving string processing in the D language is to give a
> clear answer of what char, wchar, and dchar mean.
I don't follow. Though, making char implicitly convertible to
wchar and dchar has clearly been a mistake.
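A sketch of why that implicit conversion is a mistake: it silently reinterprets a UTF-8 code unit as a code point.

```d
void main()
{
    string s = "é";  // UTF-8: 0xC3 0xA9
    char c = s[0];   // a code unit (0xC3), not a character
    dchar d = c;     // implicit conversion treats it as code point U+00C3

    assert(d == 'Ã'); // wrong character, with no compile-time diagnostic
}
```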