Dealing with Autodecode
Jon Degenhardt via Digitalmars-d
digitalmars-d at puremagic.com
Tue Jun 7 02:57:09 PDT 2016
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it
> is too embedded into things. What we can do, however, is stop
> using it ourselves and stop relying on it in the documentation,
> much like [] is eschewed in favor of std::vector in C++.
>
Hopefully my perspective on the auto-decoding topic is useful rather
than disruptive. I work on search applications, both run-time
engines and data science. Processing multi-lingual text is an
important aspect of these applications, and D's current
auto-decoding implementation raises a couple of issues for them.
One is the lack of control over error handling when encountering
corrupt UTF-8 text. Real-world data contains corrupt UTF-8
sequences, and robust applications need to handle them. Proper
handling is generally application specific. Both substituting a
replacement character and throwing an exception are useful
behaviors, but the ability to choose between them is often
necessary. At present, the choice is built into the low-level
primitives, outside application control. Notably, 'front' and
'popFront' behave differently from each other. This is also a
consideration for explicitly invoked decoding facilities like
'byUTF'.
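
To make the two policies concrete, here's a minimal sketch using
current Phobos facilities (assuming 'decode' throws a UTFException on
a bad sequence and 'byUTF' substitutes U+FFFD by default; the names
and the hand-built corrupt input are just for illustration):

import std.stdio : writeln;
import std.utf : byUTF, decode, UTFException;

void main()
{
    // 0xFF can never appear in valid UTF-8.
    immutable(ubyte)[] bytes = [0x61, 0xFF, 0x62];
    string corrupt = cast(string) bytes;

    // Policy 1: throw. decode() (like auto-decoding 'front')
    // raises a UTFException at the bad byte.
    size_t i = 0;
    try
    {
        while (i < corrupt.length)
            writeln(decode(corrupt, i));
    }
    catch (UTFException e)
    {
        writeln("corrupt sequence: ", e.msg);
    }

    // Policy 2: substitute. byUTF yields the replacement character
    // for the bad byte instead of throwing.
    foreach (dchar c; corrupt.byUTF!dchar)
        writeln(cast(uint) c);
}

Today the application only gets that choice by going around the range
primitives; it would be good if the choice were available wherever
decoding happens.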
Another is performance. Iteration that triggers auto-decoding is
apparently an order of magnitude more costly than iteration without
decoding. That is too large a delta when the algorithm doesn't
require decoding, and such algorithms are common. Frankly, I'm
surprised the cost is so large. It wouldn't surprise me to find out
it's partly a compiler artifact, but that doesn't change the picture.
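
To illustrate the kind of algorithm I mean, here's a sketch
(hypothetical function names, not a benchmark): counting an ASCII
delimiter needs no code points at all, yet the plain string version
decodes every character, while byCodeUnit works on raw code units.

import std.algorithm.searching : count;
import std.utf : byCodeUnit;

// Auto-decoding path: count() sees the string as a range of dchar,
// so every code unit is decoded just to match '\t'.
size_t countTabsDecoded(string line)
{
    return line.count('\t');
}

// Non-decoding path: byCodeUnit exposes the raw char code units.
size_t countTabsRaw(string line)
{
    return line.byCodeUnit.count('\t');
}

unittest
{
    assert(countTabsDecoded("a\tb\tc") == 2);
    assert(countTabsRaw("a\tb\tc") == 2);
}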
As to what to do about it - if changing the currently built-in
auto-decoding is not an option, then perhaps providing parallel
facilities that don't auto-decode would do the trick. RCStr seems
like a real opportunity. Perhaps also a raw array of UTF-8 code
units, a la ubyte[], that doesn't get auto-decoded? With either,
explicit decoding would be needed before invoking standard library
routines that operate on Unicode code points or graphemes. (It
sounds like interaction with character literals could still be an
issue, as the actual representation is not obvious.) Having a
consistent set of error-handling options for the explicit decoding
facilities would be helpful as well.
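
Roughly what I have in mind for the ubyte[] route, as a hypothetical
sketch ('countCodePoints' is made up for illustration): keep the text
as raw code units so range algorithms never decode, and drop into
byUTF only where code points are actually needed.

import std.algorithm.iteration : splitter;
import std.utf : byUTF;

// Hypothetical usage: text held as raw UTF-8 code units (ubyte[]),
// so none of the range operations below auto-decode.
size_t countCodePoints(const(ubyte)[] utf8Line)
{
    size_t total = 0;

    // Byte-level work: split on an ASCII delimiter, no decoding needed.
    foreach (field; utf8Line.splitter(cast(ubyte) '\t'))
    {
        // Explicit, opt-in decoding only where code points matter.
        foreach (dchar cp; (cast(const(char)[]) field).byUTF!dchar)
            ++total;
    }
    return total;
}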
Another possibility would be support for detecting inadvertent
auto-decoding. D has very nice support for ensuring or detecting
code properties (e.g. the '@nogc' attribute, the '-vgc' compiler
option). If there were a comparable way to identify code that
triggers auto-decoding, that would be useful.
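
One way to approximate that today, as a sketch (a template
constraint rather than a compiler switch; 'myAlgorithm' is made up):
reject bare narrow strings so any call that would auto-decode fails
to compile, forcing the caller to pick byCodeUnit or byUTF
explicitly.

import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

// Reject char[]/wchar[] (the types that auto-decode); accept any
// other input range, including the wrappers returned by
// byCodeUnit/byUTF.
auto myAlgorithm(R)(R r)
    if (isInputRange!R && !isNarrowString!R)
{
    // ... process r with no hidden decoding ...
    return r;
}

unittest
{
    import std.utf : byCodeUnit;

    static assert(!__traits(compiles, myAlgorithm("text"))); // would auto-decode
    auto ok = myAlgorithm("text".byCodeUnit);                // explicit choice
    assert(!ok.empty);
}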