Dealing with Autodecode
Jon Degenhardt via Digitalmars-d
digitalmars-d at puremagic.com
Tue Jun 7 02:57:09 PDT 2016
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it
> is too embedded into things. What we can do, however, is stop
> using it ourselves and stop relying on it in the documentation,
> much like [] is eschewed in favor of std::vector in C++.
>
Hopefully my perspective on the auto-decoding topic is useful rather
than disruptive. I work on search applications, both run-time
engines and data science. Processing multi-lingual text is an
important aspect of these applications, and D's current
auto-decoding implementation raises a couple of issues for them.
One is the lack of control over error handling when encountering
corrupt UTF-8 text. Real-world data contains corrupt UTF-8
sequences, and robust applications need to handle them. Proper
handling is generally application specific. Both substituting a
replacement character and throwing an exception are useful
behaviors, but the ability to choose between them is often
necessary. At present, the choice is built into the low-level
primitives, outside application control. Notably, 'front' and
'popFront' behave differently from each other. This is also a
consideration for explicitly invoked decoding facilities like
'byUTF'.
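
To make the two policies concrete, here's a minimal sketch using
current Phobos facilities (assuming 'decode' throws a UTFException on
a bad sequence and 'byUTF' substitutes U+FFFD by default; the names
and the hand-built corrupt input are just for illustration):

import std.stdio : writeln;
import std.utf : byUTF, decode, UTFException;

void main()
{
    // 0xFF can never appear in valid UTF-8.
    immutable(ubyte)[] bytes = [0x61, 0xFF, 0x62];
    string corrupt = cast(string) bytes;

    // Policy 1: throw. decode() (like auto-decoding 'front')
    // raises a UTFException at the bad byte.
    size_t i = 0;
    try
    {
        while (i < corrupt.length)
            writeln(decode(corrupt, i));
    }
    catch (UTFException e)
    {
        writeln("corrupt sequence: ", e.msg);
    }

    // Policy 2: substitute. byUTF yields the replacement character
    // for the bad byte instead of throwing.
    foreach (dchar c; corrupt.byUTF!dchar)
        writeln(cast(uint) c);
}

Today the application only gets that choice by going around the range
primitives; it would be good if the choice were available wherever
decoding happens.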
Another is performance. Iteration that triggers auto-decoding is
apparently an order of magnitude more costly than iteration without
decoding. That is too large a delta when the algorithm doesn't
require decoding, and such algorithms are common. Frankly, I'm
surprised the cost is so large. It wouldn't surprise me to find out
it's partly a compiler artifact, but that doesn't change the picture.
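
To illustrate the kind of algorithm I mean, here's a sketch
(hypothetical function names, not a benchmark): counting an ASCII
delimiter needs no code points at all, yet the plain string version
decodes every character, while byCodeUnit works on raw code units.

import std.algorithm.searching : count;
import std.utf : byCodeUnit;

// Auto-decoding path: count() sees the string as a range of dchar,
// so every code unit is decoded just to match '\t'.
size_t countTabsDecoded(string line)
{
    return line.count('\t');
}

// Non-decoding path: byCodeUnit exposes the raw char code units.
size_t countTabsRaw(string line)
{
    return line.byCodeUnit.count('\t');
}

unittest
{
    assert(countTabsDecoded("a\tb\tc") == 2);
    assert(countTabsRaw("a\tb\tc") == 2);
}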
As to what to do about it - if changing the currently built-in
auto-decoding is not an option, then perhaps providing parallel
facilities that don't auto-decode would do the trick. RCStr seems
like a real opportunity. Perhaps also a raw array of UTF-8 code
units, a la ubyte[], that doesn't get auto-decoded? With either,
explicit decoding would be needed before invoking standard library
routines that operate on Unicode code points or graphemes. (It
sounds like interaction with character literals could still be an
issue, as the actual representation is not obvious.) Having a
consistent set of error-handling options for the explicit decoding
facilities would be helpful as well.
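
Roughly what I have in mind for the ubyte[] route, as a hypothetical
sketch ('countCodePoints' is made up for illustration): keep the text
as raw code units so range algorithms never decode, and drop into
byUTF only where code points are actually needed.

import std.algorithm.iteration : splitter;
import std.utf : byUTF;

// Hypothetical usage: text held as raw UTF-8 code units (ubyte[]),
// so none of the range operations below auto-decode.
size_t countCodePoints(const(ubyte)[] utf8Line)
{
    size_t total = 0;

    // Byte-level work: split on an ASCII delimiter, no decoding needed.
    foreach (field; utf8Line.splitter(cast(ubyte) '\t'))
    {
        // Explicit, opt-in decoding only where code points matter.
        foreach (dchar cp; (cast(const(char)[]) field).byUTF!dchar)
            ++total;
    }
    return total;
}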
Another possibility would be support for detecting inadvertent
auto-decoding. D has very nice support for ensuring or detecting
code properties (e.g. the '@nogc' attribute, the '-vgc' compiler
option). If there were a comparable way to identify code that
triggers auto-decoding, that would be useful.
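
One way to approximate that today, as a sketch (a template
constraint rather than a compiler switch; 'myAlgorithm' is made up):
reject bare narrow strings so any call that would auto-decode fails
to compile, forcing the caller to pick byCodeUnit or byUTF
explicitly.

import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

// Reject char[]/wchar[] (the types that auto-decode); accept any
// other input range, including the wrappers returned by
// byCodeUnit/byUTF.
auto myAlgorithm(R)(R r)
    if (isInputRange!R && !isNarrowString!R)
{
    // ... process r with no hidden decoding ...
    return r;
}

unittest
{
    import std.utf : byCodeUnit;

    static assert(!__traits(compiles, myAlgorithm("text"))); // would auto-decode
    auto ok = myAlgorithm("text".byCodeUnit);                // explicit choice
    assert(!ok.empty);
}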