The Case Against Autodecode
Marc Schütz via Digitalmars-d
digitalmars-d at puremagic.com
Tue May 31 06:33:14 PDT 2016
On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>> In an ideal world, we'd also want to change the way `length`
>> and `opIndex` work,
>
> Why? strings are arrays of code units.
So, strings are _implemented_ as arrays of code units. But
indiscriminately treating them as such in all situations leads to
wrong results (just like arrays of code points would).
In an ideal world, the programs someone intuitively writes will
do the right thing, and if they can't, they at least refuse to
compile. If we agree that it's up to the user whether to iterate
over a string by code unit or code points or graphemes, and that
we shouldn't arbitrarily choose one of those (except when we know
that it's what the user wants), then the same applies to
indexing, slicing and counting.
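
To make the three levels concrete, here is a minimal sketch of how "counting" gives three different answers for the same string. The helper names (`codeUnitCount`, `codePointCount`, `graphemeCount`) are made up for illustration; the range helpers are the ones that, as far as I know, live in `std.utf` and `std.uni` in current Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Number of UTF-8 code units; this is what `s.length` reports.
size_t codeUnitCount(string s) { return s.byCodeUnit.walkLength; }

/// Number of code points, obtained by explicitly decoding to dchar.
size_t codePointCount(string s) { return s.byDchar.walkLength; }

/// Number of graphemes (user-perceived characters).
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

unittest
{
    // "noël" spelled with a decomposed ë: 'e' followed by
    // U+0308 COMBINING DIAERESIS (two code units in UTF-8).
    string s = "noe\u0308l";
    assert(s.length == 6);
    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);
    assert(graphemeCount(s) == 4);
}
```

None of the three numbers is "the" length; which one is right depends on what the caller is trying to do.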
On the other hand, changing such low-level things will likely be
impractical; that's why I said "In an ideal world".
> All the trouble comes from erratically pretending otherwise.
For me, the trouble comes from pretending otherwise _without
being told to_.
To make sure there are no misunderstandings, here is what is
suggested as an alternative to the current situation:
* `char[]`, `wchar[]` (and `dchar[]`?) no longer pass
`isInputRange`.
* Ranges with element type `char`, `wchar`, and `dchar` do pass
`isInputRange`.
* A bunch of rangeifying helpers are provided (I believe they
already exist, in `std.utf` and `std.uni` rather than
`std.string`): `byCodePoint`, `byCodeUnit`, `byChar`, `byWchar`,
`byDchar`, ...
* Algorithms like `find`, `join(er)` get overloads that accept
char slices directly.
* Built-in operators and `length` of char slices are unchanged.
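
A rough sketch of what user code might look like under this scheme, written against helpers that, to my knowledge, already exist in `std.utf` today. The function names `hasAsciiSlash` and `countCodePoint` are hypothetical. The part this cannot show is the compile error for passing a bare `char[]` to a range algorithm, since current Phobos still accepts that and auto-decodes.

```d
import std.algorithm.searching : canFind, count;
import std.utf : byCodeUnit, byDchar;

/// Searching for an ASCII character is correct at the code-unit level,
/// because no byte of a multi-byte UTF-8 sequence is in the ASCII range.
/// No decoding is needed, so the caller asks for code units explicitly.
bool hasAsciiSlash(const(char)[] s)
{
    return s.byCodeUnit.canFind('/');
}

/// Counting occurrences of an arbitrary code point needs decoding,
/// so the caller asks for it explicitly with byDchar.
size_t countCodePoint(const(char)[] s, dchar needle)
{
    return s.byDchar.count(needle);
}

unittest
{
    assert(hasAsciiSlash("usr/bin"));
    assert(!hasAsciiSlash("ma\u00F1ana"));          // contains U+00F1, no '/'
    assert(countCodePoint("\u00F1and\u00FA \u00F1", '\u00F1') == 2);
}
```

In both functions the element type is chosen at the call site rather than by the library.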
Advantages:
* Algorithms that can work _correctly_ without any kind of
decoding will do so.
* Algorithms that would yield incorrect results won't compile,
requiring the user to make a decision regarding the desired
element type.
* No auto-decoding.
  => Best performance depending on the actual requirements.
  => No results that look correct when tested with only
     precomposed characters but are wrong in the general case.
* Behaviour of [] and .length is no worse than today.
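
As an illustration of the "looks correct with precomposed characters" trap, here is a small sketch comparing two strings code point by code point. The helper names `sameCodePoints` and `sameAfterNfc` are invented for this example, and using `std.uni.normalize` (NFC by default, as far as I know) is just one possible way to get the comparison right; it is not part of the proposal above.

```d
import std.algorithm.comparison : equal;
import std.uni : normalize;
import std.utf : byDchar;

/// Compares two strings code point by code point.
bool sameCodePoints(string a, string b)
{
    return equal(a.byDchar, b.byDchar);
}

/// Compares two strings after normalizing both (default form: NFC).
bool sameAfterNfc(string a, string b)
{
    return normalize(a) == normalize(b);
}

unittest
{
    string precomposed = "caf\u00E9";   // é as a single code point
    string decomposed  = "cafe\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

    // A test suite that only uses precomposed input passes...
    assert(sameCodePoints(precomposed, "caf\u00E9"));
    // ...but the same text in decomposed form compares unequal.
    assert(!sameCodePoints(precomposed, decomposed));
    assert(sameAfterNfc(precomposed, decomposed));
}
```

Decoding to code points, whether automatic or explicit, does not by itself make this comparison correct.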