The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 06:33:14 PDT 2016


On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>> In an ideal world, we'd also want to change the way `length` 
>> and `opIndex` work,
>
> Why? strings are arrays of code units.

So, strings are _implemented_ as arrays of code units. But 
indiscriminately treating them as such in all situations leads to 
wrong results (just like arrays of code points would).

In an ideal world, the programs someone intuitively writes will 
do the right thing, and if they can't, they at least refuse to 
compile. If we agree that it's up to the user whether to iterate 
over a string by code unit or code points or graphemes, and that 
we shouldn't arbitrarily choose one of those (except when we know 
that it's what the user wants), then the same applies to 
indexing, slicing and counting.
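
To make the three levels concrete, here is a minimal sketch of how "counting" differs depending on what is counted. The helper names (`countCodeUnits`, `countCodePoints`, `countGraphemes`) are made up for illustration; `byDchar` and `byGrapheme` are taken from `std.utf` and `std.uni`, which is where they live in the Phobos versions I am aware of.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

/// Hypothetical helper: number of UTF-8 code units, i.e. plain `.length`.
size_t countCodeUnits(string s) { return s.length; }

/// Hypothetical helper: number of code points, with decoding requested explicitly.
size_t countCodePoints(string s) { return s.byDchar.walkLength; }

/// Hypothetical helper: number of graphemes (user-perceived characters).
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(countCodePoints(s) == 2); // 'e' and the combining accent
    assert(countGraphemes(s) == 1);  // what a user would call one character
}
```

All three answers are defensible; which one is "the" length depends on what the caller is trying to do, which is the point being made above.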

On the other hand, changing such low-level things will likely be 
impractical; that's why I said "In an ideal world".

> All the trouble comes from erratically pretending otherwise.

For me, the trouble comes from pretending otherwise _without 
being told to_.

To make sure there are no misunderstandings, here is what is 
suggested as an alternative to the current situation:

* `char[]`, `wchar[]` (and `dchar[]`?) no longer pass 
`isInputRange`.
* Ranges with element type `char`, `wchar`, and `dchar` do pass 
`isInputRange`.
* A bunch of rangeifying helpers are added to `std.string` (I 
believe they are already there): `byCodePoint`, `byCodeUnit`, 
`byChar`, `byWchar`, `byDchar`, ...
* Algorithms like `find`, `join(er)` get overloads that accept 
char slices directly.
* Built-in operators and `length` of char slices are unchanged.
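
As a rough illustration of the explicit style this list implies, the sketch below picks the element type at the call site using helpers that exist in Phobos today (`byCodeUnit` and `byDchar`, found in `std.utf` in the versions I know). The function names `countDelimiters` and `containsCodePoint` are hypothetical, and the sketch cannot show the proposed compile-time rejection of bare `char[]`, since that is not how current Phobos behaves.

```d
import std.algorithm.searching : canFind, count;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: counts an ASCII delimiter without decoding.
/// This is correct at code-unit level because in UTF-8 an ASCII byte
/// never occurs inside a multi-byte sequence.
size_t countDelimiters(string s, char delim)
{
    assert(delim < 0x80, "only ASCII delimiters are safe at code-unit level");
    return s.byCodeUnit.count(delim);
}

/// Hypothetical helper: searches for an arbitrary code point.
/// Decoding happens because the caller asked for it via byDchar.
bool containsCodePoint(string s, dchar cp)
{
    return s.byDchar.canFind(cp);
}

void main()
{
    // "a", U+00E4 (a with diaeresis), U+20AC (euro sign), comma-separated.
    string s = "a,\u00E4,\u20AC";

    assert(countDelimiters(s, ',') == 2);
    assert(containsCodePoint(s, '\u20AC'));
    assert(!containsCodePoint(s, 'z'));
}
```

The first helper never decodes, the second decodes only because it has to; neither relies on the library guessing the element type.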

Advantages:

* Algorithms that can work _correctly_ without any kind of 
decoding will do so.
* Algorithms that would yield incorrect results won't compile, 
requiring the user to make a decision regarding the desired 
element type.
* No auto-decoding.
   => Best performance depending on the actual requirements.
   => No results that look correct when tested with only 
precomposed characters but are wrong in the general case.
* Behaviour of [] and .length is no worse than today.
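
The point about precomposed characters can be sketched with string reversal. The two function names are hypothetical; the grapheme-based pipeline follows the `byGrapheme`/`byCodePoint` usage shown in the `std.uni` documentation, and `byDchar` is assumed to come from `std.utf`.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;
import std.utf : byDchar;

/// Hypothetical helper: reverses the sequence of code points.
string reverseByCodePoint(string s)
{
    return s.byDchar.array.retro.text;
}

/// Hypothetical helper: reverses the sequence of graphemes, keeping each
/// base character together with its combining marks.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // Precomposed U+00EB: both approaches agree, so a test passes.
    assert(reverseByCodePoint("no\u00EBl") == "l\u00EBon");
    assert(reverseByGrapheme("no\u00EBl") == "l\u00EBon");

    // Decomposed 'e' + U+0308: code-point reversal moves the diaeresis
    // onto the 'l', while grapheme reversal keeps it on the 'e'.
    assert(reverseByCodePoint("noe\u0308l") == "l\u0308eon");
    assert(reverseByGrapheme("noe\u0308l") == "le\u0308on");
}
```

A test suite that only contained the precomposed input would not notice that the code-point version is wrong in the general case.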

