std.experimental.collections.rcstring and its integration in Phobos

aliak something at something.com
Wed Jul 18 22:44:33 UTC 2018


On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
> On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt [...]
>> and whether applications would use arrays and ranges of char 
>> together with rcstring, or rcstring would be used for 
>> everything.
>
> That point is still open for discussion, but at the moment 
> rcstring isn't a range and the user has to declare what kind of 
> range he/she wants with e.g. `.by!char`
> However, one current idea is that for some use cases (e.g. 
> comparison) it might not matter and an application could add 
> overloads for rcstrings.

Maybe I misunderstood, but do you mean that for comparisons only 
the encoding doesn't matter? Even then, normalization still 
matters: e.g. Unicode defines U+00F1 as canonically equal to the 
sequence U+006E U+0303, and comparison only works as long as both 
sides are normalized (from what I understand at least), regardless 
of whether you compare chars, wchars, or dchars.
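To illustrate that point - a small sketch using Phobos' std.uni 
(normalize and the NFC form are real Phobos APIs; the example 
itself is mine):

```d
import std.uni : normalize, NFC;

void main()
{
    string composed   = "\u00F1";   // "ñ" as a single precomposed code point
    string decomposed = "n\u0303";  // "n" followed by a combining tilde

    // Raw comparison of the underlying data fails...
    assert(composed != decomposed);

    // ...but after NFC normalization both sides compare equal.
    assert(normalize!NFC(composed) == normalize!NFC(decomposed));
}
```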

> The current idea is to do the same thing for Phobos - though I 
> have to say that I'm not really looking forward to adding 200 
> overloads to Phobos :/
>
>> Perhaps it's too early for these questions, and the current 
>> goal is simpler. For example, adding a meaningful collection 
>> class that is @nogc, @safe and ref-counted that can be used 
>> as a proving ground for the newer memory management 
>> facilities being developed.
>
> That's the long-term goal of the collections project.
> However, with rcstring being the first big use case for it, the 
> idea was to push rcstring forward and by that discover all 
> remaining issues with the Array class.
> Also the interface of rcstring is rather contained (and doesn't 
> expose the underlying storage to the user), which allows us to 
> iterate over/improve upon the Array design.
>
>> Such simpler goals would be quite reasonable. What's got me 
>> wondering about the larger questions are the comments about 
>> ranges and autodecoding. If rcstring is intended as a vehicle 
>> for general @nogc handling of character data and/or for 
>> reducing the impact of autodecoding, then it makes sense to 
>> consider from those perspectives.
>
> Hehe, it's intended to solve both problems (auto-decoding by 
> default and @nogc) at the same time.
> However, it looks to me like there isn't a good solution to 
> the auto-decoding problem that is convenient to use for the 
> user and doesn't sacrifice on performance.

How about a compile-time parameter that can make things more 
convenient:

auto str1 = latin1("literal");
rcstring!Latin1 latin1(string str) {
   return rcstring!Latin1(str);
}

auto str2 = utf8("åsm");
// ...

struct rcstring(Encoding = Unknown) {
   ubyte[] data;

   static if (is(Encoding == Latin1)) {
     // by-char range interface implementation
   } else static if (is(Encoding == Utf8)) {
     // byGrapheme range interface implementation?
   } else {
     // no range interface implementation
   }

   bool opEquals()(auto ref const rcstring lhs) const {
     static if (is(Encoding == Latin1)) {
       // single-byte encoding, no combining characters:
       // the raw bytes compare directly
       return data == lhs.data;
     } else {
       // normalized() would be a helper returning the
       // normalized code points
       return normalized() == lhs.normalized();
     }
   }
}

And now most ranges will work correctly. Then the algorithms that 
don't need byGrapheme, but just need normalized code points to 
work correctly, can use those, and that seems like all the 
special handling you'd need inside range algorithms?

Then:

readText("foo".latin1);
"ä".utf8.split.join("|");

??
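
As a side note on why the grapheme-vs-code-point distinction 
matters here - a sketch with Phobos' std.uni and std.range (the 
APIs used are real; the example itself is mine):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "ñ" written as base letter + combining tilde
    string s = "n\u0303";

    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme: what the user sees
}
```

So the same string has three different "lengths" depending on the 
level you iterate at, which is exactly why rcstring makes you pick 
one explicitly.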

Cheers,
- Ali





