std.experimental.collections.rcstring and its integration in Phobos

Radu void at null.pt
Thu Jul 19 07:55:43 UTC 2018


On Wednesday, 18 July 2018 at 22:44:33 UTC, aliak wrote:
> On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
>> On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt 
>> [...]
>>> [...]
>>
>> That point is still open for discussion, but at the moment 
>> rcstring isn't a range and the user has to declare what kind 
>> of range he/she wants with e.g. `.by!char`
>> However, one current idea is that for some use cases (e.g. 
>> comparison) it might not matter and an application could add 
>> overloads for rcstrings.
>
> Maybe I misunderstood but you mean that for comparisons the 
> encoding doesn't matter only right? But that does not preclude 
> normalization, e.g. unicode defines U+00F1 as equal to the 
> sequence U+006E U+0303 and that would work as long as they're 
> normalized (from what I understand at least) and regardless of 
> whether you compare char/wchar/dchars.
>
>> The current idea is to do the same this for Phobos - though I 
>> have to say that I'm not really looking forward to adding 200 
>> overloads to Phobos :/
>>
>>> [...]
>>
>> That's the long-term goal of the collections project.
>> However, with rcstring being the first big use case for it, 
>> the idea was to push rcstring forward and by that discover all 
>> remaining issues with the Array class.
>> Also the interface of rcstring is rather contained (and 
>> doesn't expose the underlying storage to the user), which 
>> allows us to iterate over/improve upon the Array design.
>>
>>> [...]
>>
>> Hehe, it's intended to solve both problems (auto-decoding by 
>> default and @nogc) at the same time.
>> However, it looks like to me like there isn't a good solution 
>> to the auto-decoding problem that is convenient to use for the 
>> user and doesn't sacrifice on performance.
>
> How about a compile time flag that can make things more 
> convenient:
>
> auto str1 = latin1("literal");
> rcstring!Latin1 latin1(string str) {
>   return rcstring!Latin1(str);
> }
>
> auto str2 = utf8("åsm");
> // ...
>
> struct rcstring(Encoding = Unknown) {
>   ubyte[] data;
>   bool normalized = false;
>   static if (is(Encoding == Latin1)) {
>     // by char range interface implementation
>   } else static if (is(Encoding == Utf8)) {
>     // byGrapheme range interface implementation?
>   } else {
>     // no range interface implementation
>   }
>
>   bool opEquals()(auto ref const rcstring lhs) const {
>     static if (is(Encoding == Latin1)) {
>       return data == lhs.data;
>     } else {
>       return normalized() == lhs.normalized();
>     }
>   }
>
> }
>
> And now most ranges will work correctly. And then some of the 
> algorithms that don't need to use byGrapheme but just need 
> normalized code points to work correctly can do that and that 
> seems like all the special handling you'll need inside range 
> algorithms?
>
> Then:
>
> readText("foo".latin1);
> "ä".utf8.split.join("|");
>
> ??
>
> Cheers,
> - Ali

I like this approach; `rcstring.by!` is too verbose for my taste 
and quite annoying for day-to-day usage.

I think rcstring should be aliased by concrete implementations 
like ansi, utf8, utf16, utf32. Those aliases should be ranges, and 
maybe subtype their respective string, wstring, dstring so they 
can be used transparently with non-range-based APIs (this requires 
DIP1000 for @safe).
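
A minimal sketch of what such aliases could look like. Note that the 
`RCString` template, its `data` field, and the alias names below are 
hypothetical illustrations, not the actual std.experimental.collections 
API; the real rcstring would hold a ref-counted payload rather than a 
plain slice:

```d
// Hypothetical sketch: an encoding-parameterized string wrapper whose
// concrete aliases subtype the matching built-in string type via
// `alias this`, so non-range-based APIs accept them transparently.
struct RCString(Char)
{
    immutable(Char)[] data; // stand-in for the real ref-counted payload
    alias data this;        // subtype string / wstring / dstring
}

alias Utf8String  = RCString!char;   // usable wherever a string is expected
alias Utf16String = RCString!wchar;  // ... a wstring
alias Utf32String = RCString!dchar;  // ... a dstring

void main()
{
    import std.algorithm : splitter;
    import std.array : join;

    auto s = Utf8String("a b c");
    string plain = s;   // implicit conversion through the subtype
    assert(plain == "a b c");

    // Range algorithms also work through the subtyping conversion.
    assert(s.splitter(' ').join("|") == "a|b|c");
}
```

Exposing the payload as a slice like this is exactly the spot where 
DIP1000 lifetime checking would be needed to keep the conversion @safe 
for a real ref-counted implementation.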

The takeaway is that rcstring by itself does not satisfy the 
usability criteria; it should probably focus on performance and 
flexibility, serving as a building block for higher-level 
constructs that are easier to use and safer in regard to how they 
work with the string type they hold.


More information about the Digitalmars-d mailing list