[Dlang-study] [rcstring] Defining rcstring

Михаил Страшун public at dicebot.lv
Tue Feb 2 20:46:53 PST 2016


Quick remark: I am out of the loop on the latest RC discussions and have no
idea how it is going to be implemented in a compiler-friendly way. So for
now I'll assume that it just magically works and focus on derived topics.

On 02/02/2016 11:40 PM, Andrei Alexandrescu wrote:
> * make it part of a nascent std.experimental.lifetime module/package
>
> * call it rcstring or RCString? The first makes it closer to "string",
> the other is politically correct.
>
> * Characters are small so no need to return them by reference. Because
> of this, making RCString @safe should be possible in current D.
> However, this also makes RCString not a plug-in replacement for string
> (which may after all be a good thing)
>
> * Since string-compatibility is off the table, how about we fix
> string's issues with autodecoding? RCString should offer no indexed
> access and no length. Instead it offers the ranges byCodeUnit, byChar,
> byWChar, and byDChar. The first one does not do any decoding and
> offers length and random access. (What should be its element type?)
> The other ones are bidirectional ranges that do the appropriate decoding.

Opinion
-------

The idea of a "next generation" opt-in string replacement sounds very
appealing on its own. However, I don't see it fitting into
`std.experimental.lifetime`, because to act as a full-blown string
replacement (and to be eventually enhanced by compiler support) it needs
to be part of druntime. Putting the initial proof-of-concept
implementation into `std.experimental` is a good approach, but I don't
think it makes sense to promise from the very beginning that it will go
to the matching `std.lifetime`.

I absolutely support the explicit `byCodeUnit`/`byChar` requirement
(skipping naming bikeshedding for now), but it also needs to work
naturally with `byGrapheme`. Importing all of std.uni is certainly
overkill, but the documentation for the new modules must mention it as
the most "correct" option for multi-language text processing.
The element type of `byCodeUnit` should be `ubyte` in my opinion, so that
it is clear that a single element is not a valid char on its own.
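
To make the distinction concrete, here is a minimal sketch using what
Phobos already offers for plain `string` today (`std.utf.byCodeUnit`,
`std.utf.byDchar`, `std.uni.byGrapheme`, `std.string.representation`).
The helper names `codeUnits`, `codePoints` and `graphemes` are made up
for illustration and say nothing about the eventual `rcstring` API.

```d
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: number of UTF-8 code units, no decoding involved.
size_t codeUnits(string s) { return s.byCodeUnit.length; }

// Hypothetical helper: number of code points, decoded lazily.
size_t codePoints(string s) { return s.byDchar.walkLength; }

// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePoints(s) == 2); // two code points
    assert(graphemes(s) == 1);  // one grapheme

    // representation exposes the same data as immutable(ubyte)[], which
    // makes it obvious that a single element is not a character.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 3);
}
```

Only the grapheme count matches what a reader would call "one
character", which is why `byGrapheme` deserves a mention in the docs.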

I'm not sure what you mean by "no need to return them by reference",
though. Does that apply only to the byX ranges, or do you want to make
the whole string effectively unmodifiable? In other words, what will the
idiom of a mutable reusable buffer look like?

Related Stories
---------------

When it comes to encoding, there is also the issue of how lacking the
current support for non-UTF encodings in Phobos is. I want to share two
stories from personal experience where it was a problem:

1. A friend of mine got very excited about Pegged from the DConf videos
and wanted to try it out for a project consisting of a bunch of very
small DSL implementations. However, after initial experiments he quickly
found out that Pegged always uses char[] internally with auto-decoding
and offers no way to change it. It wasn't even a performance issue - my
friend needed the resulting parser to work with extended ASCII text
input, which isn't valid UTF at all. He abandoned the idea of using
D/Pegged on discovering this and switched to a solution he liked less
but which actually worked.

2. A while ago I was helping to optimize a small piece of
bioinformatics code which was processing a DNA sequence from a huge text
file. Initial performance was surprisingly bad and one of the biggest
speedups (~ 2x-3x) was achieved by casting read data to ubyte[] and
reimplementing Phobos functions from std.string that expect `string`
arguments to also accept ubyte[] - because the text

Both stories show one recurring issue with the existing string design -
simply using `string` or `char[]` every time you think about text is so
easy that even experienced developers keep forgetting that it must
represent Unicode text and that other options may be more applicable.
Requiring an explicit `byChar` is one good way to encourage thinking
about the encoding in use, but I think it is also very important for the
new `rcstring` to support other kinds of encodings without too much
added hassle.
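
The workaround from the second story can be sketched roughly like this.
It is an illustration only, not the original bioinformatics code:
`gcCount` and the sample sequence are made up, and no particular speedup
is claimed for it. The point is that range algorithms applied to a
`string` decode it to `dchar`, while the same algorithm applied to its
`representation` sees plain bytes.

```d
import std.algorithm.searching : count;
import std.string : representation;

// Hypothetical helper: counts G and C bases in a DNA sequence.
size_t gcCount(string dna)
{
    // representation reinterprets the string as immutable(ubyte)[]
    // without copying, so count iterates bytes and never decodes UTF-8.
    return dna.representation.count!(b => b == 'G' || b == 'C');
}

void main()
{
    assert(gcCount("GATTACAGGC") == 5);
}
```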

Proposals
---------

For me, an idealised sequence of events could look like this:

1. Put a draft implementation of `rcstring` into
`std.experimental.rcstring`, making it templated on an encoding policy.
2. When it looks good, move the implementation to druntime and very
slowly start using it internally there.
3. Make `utf8string` an alias for `rcstring!(Encoding.UTF8)` and use it
as the only string type within druntime itself (it doesn't need to
support any others).
4. Enhance std.encoding to provide any additional rcstring encodings,
most importantly raw ASCII.
5. Review Phobos to deprecate uncalled-for assumptions about `char[]`,
especially when working with I/O (files, stdin).
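
A very rough sketch of what "templated on encoding policy" in steps 1
and 3 might mean. Everything here is an assumption made for
illustration: the `Encoding` enum members, the struct layout and the
method names are hypothetical, and reference counting is left out
entirely so that only the policy parameter is shown.

```d
import std.utf : byDchar;

// Hypothetical policy enum; the proposal only names Encoding.UTF8.
enum Encoding { ASCII, UTF8 }

// Hypothetical stand-in for rcstring. Reference counting is omitted:
// the payload is an ordinary GC slice.
struct rcstring(Encoding enc)
{
    private immutable(ubyte)[] data;

    this(string s)
    {
        import std.string : representation;
        data = s.representation;
    }

    // Raw code units: available for every policy, never decodes.
    immutable(ubyte)[] byCodeUnit() { return data; }

    // Decoding is only offered when the policy says the bytes are UTF-8.
    static if (enc == Encoding.UTF8)
    {
        auto byDChar() { return (cast(string) data).byDchar; }
    }
}

alias utf8string = rcstring!(Encoding.UTF8);
alias asciistring = rcstring!(Encoding.ASCII);

void main()
{
    import std.range : walkLength;

    auto u = utf8string("n\u00E9"); // 'n' plus U+00E9 (2 bytes in UTF-8)
    assert(u.byCodeUnit.length == 3);
    assert(u.byDChar.walkLength == 2);

    // The ASCII policy has no decoding range at all.
    static assert(!__traits(hasMember, asciistring, "byDChar"));
}
```

With a design along these lines, code that needs raw bytes or a
non-Unicode encoding picks a different policy instead of fighting
`char[]` assumptions.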

It would take a very long time, of course, but I always prefer to know
the long-term picture for any new design.

> * Immutable does not play well with reference counting. I'm of a mind
> to reject immutable rcstring for now and figure out later how to go
> about it. Then const rcstring is okay because we always consider const
> a view on mutable strings (even though they're gone). We'll cast const
> away when manipulating the refcount.

This one is the toughest in my opinion. Putting aside my own opinion and
preferences, you should have answers to several questions if you pursue
this route:

* What are the use cases for const if one wants to prohibit immutable
for a given type? Being a wildcard for mutable/immutable is the main
idea behind const. Everything else is just making the compiler happy
when it forces const on you (like the `this` pointer within in/out
contracts).
* As a consequence, how will the compiler ensure that in/out contracts
won't affect the refcounting state for `this` if it becomes legal to
cast const away and mutate?
* How do you envision efficient cross-thread sharing of rcstring if
immutability is out of the question?
* If one can't support immutability for something relatively simple and
specialized like a char array, doesn't it effectively kill the concept
of immutable containers, which is important for multi-threading?

Right now I am of the opinion that the issues with immutability
highlight issues in your desired approach to const and should not be
discarded that easily. But it is a feeling caused more by lack of
understanding than by hard proof, so I am interested in how you envision
it all fitting together. I am also most interested to learn what Walter
has to say on the topic, especially with regard to how such changes to
const would affect code generation and the optimizations available to
the compiler.

> * I don't have the small string optimization implemented yet, but
> obviously the definition of the type should allow it.

Sounds fairly uncontroversial.

Best Regards,
Dicebot
