.NET on a string

Tue Mar 17 21:07:48 PDT 2009

On Tue, 17 Mar 2009 13:03:56 -0400, Steven Schveighoffer  
<schveiguy at yahoo.com> wrote:

> On Tue, 17 Mar 2009 10:59:21 -0400, Georg Wrede <georg.wrede at iki.fi>  
> wrote:
>
>> Walter Bright wrote:
>>> Cristi talks about adapting D strings to .NET.
>>>   
>>> http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?already_submitted=true
>>
>> Actually, meant to ask earlier, but
>>
>> "[...in .NET,] System.Array and System.ArraySegment are distinct,  
>> unrelated types. In D arrays slices are indistinguishable from arrays  
>> and this creates the problems that I mentioned in the interview."
>>
>> Is this discussed somewhere?
>>
>> The idea of slices and arrays being distinct types does seem to have  
>> advantages. I've seen a couple of mentions of this lately, but has  
>> there been a *rigorous* discussion?
>
> There has been.  But there are very good reasons to keep arrays and  
> slices the same type.  Even in C# and Java, a substring is the same type  
> as a string.  It allows iterative patterns such as:
>
> str = str[1..$];
>
> The main problem comes form having a substring overwrite data in another  
> string via appending.  From what I can tell, that is the *only* issue  
> with arrays and slices being the same type.  This problem can be solved  
> in other ways.  I've put forth 2 solutions that as far as I know were  
> not proven to be infeasible, but which never received any attention in  
> the NG  from Walter or Andrei.
>
> There was mention of a separate T[new] type, before my time with D, but  
> I think you still have the same aliasing problems there unless you zero  
> out the source instance on assignment, or simply disallow assigning a  
> T[new]  from another T[new], which can affect the design of code, and  
> make certain iterative patterns difficult if not impossible.
>
> The two solutions I put forth are:
>
> 1. Storing the requested length of an allocated block in the GC.  This  
> not only allows for more intuitive appending, but could also help the GC  
> when scanning for pointers (you might not have to zero out the block  
> first).  This proposal received zero responses.

Well, if it helps, my roommate and I just independently came up with this  
idea. (So is this great minds think a like or that simple ones seldom  
differ?)

> 2. Storing the requested length of an allocated array at the front of  
> the array.  I had a hackish scheme to use the most significant bit in  
> the length field to identify if a slice is just beyond the allocated  
> length.  Therefore, an append operation would first check if it is the  
> first element in a block, and if so check the allocated length before  
> deciding whether to append in place or allocate a new block.  There were  
> some questions on the proposal, but I don't believe anyone found any  
> killer problems with it.  It has advantages and disadvantages over the  
> first proposal and current implementation.  The main advantage is that  
> the append code does not have to query the GC to get the requested  
> length.

You mentioned being un-sure of the threading and alignment issues of your  
proposal in [2]. I see both a lock and lock-free solution:

Lock solution:
As to append to an array, one has to know the capacity, which requires  
taking the GC lock. If, as in [1], the used length and capacity are stored  
in the GC together, one could move the extension operation to the GC. i.e.  
given a block, the used length in that block and an array (consisting of a  
pointer into that block and a length) it is straight forward to calculate  
if that array could be lengthen and if not, allocate a new one. The caller  
can then do a full or partial copy as appropriate. This avoids the  
alignment issues (as no new information is stored in the block) and  
maintains the maximum array size (although this is a fairly minor corner  
case)

Lock-free solution
This solution would make use of the 'empty' 16-byte block that proceeds  
each allocated block. I'm assuming this is usable, but if it's not, an  
extra 16-byte block would have to be added to each heap array to maintain  
proper alignment. This solution would cache the array capacity and its  
used length just prior to the start of the array (i.e. in the 'empty'  
block). This is similar to [2], except that in [2] you still have to query  
the GC for capacity. Again, a bit flag in the array length determines  
slice / non-slice status. Regarding multi-threading issues: reading  
capacity  suffers from publication safety issues (I think) but it might  
get cleared up by the releasing of the GC allocation lock. Otherwise, it  
would need to be fenced. As for the length, one can use a single compare  
and swap (comparing the GC's length with the local array's length and  
setting the new desired length). This also has the advantage of avoiding  
the GC when doing array appending, which decreases the need for separate  
ArrayBuldier classes. (Of course, once shared/local is introduced, the  
expensive fences and CAS instructions can be limited appropriately.)

> Aside from the append problem, I see no other drawbacks to having  
> strings the way they are.
>
> References:
>
> Proposal 1:  
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=63146
>
> Proposal 2:  
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=77437
>
> -Steve