Container insertion and removal

Sun Mar 7 19:07:14 PST 2010

On Sun, 07 Mar 2010 12:43:09 -0500, Robert Jacques <sandford at jhu.edu>  
wrote:

> On Sun, 07 Mar 2010 08:23:03 -0500, Steven Schveighoffer  
> <schveiguy at yahoo.com> wrote:
>>
>> How do you know when to stop?  A range has a beginning and an ending,  
>> otherwise it's an iterator.  Whether you store it via a pointer to the  
>> last (not-to-be-iterated) node, or via a length, or via a value to  
>> compare with for stopping, you have to use something.  Or are you  
>> asserting the only useful range for a linked list iterates the entire  
>> list?
>>
>> Think of it as the equivalent of a slice of an array.
>
> Please define for me an O(1) slice or index operation for a linked-list.

One for which you have references to the slice end points.  I think this  
will work, and I was planning on providing it in the upcoming dcollections  
port.  The only thing you cannot guarantee is that the order is correct.

> By the way, having ranges detect if they reach their end nodes or not is  
> fairly easy to do.

you are correct in that point, you could throw an exception as long as the  
end point is part of the range structure.  If you just use a current node  
plus a length, then you cannot do that.  But soft ops are not necessary to  
create this.

>
>> I still fail to see the difference between "soft" operations and  
>> non-soft.  What does soft guarantee?  Give me a concrete definition, an  
>> example would help too.
>
> There are a couple of possible definitions for soft operations: 1) the  
> memory safety of the ranges of a collection are guaranteed. 2) That for  
> the topology viewed by a range isn't logically changed. i.e. the range  
> will continue to perform the same logical function if the topology its  
> operating on is updated 3) That for the topology viewed by a range isn't  
> actually changed and all elements selected at range creation will be  
> viewed. 4) Like 3, but with all values being viewed.
>
> For example, modifying an array in any way doesn't change 1, 2 or 3 for  
> any of its slices.
> For a linked list defining a forward range, mutation, insertion and  
> removal can be done under 1 & 2.
> The same can be said about doubly linked lists and bidirectional ranges.
> For other containers, such as a sorted tree, mutation can break a 2/3  
> though insertion and deletion don't break 2.
> Although, the ranges will see many values, they may not see all the  
> values currently in the collection nor all the values in the collection  
> when the iterator was generated. So code that relies on such properties  
> would be logically invalid.
>
> I'd probably define hard ops as being 1) and soft ops at level 2. 4) is  
> really only possible with immutable containers.

Hard ops definitely qualify as #1, since we are in a GC'd language.

I don't really understand 2, "the range will continue to perform the same  
logical function," what does that mean?  Define that function.  I would  
define it as #3, so obviously you think it's something else.

#3 would be most useful for soft operations if it could be guaranteed.  I  
don't think it can.

>>> Wouldn't re-hashing necessitate re-allocation? (Thus the range would  
>>> see a
>>> stale view)
>>
>> God no.  If my hash collision solution is linked-list based (which it  
>> is in dcollections), why should I reallocate all those nodes?  I just  
>> rearrange them in a new bucket array.
>
> Sorry, I was assuming that if you were going to implement a hash  
> collection you wouldn't be using a linked list approach, since that's  
> what D's associative arrays already do. The are some really good reasons  
> to not use a list based hash in D due to GC false pointer issues, but  
> basically none to re-implementing (poorly?) D's built-in data structure.

Hm... my hash outperforms builtin AAs by a wide margin.  But this is not  
technically because my implementation is better, it's because AA's use  
"dumb" allocation methods.  I don't know about false pointers, the hash  
nodes in my implementation only contain pointers, so I'm not sure there is  
any possibility for false ones.

What my hash implementation *does* do that AA's don't is to provide a  
better interface -- removal while iterating, storing cursors for later  
reference and removal (i.e. avoid double lookups).

Also, technically AAs are implemented with a tree, not a LL, so no opCmp  
is required for my hash.

>>> The difference is that algorithms can document in their template
>>> constraints that they need a container with 'soft' properties.
>>
>> What is the advantage?  Why would an algorithm require soft functions?   
>> What is an example of such an algorithm?
>
> Something that uses toUpperCase or toLowerCase, for example.

I guess I won't get a real example.  I'm not sure it matters.  When Andrei  
starts implementing the soft methods, either they will be a huge win or  
obviously useless.  If I were a betting man, I'd bet on the latter, but  
I'm not really good at betting, and Andrei's ideas are usually good :)

>>> I wasn't thinking of multi-threaded containers. I was trying to point  
>>> out
>>> that version ids have failed in lock-free containers, where things are
>>> happening on the order of a single atomic op or a context switch. Given
>>> the time a range could go unused in standard code, versioning won't  
>>> work.
>>
>> Are you trying to say that if you don't use your range for exactly 2^32  
>> mutations, it could mistakenly think the range is still valid?  That's  
>> a valid, but very very weak point.
>
> Umm, no. That is a valid point that happens in production code to  
> disastrous effects. Worse, it doesn't happen often and there's no good  
> unit test for it. I for one, never want to debug a program that only  
> glitches after days of intensive use in a live environment with real  
> customer data. Integer overflow bugs like this are actually one of the  
> few bugs that have ever killed anyone.

I think you are overestimating the life of ranges on a container.  They  
will not survive that many mutations, and the worst that happens is the  
range continues on fully allocated memory (i.e. your #1 above).

You can also force the container to invalidate itself once the first wrap  
occurs.  This at least will be a hard error.

>>> No it's not. version tags + integer overflow = bug. Doug Lea knew about
>>> the problem and but thought it would never happen in real code. And  
>>> Bill
>>> Gates thought no one will need more than 640k of ram. They both have  
>>> been
>>> proven wrong.
>>
>> Overflow != bug.  Wrapping completely to the same value == bug, but is  
>> so unlikely, it's worth the possibility.
>
> Statistics 101: do a test enough times and even the highly improbable  
> will happen.

In statistics we generally ignore the outliers.  I normally am on your  
side in these types of cases, but we are talking about a statistical  
impossibility -- nobody will leave a range untouched for exactly 2^32  
mutations, it simply won't happen.  Your computer will probably be  
obsolete before it does.

BTW, I'm not advocating adding mutation counters, I don't plan on adding  
them.  It's just another alternative to "soft" mutations.

-Steve