Container insertion and removal

Robert Jacques sandford at jhu.edu
Sun Mar 7 09:43:09 PST 2010


On Sun, 07 Mar 2010 08:23:03 -0500, Steven Schveighoffer  
<schveiguy at yahoo.com> wrote:

> Robert Jacques Wrote:
>
>> On Sat, 06 Mar 2010 21:54:50 -0500, Steven Schveighoffer
>> <schveiguy at yahoo.com> wrote:
>> > On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford at jhu.edu>
>> > wrote:
>> >
>> >> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer
>> >> <schveiguy at yahoo.com> wrote:
>> >>>
>> >>> How can softRemove not affect iterating ranges?  What if the range  
>> is
>> >>> positioned on the element removed?
>> >>
>> >> It always affects ranges in so far as the range and container are
>> >> inconsistent, but that is a problem of softInsert as well. Removing  
>> an
>> >> element from an array doesn't invalidate the underlying range, since
>> >> the memory is still there. And so long as you're not trying to use  
>> free
>> >> lists, linked-lists, trees, etc. can be written so that the ranges
>> >> never enter an invalid state. If they happen to be pointing at the
>> >> removed node, they just end up stopping early.
>> >
>> > If my linked list range has two node pointers as the implementation,  
>> and
>> > you remove the end node, it's going to end later, not early.
>>
>> A linked list maps to a forward range: so the range simply points to a
>> single internal node. If you're going to store a pointer to the end,  
>> then
>> you should be using a doubly linked list, not a singly linked list.
>
> How do you know when to stop?  A range has a beginning and an ending,  
> otherwise it's an iterator.  Whether you store it via a pointer to the  
> last (not-to-be-iterated) node, or via a length, or via a value to  
> compare with for stopping, you have to use something.  Or are you  
> asserting the only useful range for a linked list iterates the entire  
> list?
>
> Think of it as the equivalent of a slice of an array.

Please define for me an O(1) slice or index operation for a linked-list.  
The only way of doing this is to search the entire list in order,  
comparing for the search terms or counting the number of nodes passed. The  
range itself can do this just as easily as the host container, if you  
really want this functionality (I'd argue this isn't a valid list  
operation). For simple list[5..10] operations its definitely more  
efficient and then the range could even throw an exception when it reaches  
the end of the list before finding the end slice value (due to list  
topology manipulation). The efficiency of more complex ranges, like  
list["apple".."oranges"], is still much more efficient for the single  
range case. Even when you start to take many slices, cache effects can  
marginalize a lot of the cost.

>>
>> > Of course, you can get around this by just storing a current node and  
>> a
>> > length, but that may not be what the user is expecting.  For example,  
>> if
>> > you did a range of a sorted tree from "a" to "f" and it stops at "n",  
>> I
>> > think that would be extremely unexpected.
>>
>> By this logic, any and all forms of mutation, not just insertions and
>> deletions must not be allowed.
>
> Allowed, yes.  Not invalidating iterators, no.

If I change a node in a sorted tree from 'a' to 'n', the node moves,  
changing the tree topology. Any range currently at the node would continue  
to look for the 'f' node, and iterate the rest of the tree erroneously. By  
the way, having ranges detect if they reach their end nodes or not is  
fairly easy to do.

>>
>> > Stopping early is invalidation also IMO.  If your program logic  
>> depends
>> > on iterating over all the elements you originally intended to iterate,
>> > then we have big problems if they stop early.
>>
>> Stopping early is the result of a logical error by the programmer. The
>> code itself, however, is completely valid.
>
> I still fail to see the difference between "soft" operations and  
> non-soft.  What does soft guarantee?  Give me a concrete definition, an  
> example would help too.

There are a couple of possible definitions for soft operations: 1) the  
memory safety of the ranges of a collection are guaranteed. 2) That for  
the topology viewed by a range isn't logically changed. i.e. the range  
will continue to perform the same logical function if the topology its  
operating on is updated 3) That for the topology viewed by a range isn't  
actually changed and all elements selected at range creation will be  
viewed. 4) Like 3, but with all values being viewed.

For example, modifying an array in any way doesn't change 1, 2 or 3 for  
any of its slices.
For a linked list defining a forward range, mutation, insertion and  
removal can be done under 1 & 2.
The same can be said about doubly linked lists and bidirectional ranges.
For other containers, such as a sorted tree, mutation can break a 2/3  
though insertion and deletion don't break 2.
Although, the ranges will see many values, they may not see all the values  
currently in the collection nor all the values in the collection when the  
iterator was generated. So code that relies on such properties would be  
logically invalid.

I'd probably define hard ops as being 1) and soft ops at level 2. 4) is  
really only possible with immutable containers.

>> >>> The only two containers that would support softInsert would be  
>> linked
>> >>> list and sorted map/set.  Anything else might completely screw up  
>> the
>> >>> iteration.  I don't see a lot of "generic" use for it.
>> >>
>> >> There's all the containers based upon linked-lists, etc like hashes,
>> >> stacks, queues and dequeues.
>> >
>> > Hashes may be rehashed when inserting, completely invalidating a range
>> > (possibly the end point will be before the starting point).
>>
>> Wouldn't re-hashing necessitate re-allocation? (Thus the range would  
>> see a
>> stale view)
>
> God no.  If my hash collision solution is linked-list based (which it is  
> in dcollections), why should I reallocate all those nodes?  I just  
> rearrange them in a new bucket array.

Sorry, I was assuming that if you were going to implement a hash  
collection you wouldn't be using a linked list approach, since that's what  
D's associative arrays already do. The are some really good reasons to not  
use a list based hash in D due to GC false pointer issues, but basically  
none to re-implementing (poorly?) D's built-in data structure.

>>
>> > Yes, the others you mentioned will be valid.  But I still don't see it
>> > being more useful than just using documentation to indicate insertion
>> > will not invalidate ranges.  Perhaps I am wrong.
>>
>> The difference is that algorithms can document in their template
>> constraints that they need a container with 'soft' properties.
>
> What is the advantage?  Why would an algorithm require soft functions?   
> What is an example of such an algorithm?

Something that uses toUpperCase or toLowerCase, for example.

>>
>> >>> Another option is to use a "mutation" field that is checked every
>> >>> chance by the range.  If it changes, then the range is invalidated.
>> >>
>> >> The mutation field would have to be a version number to support
>> >> multiple ranges, and given experience with lock-free algorithms which
>> >> use a 'tag' in a similar manner, this concept is bug prone and should
>> >> not be relied upon. It would be better to 'lock' the node or  
>> container
>> >> to topology changes, though this does slow things down and has no
>> >> conflict resolution: removing a locked node would have to throw an
>> >> exception.
>> >
>> > I was not thinking of multithreaded applications.  I don't think it's
>> > worth making containers by default be multithreaded safe.
>>
>> I wasn't thinking of multi-threaded containers. I was trying to point  
>> out
>> that version ids have failed in lock-free containers, where things are
>> happening on the order of a single atomic op or a context switch. Given
>> the time a range could go unused in standard code, versioning won't  
>> work.
>
> Are you trying to say that if you don't use your range for exactly 2^32  
> mutations, it could mistakenly think the range is still valid?  That's a  
> valid, but very very weak point.

Umm, no. That is a valid point that happens in production code to  
disastrous effects. Worse, it doesn't happen often and there's no good  
unit test for it. I for one, never want to debug a program that only  
glitches after days of intensive use in a live environment with real  
customer data. Integer overflow bugs like this are actually one of the few  
bugs that have ever killed anyone.

>>
>> > The mutation index has been used in Tango forever, and I think was in
>> > Doug Lea's original container implementations.  I'm pretty sure it is
>> > sound in single-threaded uses.
>>
>> No it's not. version tags + integer overflow = bug. Doug Lea knew about
>> the problem and but thought it would never happen in real code. And Bill
>> Gates thought no one will need more than 640k of ram. They both have  
>> been
>> proven wrong.
>
> Overflow != bug.  Wrapping completely to the same value == bug, but is  
> so unlikely, it's worth the possibility.

Statistics 101: do a test enough times and even the highly improbable will  
happen.




More information about the Digitalmars-d mailing list