2.011 invariant question

Sun Feb 24 16:02:39 PST 2008

Janice Caron wrote:
> On 23/02/2008, Sönke Ludwig <ludwig at informatik_dot_uni-luebeck.de> wrote:
>> I think that if it's fixed, it has to be done transparently or it my be
>>  difficult to explain why s = s ~ "x"; is allowed but s ~= "x"; is not.
> 
> You know, the more I think about it, the more I come to the conclusion
> that you're right, and that
> 
>     x ~= y;
> 
> should behave /exactly/ like
> 
>     x = x ~ y;
> 
> and cause a reallocation every time. Incremental string building is
> the usual reason given why this isn't the case, but I think there's
> another way round that problem.
> 
> Think about what happens at runtime when x ~= y; is executed. First,
> the compiler has to decide whether or not x is resizable (that is,
> does it point to the start of a resizable block?). Well - how does it
> know that? Answer: it has to ask the garbage collector. OK, so how
> does the garbage collector know that? Well, presumably, the gc has to
> have some sort of map or table, so it can look up the address x, and
> if it's in the table, then the answer is yes. But that lookup
> operation has to be slower that O(1). No matter how you look at it,
> there has to be at least O(log N) CPU time being consumed there, where
> N is the number of gc allocated blocks.

A hashtable could get you an average O(1) for that lookup (the worst 
case could still be O(N) or O(log N) though, depending on the 
implementation). You wouldn't need to enter every allocation in there, 
by the way. If GC pools are always allocated on some aligned address, 
the start of one can be found by bitwise-anding out the lower bits of 
the pointer, and that address could be checked in the hashtable. "is 
this the start of an allocation" then becomes a hashtable lookup plus a 
check "(cast(size_t)(p - pool_start) % pool_objsize == 0"[1]. (The 0 may 
need to be adjusted to allocation_header.sizeof)

[1]: For pool_objsize == 2^N for some N no actual division needs to be 
performed; "& ((1 << N) - 1)" has the same effect.

On another note, how about this for a rule on concatenating to invariant 
arrays:
For each heap-allocated array, put both the reserved size and *the 
maximum used size* in the header of the allocation. Whenever arr.ptr + 
arr.length is equal to allocation.start + allocation.max_size_ever, it's 
safe to append to it in-place as long as you increase the maximum size 
accordingly.
(This'll need some synchronization to be thread-safe, but when there are 
multiple threads allocations are currently already synchronized anyway)

Note that the rule above allows appending to trailing slices; this is 
unsafe for mutable arrays since they traditionally reallocated in this 
instance and programs may rely on them not sharing data with the 
original. However, this is safe for invariant arrays since the shared 
prefix can't change -- the only way to detect sharing would be pointer 
comparisons which are not required to provide meaningful information 
anyway (IIRC the spec allows "moving" garbage collectors, so pointer 
comparisons aren't reliable anyway since you can't count on your data 
staying where it is).

This rule would allow incremental string building to be fairly efficient 
even for invariant arrays; at worst, the initial append will trigger a 
reallocation before an allocated block of memory runs out. The rest will 
only reallocate when it's actually needed just like with mutable-string 
building. (At least, as long as you don't slice away elements from the end)