BitArray/BitFields - Resumed and polishing

Wed Jan 2 23:57:45 PST 2013

1/3/2013 6:05 AM, Era Scarecrow пишет:
> On Wednesday, 2 January 2013 at 21:00:38 UTC, Dmitry Olshansky wrote:
>> 12/31/2012 9:35 PM, Era Scarecrow пишет:
>
>> Personally I believe that if we introduce a slice of a BitArray as a
>> separate range type that represents a view of array (but can't be
>> appended to, etc. it's just a RA range with slicing).
>
>   Could do that I suppose, but then either exact length or
> append/alignment optimizations may go out the window if we remove some
> of those as options. To support those making slices part of the main
> type was easy enough. But the other issues still have to be addressed
> (that's mentioned below)
>
>>
>>>  const BitArray x = BitArray(32);
>>>  func(x);
>>>
>>
>> Then here x is passed either by ref or a copy (depending on func).
>
>   yeah... assuming func returned a struct of...
>
>   struct S {
>     BitArray x;
>     //other stuff
>   }
>
>   const(BitArray) func(const BitArray s);
>
>   in the case it's small string optimization and the original string
> gets allocated OR gets passed outside the original function that made
> the BitArray. So small string if possible, convert to ref and slice it,
> or dup it. not 100% ref in all cases but pretty close.
>
>>>  Would the function get passed the original compact buffer? Copy of x
>>> or a dup?  May not be so much an issue. However...
>>>
>>>  BitArray x = BitArray(32);
>>>  const y = x[]; //slice
>>>
>> Then y has type BitArraySlice and references x.
>
>   Agreed, although I'm still iffy about a separate slice as a range.
> Could do the range just to give it limitations, but seems like a
> headache to do less and extra code to manage.
>
>   Current code I'm writing converts (if possible) the BitArray to a
> allocated type, then passes that slice to y.
>
>>>  Now if y points to the actual compact buffer and we do the following...
>>>
>>>  x.length = 256; //allocates if it was compact
>>>
>>>  Although y will have const access the data instead of a compact
>>> array it's a pointer to an array and broken.
>>
>> Then IMO that should be a slice invalidation (akin to c++ iterator
>> invalidation). Depending on how slice is implemented it could avoid
>> being invalidated in this case but in some others it surely will (e.g.
>> shrinking).
>
>   Maybe, with the small string optimization (no allocation) it isn't so
> much an easy option. That's what this is about. Say it uses the compact
> (fixed string) and in opSlice...
>
> BitArray {
>    bool isCompact;
>    bool canExpand; //represents owner for slices or compact.
>    union {
>      size_t[] normal;
>      size_t[2] compact;
>    }
>    BitArray opSlice() {
>      BitArray ba = this;
>      ba.canExpand = false;
>      if (isCompact) {
>        ba.isCompact = false;
>        ba.normal = cast(size_t[]) this.compact[0 .. $];
>      }
>
>      return ba;
>    }
> }
>
>   This is where the allocation could cause a problem as the slice could
> be referring to stack memory at which all the data could be duds. Worse
> if you overwrite that data suddenly the allocated memory could point to
> something else! I think it's safe to say that's a bad idea.
>

Hm, I'd think that having Slice type be:

BitArraySlice{
BitArray* bp;
size_t start, end;
// all else forwards to the pointed array
}
should work avoiding the most of code duplication. With any luck inliner 
will expand all of this wrapping. To enable bulk mode operations:

void opOpSliceAssign(BitArraySlice rhs)
{//has all the data to do it
    bulkOp(bp, start, end, rhs.bp, rhs.start, rhs.end);
}

then it's a question of bit of refactoring & tweaking of array bulks ops.

Then the only problem left is slice invalidation and original array 
going out of scope.

>>> If during the x call it
>>
>> You mean x[] call i.e. opSlice() call?
>
>   well both this(this) and opSlice() would be called at different times
> so they would have different meanings (copying vs slicing).
>
>> Seems like an overkill and semantically bad as I totally expect that
>> the following to be true for any Array-like container:
>> auto cont = ...;
>> auto range = cont[];
>> range[0] = 1;
>> assert(cont[0] == 1);
>>
>> Also:
>> foreach(v; x) --> foreach(v; x[]) if x defines opSlice
>>
>> however since BitArray defines opApply it's called instead so this one
>> is probably OK.
>
>   Does that mean we might be able to drop opApply? Hmmm... Depends on if
> we have enough implemented so the range accepts it. I'll glance over
> std.range.

Yes, it's got to be a range to work like this and then opApply is useless.
>
>>> reallocated then y would work fine no matter what in that case. If x
>>> can't reallocate it, then the issue remains but is much smaller than
>>> before, most likely a dup would be safest.
>>>
>>>  Thoughts? Questions? Ideas?
>>
>> The link to the latest code would be cool ;)
>
>
> https://github.com/rtcvb32/phobos/blob/BitArray-Updates/std/bitmanip.d
>
>   There's a merge issue due to having tried to keep the bitfields and
> bitarray separated; That seems to have made more problems than
> solutions. Working on it but may have to wait until I've read my Git
> book when it arrives before I understand this all.
>
>> I'm thinking that the main problem is trying to mimic built-in arrays.
>
>   Not really, it's trying to do small string optimization. I suppose if
> this was done as a class instead most of these problems could/would go
> away.
>
>> They are not ordinary containers in that they easily combine slice and
>> container that is (in fact) partially hidden in run-time & GC
>> implementation.
>>
>> The fact that combination of language features and run-time support
>> makes it easy shouldn't encourage other to go this path in the library.
>>
>> For one thing this way a small string optimization is not achievable
>> because of the many slices that have to reference the data packed in
>> _small_ chunk. Then whenever original array changes (and becomes not
>> small) any other slice to it have to somehow reflect this change. And
>> it's hard if slices internally are exactly the same as original (if
>> the same type).
>
>   Really depends on the use case. I'm guessing there's enough call for
> the 64bit small packed bitarray that justifies it, otherwise it would be
> better to throw it out. As you mention later, separating them does seem
> like a better idea.
>

I've meant the fact that it has to preserve current std lib behavior 
*and* has small string optimization that makes it overly painful and 
partially defeats the optimization (by requiring pointless compact->big 
conversions on slice etc.).

>> ...
>>
>> Now that I tested the current behavior that we shouldn't break a lot.
>> Clearly current BitArray is old and rusty D1 design.
>>
>>     BitArray a, b, c;
>>     //the interesting thing is that BitArray(32) doesn't construct
>> BitArray with length 32 but some explosive garbage !!
>>
>>     c.length = 512;
>>     a.length = 32;
>>     assert(a.length == 32);
>>     b = a;
>>     b[0] = true;
>> //this one is bad as it means both a & b has to reference the same data
>>     assert(a[0] == 1);
>>     b ~= c;
>>     b[1] = 1;
>>     //the next fails for me (it also may not fail depending on c's
>> length)
>>     assert(a[1] == 1);
>
>   My code tests that the last two asserts fail, course, a being under
> 64bits remains with small string optimization. Making it slice... seems
> I'll add this to my list of tests and get it fixed. Even if i reserved
> the space it still failed the last one. Mmm...

Right, since it's _a_ that's tested and in current design it shares data 
with b unless that got relocated (b ~= c can relocate b).
>
>> Currently I think that D container design have to be extended to
>> include 2 kinds of containers:
>>
>>  -small containers, values type a-la C++ STL these all use small
>> container optimization (as they can cleanly)
>>  -big heavy-duty ones, these have reference semantics and tuned for
>> large sets of values
>>
>> Then the proposed way out is to make 2 containers:
>> - current BitArray to always have reference semantics (both asserts in
>> the above always pass);
>> - and BitString or BitVector to be small value semantic array that is
>> always copied/duped on assign, these should be passed by ref or by
>> range/slice.
>>
>> That's was lengthy but what do you think? And it seems like we'd need
>> to bring this last topic to general D NG.
>
>   hmmm.. Either would need some code duplication, or template to turn
> off/swap small string optimization. But separating them does seem like a
> good idea.

I'd expect a fair chunk of code to be split off into free functions, 
then both containers would make use of them.

Now note that there is also std.container Array!bool that has reference 
semantics and got to be doing bit packing... it could potentially take 
place of big BitArray mentioned.
-- 
Dmitry Olshansky