std.bitarray examples

Me Here p9e883002 at sneakemail.com
Sat May 10 08:31:34 PDT 2008


Janice Caron wrote:

> You might want to consider moving up. :-)

Hm. And so we come full circle. I started out with D2, but the performance
impact of using invariant strings (for my application) is just too costly.

See http://genome.ucsc.edu/FAQ/FAQformat.html#format7

Essentially, once the compacted (2-bit) format has been expanded back to the
(huge) strings of ACGTs, the nBlocks (count/starts/sizes) and maskBlocks
(count/starts/sizes) describe ranges of those huge strings that need to be
changed to 'N's (nBlocks) or be lower-cased (maskBlocks: A->a, C->c etc.).
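
As a minimal sketch of the two substitutions just described, assuming the
sequence has already been expanded into one mutable char[] buffer: the
function names fillNBlocks and lowerCaseMaskBlocks are hypothetical, and the
starts/sizes parameters are assumed to hold the values read from the file's
nBlocks and maskBlocks sections. Both passes overwrite bytes where they
stand, so no intermediate strings are created.

```d
// Hypothetical helpers, not taken from the post or from Phobos.

// Overwrite each [start, start + size) range with 'N', in place.
void fillNBlocks(char[] seq, const(uint)[] starts, const(uint)[] sizes)
{
    foreach (i, start; starts)
        seq[start .. start + sizes[i]] = 'N';   // slice assignment, no copy
}

// Lower-case each [start, start + size) range, in place.
void lowerCaseMaskBlocks(char[] seq, const(uint)[] starts, const(uint)[] sizes)
{
    foreach (i, start; starts)
        foreach (ref c; seq[start .. start + sizes[i]])
            if (c >= 'A' && c <= 'Z')
                c += 'a' - 'A';                 // 1-to-1 substitution
}
```

Because every change is a 1-to-1 byte substitution, the buffer length never
changes and the GC has nothing new to allocate or collect.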

Breaking these huge strings up into pieces to do these transformations, and then
sticking all the pieces back together, when they are *all* 1 to 1
substitutions, is just ludicrous. Generating all those itty-bitty
invariant char[]s from one huge invariant char[], sticking them all back
together to form another huge invariant char[], and throwing away all the
intermediates causes the GC to go into fits.

When the (compressed) input is 900+MB and the expanded output is > 3GB, the
performance impact is considerable and unacceptable.

Even once I get around to multi-threading this, the use of invariant char[]s
will still be no advantage because I want (*need*) the processing of the
strings to operate /in-place/.

I (the programmer) need to control what gets copied when. And to control
concurrent access by sharing ranges of the data (without copying), for
modification /in-place/.
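
A rough sketch of sharing ranges without copying, under two assumptions that
are not in the post: a present-day D compiler whose standard library provides
std.parallelism, and a hypothetical function name lowerCaseChunks. Each chunk
is a slice, that is, a view into the one buffer, and the chunks do not
overlap, so each worker modifies only its own range of the shared data.

```d
import std.parallelism : parallel;

// Hypothetical helper: lower-case the whole buffer in place, chunk by chunk.
void lowerCaseChunks(char[] seq, size_t chunkLen)
{
    assert(chunkLen > 0);

    // Build non-overlapping views into seq; no sequence bytes are copied.
    char[][] chunks;
    for (size_t i = 0; i < seq.length; i += chunkLen)
    {
        size_t end = i + chunkLen < seq.length ? i + chunkLen : seq.length;
        chunks ~= seq[i .. end];
    }

    // Workers run concurrently; each one touches only its own slice.
    foreach (chunk; parallel(chunks))
        foreach (ref c; chunk)
            if (c >= 'A' && c <= 'Z')
                c += 'a' - 'A';
}
```

The only allocation is the small array of slice descriptors; the sequence
data itself stays where it is.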

Attempting to isolate the programmer (me) from the concerns of multi-threading
by duplicating data over and over isn't an option given the volumes of data.

(Personally I think it is a waste of time and effort anyway. Better to educate
the programmer than to try and nanny his use of threads. The effort expended on
this would be far better spent on other things--like fixing the GC. But that's
not my call.)

Cheers, b.
-- 




More information about the Digitalmars-d mailing list