std.compress

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Jun 5 10:36:06 PDT 2013


On 6/5/13 1:10 PM, Jonathan M Davis wrote:
> On Wednesday, June 05, 2013 07:49:22 H. S. Teoh wrote:
>> (Actually, std.algorithm currently sits at 11636 lines. I call BS on
>> whoever claims to be able to "skim over" std.algorithm and "get a feel
>> for how it works". Chances are your finger will get so tired of hitting
>> PgDn about 2000 lines into the file that you won't even look at the
>> rest. And most of the functions are only superficially related to one
>> another -- about 20 functions into the file you'd have lost track of
>> all sense of how things fit together -- 'cos they *don't* really fit
>> together!
>> It's the epitome of why we *should* move to smaller modules, rather than
>> the current giant monolithic ones.)
>
> There's a significant difference between a 10,000+ line file and a file that's
> less than 1000 lines. Walter's proposed std.compress is less than 1000 lines
> long, and yet we're talking about splitting it up. That's going to lead to
> ridiculously small modules IMHO. There's a balance to be had here that's
> somewhere in between the two extremes.

Walter's algo is among the simplest around. I expect any of the 
industry-standard algos to be significantly larger, most likely each 
worth a module of its own.

I think it's worth discussing the interface much more than the approach 
to modularization.

Walter's algo traffics in InputRange!ubyte and offers an 
InputRange!ubyte. That makes sense in some situations, but moving one 
ubyte at a time is often not only inefficient but also the wrong 
granularity. Consider:

auto data = File("input.txt").byChunk(4096).compress();

That won't work because byChunk deals in ubyte[], not ubyte. How do we 
fix this while keeping everybody efficient?

I talked to Walter; during this work he figured out a lot about how 
ranges work and how they generate code. It turns out that the range 
equivalent of a tight loop is slightly less efficient with dmd, because 
a range must keep its state together in a struct, which is harder to 
enregister than a handful of automatic variables.
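
For illustration (the function names here are mine, not Walter's), 
here's the shape of the contrast:

import std.range;

// Hand-written tight loop: the accumulator and loop variable are
// plain automatic variables, easy for the backend to enregister.
uint sumLoop(const(ubyte)[] data) {
    uint s = 0;
    foreach (b; data) s += b;
    return s;
}

// Range-based equivalent: the iteration state lives inside the
// range struct, which dmd is less able to keep in registers.
uint sumRange(R)(R r) if (isInputRange!R) {
    uint s = 0;
    for (; !r.empty; r.popFront()) s += r.front;
    return s;
}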

Right now we have joiner(), which, given several ranges of T, offers a 
range of T. Developing that idea further, we should have two opposite 
functions: itemize() and collect().

itemize() takes a range of ranges of T and offers a range of T. For 
example, given a range of T[], it offers a range of T.
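
A sketch of what itemize() could look like; note that joiner() called 
without a separator already does this flattening, so the sketch just 
forwards to it:

import std.algorithm, std.range;

auto itemize(RoR)(RoR ror)
    if (isInputRange!RoR && isInputRange!(ElementType!RoR))
{
    return joiner(ror);
}

unittest {
    ubyte[][] chunks = [[1, 2], [3], [4, 5]];
    assert(equal(chunks.itemize(), [1, 2, 3, 4, 5]));
}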

collect() takes a range of T and offers a range of T[]. The number of 
items in each chunk can be a parameter.
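
And a sketch of collect(). The eager copy-per-chunk buffering here is 
just an assumption to keep the example short; a real implementation 
would presumably reuse a buffer:

import std.range;

// Presents a range of T as a range of T[], each front holding up
// to chunkSize elements.
struct Collect(R) if (isInputRange!R) {
    private R source;
    private size_t chunkSize;
    private ElementType!R[] buf;

    this(R source, size_t chunkSize) {
        assert(chunkSize > 0);
        this.source = source;
        this.chunkSize = chunkSize;
        refill();
    }

    private void refill() {
        buf = null;
        while (buf.length < chunkSize && !source.empty) {
            buf ~= source.front;
            source.popFront();
        }
    }

    @property bool empty() const { return buf.length == 0; }
    @property ElementType!R[] front() { return buf; }
    void popFront() { refill(); }
}

auto collect(R)(R source, size_t chunkSize) if (isInputRange!R) {
    return Collect!R(source, chunkSize);
}

unittest {
    import std.algorithm : equal;
    assert(equal([1, 2, 3, 4, 5].collect(2), [[1, 2], [3, 4], [5]]));
}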

With those in tow, we can set things up such that compress() and 
expand() traffic in ranges of ranges of ubyte (or simply ranges of 
ubyte[]), which ensures they work at maximum speed. Then adapting to 
and from ranges of ubyte is a simple matter of chaining a call to 
itemize() or collect().
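
To illustrate with the sketches above: collect() and itemize() 
round-trip losslessly, which is exactly what lets a ubyte-at-a-time 
caller sit on top of the fast chunked interface:

unittest {
    import std.algorithm : equal;
    ubyte[] data = [10, 20, 30, 40, 50];
    // Group into ubyte[] chunks, then flatten back to ubytes.
    assert(equal(data.collect(3).itemize(), data));
}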

Destroy.

I salute this code. It is concise, well engineered, well written, just 
as general as it needs to be, uses the right features in the right 
places, and does real work. A perfect example to follow. The only thing 
I'd add is this convenience function:

// Convenience overload: forwards to the existing putBit primitive.
void putBits(bool[] bits...) {
     foreach (b; bits) putBit(b);
}

That reduces a bunch of code, and leaves room for future optimizations 
inside putBits.
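
For example (the flag names here are made up), three calls collapse 
into one:

// Before: one call per bit.
putBit(isMatch);
putBit(hasExtra);
putBit(last);

// After: a single call, and putBits is free to batch the
// underlying bit-level output later.
putBits(isMatch, hasExtra, last);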


Andrei

