"Rolling Hash computation" or "Content Defined Chunking"

Sat May 6 00:21:51 PDT 2017

Am Mon, 01 May 2017 21:01:43 +0000
schrieb notna <notna.remove.this at ist-einmalig.de>:

> Hi Dlander's.
> 
> Found some interesting reads ([1] [2] [3]) about the $SUBJECT and 
> wonder if there is anything available in the Dland?!
> 
> If yes, pls. share.
> If not, how could it be done (D'ish)
> 
> [1] - 
> https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
>      - 
> https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
> 
> [2] - 
> https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
> 
> [3] - http://www.infoarena.ro/blog/rolling-hash
> 
> Thanks & regards

Interesting concept. I'm not aware of any D implementation but it
shouldn't be difficult to implement this in D:
https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial

There's a BSD licensed haskell implementation, so a BSD licensed port
would be very easy to implement:
https://hackage.haskell.org/package/optimal-blocks-0.1.0
https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html

To make an implementation D'ish it could integrate with either
std.digest or process input ranges. If you want to use it exclusively
for chunking your code can be more efficient (process InputRange until
a boundary condition is met). When using input ranges, prefer some kind
of buffered approach, Range!ubyte[] instead of Range!ubyte for better
performance.

If you really want the rolling hash value for each byte in a sequence
this will be less efficient as you'll have to enter data byte-by-byte.
In this case it's extremely important for performance that your
function can be inlined, so use templates:

ubyte[] data;
foreach(b; data)
{
    // This needs to be inlined for performance reasons
    rollinghash.put(b);
}

-- Johannes