"Rolling Hash computation" or "Content Defined Chunking"

notna via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue May 9 11:17:45 PDT 2017


On Saturday, 6 May 2017 at 07:21:51 UTC, Johannes Pfau wrote:
> Am Mon, 01 May 2017 21:01:43 +0000
> schrieb notna <notna.remove.this at ist-einmalig.de>:
>
>> Hi Dlander's.
>> 
>> Found some interesting reads ([1] [2] [3]) about the $SUBJECT 
>> and wonder if there is anything available in the Dland?!
>> 
>> If yes, pls. share.
>> If not, how could it be done (D'ish)
>> 
>> [1] -
>> https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
>>      -
>> https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
>> 
>> [2] - 
>> https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
>> 
>> [3] - http://www.infoarena.ro/blog/rolling-hash
>> 
>> Thanks & regards
>
> Interesting concept. I'm not aware of any D implementation but 
> it shouldn't be difficult to implement this in D: 
> https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial
>
> There's a BSD licensed haskell implementation, so a BSD 
> licensed port
> would be very easy to implement:
> https://hackage.haskell.org/package/optimal-blocks-0.1.0
> https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html
>
> To make an implementation D'ish it could integrate with either 
> std.digest or process input ranges. If you want to use it 
> exclusively for chunking your code can be more efficient 
> (process InputRange until a boundary condition is met). When 
> using input ranges, prefer some kind of buffered approach, 
> Range!ubyte[] instead of Range!ubyte for better performance.
>
> If you really want the rolling hash value for each byte in a 
> sequence this will be less efficient as you'll have to enter 
> data byte-by-byte. In this case it's extremely important for 
> performance that your function can be inlined, so use templates:
>
> ubyte[] data;
> foreach(b; data)
> {
>     // This needs to be inlined for performance reasons
>     rollinghash.put(b);
> }
>
> -- Johannes

Thanks for the feedback, Johannes.

I hoped there may already be something in Mir or Weka.io or 
somewhere else... Will read the Golang, C and C++ source and see 
if my Dlang is good enough for ranges and the like magic...



More information about the Digitalmars-d-learn mailing list