std.hash: More questions

Thu Jul 5 14:24:04 PDT 2012

On 04-Jul-12 18:58, Johannes Pfau wrote:
> Code:
> https://github.com/D-Programming-Language/phobos/pull/646
> Docs:
> http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html
> http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html
> http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html
> http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.html
>
> I just had another look at my initial std.hash design, and I realized
> that the API could be simplified a little:
>
> There's a reset function that's implemented in every hash. For sha1,
> md5, crc32 it only forwards to the start function though. So I'm not
> sure how useful this function is or if it should be dropped.
>
> Advantages of keeping it:
> * 'reset' better documents what's done than 'start' if the hash has
>    already processed data
> * Are there hashes which can implement a reset function in a faster way
>    than calling start again?
>
> Cons:
> * Adds an additional function which probably isn't necessary
>
>
> The start function is probably not needed as well. Tango doesn't have a
> start function or something similar, but it could use constructors for
> this (I only looked at docs, not code).

The only thing I can think of that would require a start function is
using unconventional initial vectors.
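
To make that concrete, here is a minimal sketch of a hash whose start() takes an optional initial value. Crc32Sketch and its member names are made up for illustration; this is not the API from the pull request, just a plain bitwise CRC-32.

```d
import std.stdio : writefln;

// Hypothetical toy type, not part of std.hash.
struct Crc32Sketch
{
    private uint state = 0xFFFF_FFFF; // conventional CRC-32 initial value

    // Only needed when the caller wants a non-default initial value.
    void start(uint initial = 0xFFFF_FFFF)
    {
        state = initial;
    }

    void put(const(ubyte)[] data)
    {
        foreach (b; data)
        {
            state ^= b;
            // Reflected CRC-32, one bit at a time (slow but short).
            foreach (_; 0 .. 8)
                state = (state & 1) ? (state >> 1) ^ 0xEDB8_8320 : state >> 1;
        }
    }

    // Returns the checksum and puts the state back to its default.
    uint finish()
    {
        uint result = ~state;
        state = 0xFFFF_FFFF;
        return result;
    }
}

void main()
{
    Crc32Sketch crc;
    // No start() call needed for the conventional case.
    crc.put(cast(const(ubyte)[]) "123456789");
    writefln("%#x", crc.finish()); // 0xcbf43926, the standard check value

    // Unconventional initial value: the one case where start() earns its keep.
    crc.start(0);
    crc.put(cast(const(ubyte)[]) "123456789");
    writefln("%#x", crc.finish());
}
```

With the default in the field initializer, plain usage never has to touch start() at all.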

> We can't use constructors, so a
> start function would be necessary for advanced initialization. But do
> we actually need that advanced initialization? SHA1, MD5 and CRC32 just
> do a "this = typeof(this).init" so a start function isn't necessary
> here.
>
> Advantages of keeping it:
> * Are there hash algorithms which need some sort of complex
>    initialization which can't be done with .init / default values?
> * If we drop both start and reset the only way to reset the internal
>    state is calling finish. This might be a little less efficient than a
>    start/reset method.
>
> Advantages of dropping it:
> * Using hashes is easier, no need to call 'start' before hashing data
>
> I think someone more familiar with hash functions than me needs to
> answer the "do we need start/reset functions" questions.
>
>
> API question:
>
> CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit the
> rest of the API ubyte[4] is used. Now there's a small annoying detail:
> The CRC32 should be printed in LSB-first order.
You probably meant MSB first.

> When printing an uint like this, that works well:
> writefln("%#x", 4157704578); //0xf7d18982
> but this doesn't:
> toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7

There is no problem; it's just the order of printing that is at fault.
So I suggest you *stop* doing the bswap.

It's just that printing something as an array of ubytes goes from the
least significant byte to the most significant (on a little-endian
machine). You could try adding MSB/LSB-first options to toHexString.
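
A rough sketch of what such an option could look like. The Order enum and hexOf are hypothetical names of mine, not the toHexString from the pull request, and the sketch assumes the byte array holds the least significant byte first.

```d
import std.bitmanip : nativeToLittleEndian;
import std.format : format;
import std.range : retro;
import std.stdio : writeln;

// Hypothetical option; assumes bytes[0] is the least significant byte.
enum Order { lsbFirst, msbFirst }

// Hypothetical stand-in for a toHexString overload with an order option.
string hexOf(const(ubyte)[] bytes, Order order = Order.lsbFirst)
{
    return order == Order.lsbFirst
        ? format("%(%02X%)", bytes)        // print in storage order
        : format("%(%02X%)", bytes.retro); // print last byte first
}

void main()
{
    uint crc = 0xF7D1_8982; // same value as 4157704578 above
    // Fix the byte layout explicitly so the result does not depend on the host.
    ubyte[4] raw = nativeToLittleEndian(crc);

    writeln(hexOf(raw[]));                 // 8289D1F7
    writeln(hexOf(raw[], Order.msbFirst)); // F7D18982
}
```

This way finish() can hand out the bytes untouched and only the printing decides the order.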

>
> I can't change toHexString as it's used for all hashes and it's correct
> for SHA1, MD5, ...
> So I currently use bswap in the CRC32 finish() implementation to fix
> this issue.
>
No-no-no, see above ;)

> Now the question is should I provide an additional finishUint function
> which avoids the bswap?
>
>
> Implementation issue:
>
> The current implementation of SHA1 and MD5 uses memcpy which doesn't
> work in CTFE IIRC and which also prevents the code from being pure.
> I could replace those memcpy calls with array copying but I'm not
> sure if memcpy was used for performance, so I'd like to keep it as long
> as we have no performance tests.
>
Replace memcpy with array ops:
ptr1[x..y] = ptr2[x2..y2];
Note that it's better to have them be pointers, as that avoids bounds
checks & D runtime magic.

If need be I can provide benchmarks, but I'm certain from the days of
optimizing std.regex that it's faster than or on par with memcpy.
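
For reference, a small self-contained sketch of the two variants side by side. copyWithSlices is a made-up helper name, and the buffers are toy data, not the SHA1/MD5 internals.

```d
import core.stdc.string : memcpy;

// Hypothetical helper: copies n bytes using pointer slices instead of memcpy.
// Slicing a raw pointer is not bounds checked; the ranges must not overlap.
void copyWithSlices(ubyte* dst, const(ubyte)* src, size_t n) pure nothrow
{
    dst[0 .. n] = src[0 .. n];
}

void main()
{
    ubyte[8] src = [1, 2, 3, 4, 5, 6, 7, 8];
    ubyte[8] viaMemcpy;
    ubyte[8] viaSlices;

    // Copy src[4 .. 8] into offset 2 of each destination buffer.
    memcpy(viaMemcpy.ptr + 2, src.ptr + 4, 4);
    copyWithSlices(viaSlices.ptr + 2, src.ptr + 4, 4);

    ubyte[8] expected = [0, 0, 5, 6, 7, 8, 0, 0];
    assert(viaMemcpy == expected);
    assert(viaSlices == expected);
}
```

The slice version needs no C import and can be marked pure, which is what the memcpy call was blocking.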

-- 
Dmitry Olshansky