std.hash: More questions
dmitry.olsh at gmail.com
Thu Jul 5 14:24:04 PDT 2012
On 04-Jul-12 18:58, Johannes Pfau wrote:
> I just had another look at my initial std.hash design, and I realized
> that the API could be simplified a little:
> There's a reset function that's implemented in every hash. For sha1,
> md5, crc32 it only forwards to the start function though. So I'm not
> sure how useful this function is or if it should be dropped.
> Advantages of keeping it:
> * 'reset' better documents what's done than 'start' if the hash has
> already processed data
> * Are there hashes which can implement a reset function in a faster way
> than calling start again?
> * Adds an additional function which probably isn't necessary
> The start function is probably not needed as well. Tango doesn't have a
> start function or something similar, but it could use constructors for
> this (I only looked at docs, not code).
The only thing I can think of that would require a start function is
using unconventional initial vectors.
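To illustrate the point, here is a minimal sketch of a hash whose default state comes from .init, plus a start overload that takes a custom initial value. ToyHash and its members are hypothetical names, not the proposed std.hash API, and FNV-1a is used only because it is short.

```d
// Hypothetical toy hash for illustration only; not the std.hash API.
struct ToyHash
{
    // Default state: plain .init already yields a usable hash object.
    private uint state = 0x811C9DC5; // FNV-1a 32-bit offset basis

    // Restart with the default initial value.
    void start() { this = typeof(this).init; }

    // Restart with a caller-supplied initial value.
    // This is the one case that .init alone cannot express.
    void start(uint initialValue) { state = initialValue; }

    // Feed data into the hash (FNV-1a: xor the byte, then multiply).
    void put(const(ubyte)[] data)
    {
        foreach (b; data)
        {
            state ^= b;
            state *= 0x01000193; // FNV-1a 32-bit prime, wraps modulo 2^32
        }
    }

    // Return the result and leave the object ready for reuse.
    uint finish()
    {
        auto result = state;
        start();
        return result;
    }
}
```

With this shape, a user who never needs a custom initial value can ignore start entirely, because finish already resets the state.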
> We can't use constructors, so a
> start function would be necessary for advanced initialization. But do
> we actually need that advanced initialization? SHA1, MD5 and CRC32 just
> do a "this = typeof(this).init" so a start function isn't necessary
> Advantages of keeping it:
> * Are there hash algorithms which need some sort of complex
> initialization which can't be done with .init / default values?
> * If we drop both start and reset the only way to reset the internal
> state is calling finish. This might be a little less efficient than a
> start/reset method.
> Advantages of dropping it:
> * Using hashes is easier, no need to call 'start' before hashing data
> I think someone more familiar with hash functions than me needs to
> answer the "do we need start/reset functions" questions.
> API question:
> CRC32 sums are usually presented as a uint, not a ubyte. To fit the
> rest of the API ubyte is used. Now there's a small annoying detail:
> The CRC32 should be printed in LSB-first order.
You probably meant MSB first.
> When printing an uint like this, that works well:
> writefln("%#x", 4157704578); //0xf7d18982
> but this doesn't:
> toHexString(*cast(ubyte*)&4157704578); //8289D1F7
There is no problem here; it's just the order of printing that is at
fault. So I suggest you *stop* doing a bswap.
Printing something as an array of ubytes goes in memory order, which on
a little-endian machine means from least significant byte to most
significant. You could try adding MSB/LSB-first options to toHexString.
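A rough sketch of what such an option could look like. hexString and its reversed parameter are hypothetical and only stand in for the suggested toHexString option; std.bitmanip.nativeToLittleEndian is used so the byte order does not depend on the host.

```d
import std.bitmanip : nativeToLittleEndian;
import std.format : format;

// Hypothetical helper: hex-encode a byte array in the order given,
// or back to front when `reversed` is true.
string hexString(const(ubyte)[] bytes, bool reversed = false)
{
    string result;
    foreach (i; 0 .. bytes.length)
    {
        immutable b = reversed ? bytes[$ - 1 - i] : bytes[i];
        result ~= format("%02X", b);
    }
    return result;
}

unittest
{
    // 4157704578 == 0xF7D18982, the CRC32 value quoted above,
    // stored least significant byte first.
    ubyte[4] crc = nativeToLittleEndian(0xF7D18982);

    assert(hexString(crc[]) == "8289D1F7");       // LSB first
    assert(hexString(crc[], true) == "F7D18982"); // MSB first
}
```

This way finish() can return the bytes untouched and only the formatting step decides the order.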
> I can't change toHexString as it's used for all hashes and it's correct
> for SHA1, MD5, ...
> So I currently use bswap in the CRC32 finish() implementation to fix
> this issue.
no-no-no see the above ;)
> Now the question is should I provide an additional finishUint function
> which avoids the bswap?
> Implementation issue:
> The current implementation of SHA1 and MD5 uses memcpy which doesn't
> work in CTFE IIRC and which also prevents the code from being pure.
> I could replace those memcpy calls with array copying but I'm not
> sure if memcpy was used for performance, so I'd like to keep it as long
> as we have no performance tests.
Replace memcpy with array ops:
ptr1[x..y] = ptr2[x2..y2];
Note that it's better to have them be pointers, as that avoids bounds
checks & D runtime magic.
If need be I can provide benchmarks but I'm certain from the days of
optimizing std.regex that it's faster or on par with memcpy.
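For concreteness, a small sketch of the slice-copy form. copyBytes is a hypothetical helper name, and this makes no claim about how the SHA1/MD5 code is actually laid out.

```d
// Copy `count` bytes from src to dst with a slice assignment on pointers.
// Slicing a raw pointer is not bounds checked, so the caller must make
// sure both regions are valid and do not overlap.
void copyBytes(ubyte* dst, const(ubyte)* src, size_t count) pure
{
    dst[0 .. count] = src[0 .. count];
}

unittest
{
    ubyte[8] buffer;                        // zero-initialized
    immutable ubyte[4] block = [1, 2, 3, 4];

    // Copy the block into the middle of the buffer.
    copyBytes(buffer.ptr + 2, block.ptr, block.length);
    assert(buffer == [0, 0, 1, 2, 3, 4, 0, 0]);
}
```

Unlike a call to memcpy, the slice assignment can live in a function marked pure.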