The review of std.hash package

Tue Aug 7 15:55:47 PDT 2012

Am 07.08.2012 21:53, schrieb Johannes Pfau:
> Am Tue, 07 Aug 2012 20:07:07 +0200
> schrieb David <d at dav1d.de>:
>
>> Is the API already set in stone?
> No, as Jonathan already mentioned reviewing the API is an important
> part of the review. (Actually the most important part for this review,
> as we define one API for many implementations)
>
>> Using .start and .finish, feels like the use of OpenSSL. Why not
>> making a Hash object not "restartable" (since e.g. MD5.start() sets
>> `this` just to MD5.init)
> I'm not an expert in the hash/digest area. We had incompatible APIs for
> CRC32, MD5 and SHA1 and I proposed we should have a unique interface.
> As I had some free time and nobody else stepped up I implemented it,
> asking for expert help in the newsgroup. This probably also explains
> why this API isn't revolutionary, it's inspired by other designs
> (tango, .net, ...).
>
> BTW: making it possible to wrap OpenSSL and other C hash libraries was
> also a goal. This is one reason why a start() function is necessary.
> OK, that was the intro ;-)
>
> start is here as structs can't have default constructors. For all
> current hashes it's really mostly useless (except it can be used as a
> simple way to reset a hash, which is also why the hash structs don't
> have a reset member), but I just don't know if there are hashes which
> would require a more complex (runtime) initialization.

Ok this point with the one above makes sense (I implemented my OpenSSL 
hashing wrapper as a class, initilaization is done in the constructor), 
it still doesn't feel right, if you have to call .start first. What 
about a struct-constructor which calls .start internally, so you get a 
bit more of a modern API (imo) and you're still able to implement the 
same interface for any kind of wrapper/own implementation/whatever.

> Classes don't have a start method as they can do that initialization in
> the default constructor. But the default constructor can't be used as a
> cheap way to reset a hash, therefore the reset function was added to
> the OOP interface.

Ok. (more below)

>> and making finish private and implementing a
>> digest and hexdigest property
> property is not a good idea, as finish must reset the internal state
> for some objects, so you can't access the property repeatedly. It'd
> have to be a function.

Well, you could store the result internally, property generates on the 
first call the digest, stores it (that's not too much, 16 byte for a 
md5) and returns the already stored value on the 2nd call.

>> which calls finish and returns an
>> ubyte[] array (or string for hexdigest)
>
> This could be done and is probably a matter of taste. I prefer the free
> function in this case, it makes clear that digest() really is a generic
> function which can be used with any hash. It's also one function less
> which must be implemented by hash writers. I'd like to keep the number
> of members required to implement a hash as low as possible.

With digest you mean: 
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html#digest ?

You normally always want the hexdigest (you barely need "real" digest), 
so it's a matter of convenience. Well, thanks to UFCS, you can call it 
like a method.

> (This is however unrelated to the actual naming of the functions. If
> you think 'digest' is not a good name, please let me know)

I think that's a fitting name, but I am not a hashing expert.

>> , this would also eliminate
>> the need of the `digest` wrapper (e.g. you can mixin these properties
>> with template mixins, so no code duplication)
>
> there's actually no code duplication. 'digest' is the only
> implementation, sha1Of, md5Of are just aliases.

Yeah, I meant, you would have to implement these properties for each 
hash-type and in the end, all properties do the same (calling finish and 
maybe setting some internal flags, buffers), so you can put them into a 
template mixin and mixin them for each hash-type.

>> Also I am not sure if we want to use `hash.put` instead of
>> `hash.update` (which is used by python and openssl). But that's just
>> a minor point (e.g. perl uses .add)
>
> I initially called it update, I agree it's a better name. But phobos is
> all about ranges, so we need the put member anyway. Providing an
> 'update' alias just for the name isn't worth it, imho (see above,
> keeping the number of hash members low).

I have to agree, an alias would produce more confusion (what two methods 
which do the same?)

>> Furthermore, I don't like the `sha1Of` and `md5Of` etc. functions,
>> why not having a static method? e.g. MD5.hexdigest("input"), which
>> returns the hexdigest and not a MD5 object.
>
> Same reason as above, sha1Of, md5Of are just convenience aliases, it
> could be argued you don't need those at all. You can use 'digest'
> directly and it will provide exactly the same result. So hash writers
> don't have to implement anything to support digest, providing an alias
> is optional and just 1 line of code. A static method would have to be
> implemented by every hash.

Yes, you would have to implement it for every hash, but that's 3 lines:

static string hexdigest(void[] data) {
	return toHexString(digest!(typeof(this))("yay"));
}

> Yes you could use mixins to make that
> painless, but I think it's still to much work. (And mixins can't be
> documented in DDOC, but that's an unrelated compiler issue)

> BTW: it is implemented as a member for the OOP API, as we can just
> implement it in the base interface, so the actual implementations don't
> have to care about it.

I've seen that, but I am wondering why it's not a static method, the 
creation of the Hash Object could be hidden inside. You might say, this 
would allocate a class instance just for a single use, without the 
possibility to reuse it. Correct, but I think that's the use of such a 
function, if you just need a Hash, nothing else, otherwise you could use 
the class "directly", with all its other functionality.

>> One more thing, `.reset` does the same as `start`? If so, why do both
>> exist?
> See above. It's because start is used in the template/struct API and
> structs can't have default constructors, so we need start.
>
> The OOP/class api can use a default constructor. But unlike the start
> function it can't work as a reset function, so there's an explicit
> reset function.

I am not sure what do think about that. On the one side it's useful if 
you really need that speed, but then I think, is it really worth it 
resetting the state hidden in a function. If you really want to avoid 
another allocation you could use emplace (if I didn't misunderstand the 
use of emplace). To quote from the Python Zen:
"Explicit is better than implicit."

>> (I also find the examples quite confusing, why do you want to
>> reset the hash onject?
> My approach was to document how to use the functions, not as much when
> it makes sense to use them. Maybe I should care more about that?

When I saw the documentation/examples, I was confused, why do you want 
to reset a hash object, so does it really reset the state? So I had to 
take a look into the code before I was sure, it surely does reset the 
whole thingy (is it called context?).

> It's more efficient than allocating a new hash with 'new' (or do you
> mean why use reset if finish resets anyway? It's useful if you put some
> data into a hash, then decide you don't want the hash (because the
> user aborts the operation, network connection is lost,...) but you still
> need the hash object later on.) You could just call finish and
> disregard the result, but the reset implementation is faster. I agree
> there's probably no _strong_ need for reset, but I think it doesn't do
> any harm.

"Explicit is better than implicit."

>> Well, that's it from my side :)
>
> Thanks for your review!

Thanks for writing std.hash, I think lots of people need it.