The review of std.hash package

Wed Aug 8 08:50:33 PDT 2012

On Wed, 08 Aug 2012 02:49:00 -0700,
Walter Bright <newshound2 at digitalmars.com> wrote:

> 
> It should accept an input range. But using an Output Range confuses
> me. A hash function is a reduce algorithm - it accepts a sequence of
> input values, and produces a single value. You should be able to
> write code like:
> 
>    ubyte[] data;
>    ...
>    auto crc = data.crc32();

auto crc = crc32Of(data);
auto crc = data.crc32Of(); //ufcs

This doesn't work with every InputRange yet, and that needs to be fixed.
It's quite a simple fix (at most 10 lines of code, one new overload) and
not an inherent problem of the API (see below for more).

> 
> For example, the hash example given is:
> 
>    foreach (buffer; file.byChunk(4096 * 1024))
>        hash.put(buffer);
>    auto result = hash.finish();
> 
> Instead it should be something like:
> 
>    auto result = file.byChunk(4096 * 1025).joiner.hash();

But it also says this:
//As digests implement OutputRange, we could use std.algorithm.copy
//Let's do it manually for now

You can basically do this with a range interface in 1 line:
----
import std.algorithm : copy;

auto result = copy(file.byChunk(4096 * 1024), hash).finish();
----
or with ufcs:
----
auto result = file.byChunk(4096 * 1024).copy(hash).finish();
----

OK, you have to initialize hash and you have to call finish. With a new
overload for digest it's as simple as this:
----
auto result = file.byChunk(4096 * 1024).digest!CRC32();
auto result = file.byChunk(4096 * 1024).crc32Of(); //with alias
----
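A minimal sketch of what that overload gives you, assuming the module
names under which this package later shipped in Phobos (std.digest
rather than std.hash). An in-memory buffer split with std.range.chunks
stands in for file.byChunk, and the helper name crcOfChunks is made up
for this example:

```d
import std.digest : digest;
import std.digest.crc : CRC32, crc32Of;
import std.range : chunks;

// Hypothetical helper: hash a buffer chunk by chunk through the
// range-accepting digest template.
ubyte[4] crcOfChunks(ubyte[] data, size_t chunkSize)
{
    // data.chunks(n) is an InputRange of ubyte[] slices, much like
    // what file.byChunk(n) yields.
    return digest!CRC32(data.chunks(chunkSize));
}

void main()
{
    auto data = new ubyte[](1000);
    foreach (i, ref b; data)
        b = cast(ubyte) i;

    // Hashing the chunked range gives the same result as hashing
    // the whole array in one call.
    assert(crcOfChunks(data, 64) == crc32Of(data));
}
```

The chunk size does not affect the result; it only changes how the data
is fed to the digest.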

The digests are OutputRanges, you can write data to them. There's
absolutely no need to make them InputRanges as they only produce 1
value, and the hash sum is produced at once, so there's no way to
receive the result in a partial way. A digest is very similar to
Appender and its .data property in this regard.
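To make the Appender comparison concrete, here is a small sketch, again
assuming the std.digest module names from current Phobos. The helper
names collectParts and hashParts are made up for this example:

```d
import std.array : appender;
import std.digest.crc : CRC32, crc32Of;

// Appender: put data in, read the accumulated value via .data.
ubyte[] collectParts(ubyte[] a, ubyte[] b)
{
    auto app = appender!(ubyte[])();
    app.put(a);
    app.put(b);
    return app.data;
}

// Digest: put data in, read the single result via finish().
ubyte[4] hashParts(ubyte[] a, ubyte[] b)
{
    CRC32 hash;
    hash.start();
    hash.put(a);
    hash.put(b);
    return hash.finish();
}

void main()
{
    ubyte[] a = [1, 2, 3];
    ubyte[] b = [4, 5, 6];

    assert(collectParts(a, b) == a ~ b);
    // Feeding the parts one after another equals hashing the whole.
    assert(hashParts(a, b) == crc32Of(a ~ b));
}
```

In both cases the object is only a sink while data arrives, and the
result is fetched once at the end.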

The put function could accept an InputRange, but I think there was a
thread recently which said this is evil for OutputRanges as the same
feature can be achieved with copy.

There's also no big benefit in doing it that way. If your InputRange is
really unbuffered you could avoid double buffering. But then you
transfer data byte by byte, which will be horribly slow.
If your InputRange has an internal buffer, copy should just copy from
that internal buffer to the 64 byte buffer used inside the digest
implementation.
This double buffering could only be avoided if the put function
accepted an InputRange and could supply a buffer for that InputRange so
the InputRange could write directly into the 64 byte buffer. But
there's nothing like that in Phobos, so this is all speculation.

(Also the internal buffer is only used for the first 64 bytes (or less)
of the supplied data. The rest is processed without copying. It could
probably be optimized so that there's absolutely no copying as long as
the input buffer length is a multiple of 64)

> 
> The magic is that any input range that produces bytes could be used,
> and that byte producing input range can be hooked up to the input of
> any reducing function.
See above. Every InputRange with byte element type does work. You just
have to use copy.

> 
> The use of a member finish() is not what any other reduce algorithm
> has, and so the interface is not a general component interface.

It's a struct with state, not a simple reduce function, so it needs that
finish member. It works that way in every other language (and this
is not because those languages don't have ranges; streams and iterators
(as in C#) work exactly the same in this case).

Let's take a real-world example: You want to download a huge file with
std.net.curl and hash it on the fly. Reading it completely into a buffer
is not possible (large file!). Now std.net.curl has a callback
interface (which is forced on us by libcurl). How would you map that
into an InputRange? (The byLine range in std.net.curl is eager,
byLineAsync needs an additional thread). A newbie trying to do that
will despair as it would work just fine in every other language, but
D forces that InputRange interface.

Implementing it as an OutputRange is much better. The described
scenario works fine and hashing an InputRange also works fine - just
use copy. OutputRange is much more universal for this use case.
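A sketch of the callback scenario, assuming the std.digest module names
from current Phobos. To keep it runnable without a network, the
function fakeDownload is a made-up stand-in for a callback-driven API
like the one libcurl forces on std.net.curl; hashWhileDownloading is
likewise a hypothetical name:

```d
import std.digest.crc : CRC32, crc32Of;

// Hypothetical stand-in for a callback-driven download: it pushes data
// to onReceive chunk by chunk instead of letting the caller pull it.
void fakeDownload(const(ubyte)[] payload, size_t chunkSize,
                  scope void delegate(const(ubyte)[]) onReceive)
{
    assert(chunkSize > 0);
    while (payload.length)
    {
        auto n = payload.length < chunkSize ? payload.length : chunkSize;
        onReceive(payload[0 .. n]);
        payload = payload[n .. $];
    }
}

ubyte[4] hashWhileDownloading(const(ubyte)[] payload, size_t chunkSize)
{
    CRC32 hash;
    hash.start();
    // The callback simply forwards each chunk to the digest.
    fakeDownload(payload, chunkSize,
                 (const(ubyte)[] chunk) { hash.put(chunk); });
    return hash.finish();
}

void main()
{
    auto payload = new ubyte[](1000);
    foreach (i, ref b; payload)
        b = cast(ubyte) i;

    // Same result as hashing the complete payload at once.
    assert(hashWhileDownloading(payload, 64) == crc32Of(payload));
}
```

No InputRange adapter is needed: whoever produces the data calls put,
and the whole payload never has to be in memory at once.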

However, I do agree digest!Hash, md5Of, sha1Of should have an additional
overload which takes an InputRange. It would be implemented with copy
and be a nice convenience function.
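Roughly, such an overload could look like the sketch below. It assumes
the std.digest module names from current Phobos, and the name
digestRange is made up so it doesn't clash with the real digest
template:

```d
import std.algorithm.mutation : copy;
import std.digest : DigestType, isDigest;
import std.digest.crc : CRC32, crc32Of;
import std.range : chunks, isInputRange;

// Hypothetical convenience function: hash any InputRange by copying
// it into a freshly started digest.
DigestType!Hash digestRange(Hash, Range)(Range input)
if (isDigest!Hash && isInputRange!Range)
{
    Hash hash;
    hash.start();
    // Pass a pointer so copy writes into this instance and not
    // into a by-value copy of the struct.
    copy(input, &hash);
    return hash.finish();
}

void main()
{
    auto data = new ubyte[](1000);
    foreach (i, ref b; data)
        b = cast(ubyte) i;

    // Works for a range of chunks as well as for a range of bytes.
    assert(digestRange!CRC32(data.chunks(64)) == crc32Of(data));
    assert(digestRange!CRC32(data) == crc32Of(data));
}
```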

> 
> I know the documentation on ranges in Phobos is incomplete and
> confusing.

Especially for copy, as the documentation doesn't indicate the line I
posted could work in any way ;-)