How to continue work on std.zip

Mon Oct 21 08:44:30 UTC 2019

On Sunday, October 20, 2019 8:03:06 AM MDT berni44 via Digitalmars-d wrote:
> Recently I fixed some bugs in std.zip, mainly concerning malware
> attacks. I would like to continue there, but I think its a good
> idea to clearify the roadmap, before I continue with more PRs.
> Here's a list of items, I've got in my mind now:
>
> Extracting zip files
> --------------------
>
> * Add structs that directly map the data in the zip records. As
> Atila pointed
> out, this should be done with serialisation instead of unions
> (what I planed
> and what he convinced me not to do, because this results in
> endianess problems).
> See below (*) for some suggestions, on how this might be done.

What would be the point of having something so low-level? The ZipMembers are
already put in an AA in the ZipArchive. So, the basic design of std.zip is
pretty high-level, whereas something like accessing the low-level zip
records would fit better in a solution that was lower level and potentially
lazy.

> * use enforce instead of if ... throw like Vladimir suggested. Do
> you think I
> should use enforce!zipException("message") or is it better to get
> rid of
> zipException?

Getting rid of ZipException would break code that currently uses std.zip and
wouldn't necessarily be a desirable change anyway. enforce!ZipException is
perfectly fine.

> * fix bugs 20289 - 20294 and bug 18950 (this is part of bug
> 20290).
>
> * Add more unittests, especially for zip64 and to reach (almost)
> 100%
> coverage.
>
>
> Documentation
> -------------
>
> * Reversing the order of the four items in this file would bring
> the more
> important ones to the beginning. This would make for better
> documentation.

That would be terrible for diffs and really wouldn't help much with the
documentation. There are already links to the top-level symbols at the top
of the documentation, and the example at the top shows how the various
types are related. It might make sense to add a table at the top with a
short description of each top-level symbol, but rearranging the code just to
change the documentation is needless churn - especially in such a small
file.

> * Using unittests for examples (maybe replacing the two examples
> at the top of
> the module).

The example at the top of the module is useful and really shouldn't be
removed. However, adding more documented unittest blocks as examples for
specific symbols would be an improvement.

> * Rewriting some of the documentation for better ddoc
> conformation and clearer
> descriptions.

I have no clue what you mean by better ddoc conformation. As far as ddoc
goes, what's there looks perfectly fine to me. If you can come up with
better descriptions for stuff, that's fine, but I don't see anything
obviously wrong with what's there.

> Extensions
> ----------
>
> * Support for ranges. For unzipping, a RandomAccessRange could be
> used, for
> zipping an OutputRange could be provided. Also, the
> ArchiveMembers could be
> accessed via an other RandomAccessRange. These ranges would (in
> some cases)
> allow for lazy evaluation, which might result in a speed gain,
> when not all
> elements of an archive need to be read (or kept in memory).

Given that ZipArchive already sticks everything in an AA, I'm not sure what
you'd gain by trying to add a random access range as well. And random-access
would be incompatible with lazy evaluation. In order to have random access,
you'd basically have to stick everything in a dynamic array instead of an
AA, which doesn't fit with the current design and wouldn't be lazy either.

In order to actually have lazy evaluation, you'd have to have a completely
different design, because ZipArchive creates the AA up front - and changing
that would cause problems for existing code. There's certainly something to
be said for a design that lazily iterates through a zip archive, but from
what I can tell, that would require a completely different type from
ZipArchive, and I'd argue that if that's the sort of thing that you're
looking to do, it would make a lot more sense to just create a fresh
solution and put it on code.dlang.org instead of changing std.zip, since it
really doesn't fit into the current design, and moving away from the AA
design would break existing code.

> * Multidisk? I think, it's *not* worth it. Just causes trouble
> and no one
> needs this nowadays.

Maybe it a more complex solution, but std.zip is very basic. Also, looking
at the wikipedia page for zip, it looks like the current design of std.zip
is probably pretty similar to what ISO/IEC 21320-1:2015 requires, and
spanning multiple volumes is not supported by that standard. I don't know
that we need to stick to that standard, but it might make sense to just
target what the ISO standard does for std.zip and let anything more complex
be in a more complex solution code.dlang.org if someone feels the need to
write such a solution.

> * Support for versions < 2.0 (called version 20 in the docs)?
> Again I think
> that is *not* worth it.

Wikipedia doesn't even inclued versions older than that in the versions that
it lists, and 2.0 was released in 1993. Really, the question should be
whether we include support for stuff newer than that, not whether we support
anything older.

> * Support for other compression methods? Maybe bzip2 and LZMA
> would be nice.
> All other methods look history to me.

That could be useful, but we'd like to avoid having Phobos depend on any
more external libraries (having curl has caused us enough problems already),
so we'd need implementations for them inside of Phobos. It also wouldn't be
ISO compatible, so if we decided that that's what we were targeting, then it
would not be appropriate.

> * Support for encryption? I'm not sure about legal issues here.
> Looks like
> PKWARE has some copyrights here. Therefore I'd skip encryption.
> In general I
> personally think, that it's better to use a separate encryption
> program,
> instead of using encryption mechanisms inside of a fixed format.
> Sooner or
> later a fixed format will produce security issues.

I'd suggest skipping it at this point. If nothing else, doing it correctly
will likely require external libraries. It would be better for such a
solution to exist on code.dlang.org rather than be part of Phobos. It's also
explicitly banned by the ISO standard.

> * Files larger than 4GB? I see no reason, why we should block
> them. All it
> would probably need is to move from uint to ulong.

>From what little I understand, that would involve being zip64 rather than
normal zip, which would involve more than simply going from uint to ulong,
and non-zip64 files do not support larger than 4 GiB. So, while adding
support for larger files wouldn't necessarily be a problem, it needs to be
done in a way that's standards compatible and without requiring that zip
files be zip64 if they don't need to be.

> Concerning the order of all these points, I'd roughly go from top
> to bottom.
>
> What do you think? Is all of this worth it? Or should part of it
> be removed?
> Do you have other stuff, that you would like to see here?

Improvements to std.zip which expand its capabilitiues without breaking
existing code are fine, but at some point, if you want a full, flexible
implementation, it's probably better to come up with something new and put
it on code.dlang.org. Also including stuff like bzip2 or encryption would
likely be better in a 3rd party library that can use existing C libraries.
Without digging into the whole issue in detail, I'd be tempted to argue that
we should just target the ISO standard with std.zip and leave anything
fancier to code.dlang.org.

- Jonathan M Davis