How to continue work on std.zip

Sun Oct 20 14:03:06 UTC 2019

Recently I fixed some bugs in std.zip, mainly concerning malware 
attacks. I would like to continue there, but I think its a good 
idea to clearify the roadmap, before I continue with more PRs. 
Here's a list of items, I've got in my mind now:

Extracting zip files
--------------------

* Add structs that directly map the data in the zip records. As 
Atila pointed
out, this should be done with serialisation instead of unions 
(what I planed
and what he convinced me not to do, because this results in 
endianess problems).
See below (*) for some suggestions, on how this might be done.

* use enforce instead of if ... throw like Vladimir suggested. Do 
you think I
should use enforce!zipException("message") or is it better to get 
rid of
zipException?

* fix bugs 20289 - 20294 and bug 18950 (this is part of bug 
20290).

* Add more unittests, especially for zip64 and to reach (almost) 
100%
coverage.

Documentation
-------------

* Reversing the order of the four items in this file would bring 
the more
important ones to the beginning. This would make for better 
documentation.

* Using unittests for examples (maybe replacing the two examples 
at the top of
the module).

* Rewriting some of the documentation for better ddoc 
conformation and clearer
descriptions.

Extensions
----------

* Support for ranges. For unzipping, a RandomAccessRange could be 
used, for
zipping an OutputRange could be provided. Also, the 
ArchiveMembers could be
accessed via an other RandomAccessRange. These ranges would (in 
some cases)
allow for lazy evaluation, which might result in a speed gain, 
when not all
elements of an archive need to be read (or kept in memory).

* Multidisk? I think, it's *not* worth it. Just causes trouble 
and no one
needs this nowadays.

* Support for versions < 2.0 (called version 20 in the docs)? 
Again I think
that is *not* worth it.

* Support for other compression methods? Maybe bzip2 and LZMA 
would be nice.
All other methods look history to me.

* Support for encryption? I'm not sure about legal issues here. 
Looks like
PKWARE has some copyrights here. Therefore I'd skip encryption. 
In general I
personally think, that it's better to use a separate encryption 
program,
instead of using encryption mechanisms inside of a fixed format. 
Sooner or
later a fixed format will produce security issues.

* Files larger than 4GB? I see no reason, why we should block 
them. All it
would probably need is to move from uint to ulong.

Concerning the order of all these points, I'd roughly go from top 
to bottom.

What do you think? Is all of this worth it? Or should part of it 
be removed?
Do you have other stuff, that you would like to see here?

=============================================================================

(*) Currently the mapping is done like this:

{
     // Read end record data
     _diskNumber = getUshort(i + 4);
     _diskStartDir = getUshort(i + 6);

     _numEntries = getUshort(i + 8);
     _totalEntries = getUshort(i + 10);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = getUint(i + 12);
     directoryOffset = getUint(i + 16);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using functions like

@safe @nogc pure nothrow ushort getUshort(uint i)
{
     ubyte[2] result = data[i .. i + 2];
     return littleEndianToNative!ushort(result);
}

I dislike this approach, because it's using magic constants I'd 
like to get rid of.

As D has no build in serialisation mechanism and neither Phobos 
has, there is the need to add a specialised one.

One idea:

struct EndOfCentralDirRecord
{
     ushort numberOfThisDisk;
     ushort numberOfCDDisk;
     ushort entriesOnThisDisk;
     ushort entries;
     uint size;
     uint offset;
     ushort commentLength;
}

{
     auto eocdr = conv!Test(data[offset .. offset + 
EndOfCentralDirLength);

     // Read end record data
     _diskNumber = eocdr.numberOfThisDisk;
     _diskStartDir = eocdr.numberOfCDDisk;

     _numEntries = eocdr.entriesOnThisDisk;
     _totalEntries = eocdr.entries;

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = eocdr.size;
     directoryOffset = eocdr.offset;

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

T conv(T)(ubyte[] a)
{
     auto tmp = T();

     foreach (member; __traits(allMembers, T))
     {
         alias type = typeof(mixin("T."~member));
         static if (is(type == ushort))
             mixin("tmp."~member~" = cast(ushort) (a[0] | 
(a[1]<<8));");
         else static if (is(type == uint))
             mixin("tmp."~member~" = cast(uint) (a[0] | (a[1]<<8) 
| (a[2]<<16) | (a[3]<<24));");
         else static if (is(type == ulong))
             mixin("tmp."~member~" = a[0] | (a[1]<<8) | (a[2]<<16) 
| (a[3]<<24)"~
                   "| (cast(ulong)(a[4] | (a[5]<<8) | (a[6]<<16) | 
(a[7]<<24))<<32);");
         else
             assert(false, "type "~member~" not supported");
         a = a[type.sizeof .. $];
     }

     return tmp;
}

One other idea motivated by some comment from Vladimir:

alias fromLE = littleEndianToNative;

align (1) // just to make sure there are no gaps
struct EndOfCentralDirRecord
{
     ubyte[2] numberOfThisDisk;
     ubyte[2] numberOfCDDisk;
     ubyte[2] entriesOnThisDisk;
     ubyte[2] entries;
     ubyte[4] size;
     ubyte[4] offset;
     ubyte[2] commentLength;
}

{
     auto eocdr = getRecord!EndOfCentralDirRecord(offset);

     // Read end record data
     _diskNumber = fromLE!ushort(eocdr.numberOfThisDisk);
     _diskStartDir = fromLE!ushort(eocdr.numberOfCDDisk);

     _numEntries = fromLE!ushort(eocdr.entriesOnThisDisk);
     _totalEntries = fromLE!ushort(eocdr.entries);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = fromLE!uint(eocdr.size);
     directoryOffset = fromLE!uint(eocdr.offset);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

private T getRecord(T)(ulong offset)
{
     union U
     {
         ubyte[T.sizeof] data;
         T result;
     }

     U tmp;
     tmp.data = data[offset, offset + T.sizeof];
     return tmp.result;
}