How to continue work on std.zip

Tue Oct 22 23:54:52 UTC 2019

On Monday, October 21, 2019 7:35:37 AM MDT berni44 via Digitalmars-d wrote:
> On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis
> wrote:
> >> * Add structs that directly map the data in the zip records.
> >> As Atila pointed
> >> out, this should be done with serialisation instead of unions
> >> (what I planed
> >> and what he convinced me not to do, because this results in
> >> endianess problems).
> >> See below (*) for some suggestions, on how this might be done.
> >
> > What would be the point of having something so low-level? The
> > ZipMembers are already put in an AA in the ZipArchive.
>
> Sorry, I think this was a missunderstanding. I don't want to
> change anything of the AA nor the members in that AA. For me the
> question is, how to get the data from this void[]-array into the
> AA. The current approach uses lot's of lines like "_diskNumber =
> getUshort(i + 4);" with magic constants (here 4), which is error
> prone. I'd prefere to get rid of these in favour of some structs
> to avoid adding bugs when fixing the existing issues.

The constants don't look even vaguely magical. They're clearly simply based
on the sizes of the values. That being said, if I were to fix it, I'd
probably use append and read from std.bitmanip to replace all of the put*
and get* calls, since they would take care of the sizes for you without
having to deal with them directly at all. So, instead of

    putUshort(i + 4,  de._madeVersion);
    putUshort(i + 6,  de.extractVersion);
    putUshort(i + 8,  de.flags);
    putUshort(i + 10, de._compressionMethod);
    putUint  (i + 12, cast(uint) de.time);
    putUint  (i + 16, de.crc32);
    putUint  (i + 20, de.compressedSize);
    putUint  (i + 24, de.expandedSize);
    putUshort(i + 28, cast(ushort) de.name.length);
    putUshort(i + 30, cast(ushort) de.extra.length);
    putUshort(i + 32, cast(ushort) de.comment.length);
    putUshort(i + 34, de.diskNumber);
    putUshort(i + 36, de.internalAttributes);
    putUint  (i + 38, de._externalAttributes);
    putUint  (i + 42, de.offset);

you'd get something like

    buffer.append!ushort(de._madeVersion);
    buffer.append!ushort(de.extractVersion);
    buffer.append!ushort(de.flags);
    buffer.append!ushort(de._compressionMethod);
    buffer.append!uint  (cast(uint) de.time);
    buffer.append!uint  (de.crc32);
    buffer.append!uint  (de.compressedSize);
    buffer.append!uint  (de.expandedSize);
    buffer.append!ushort(cast(ushort) de.name.length);
    buffer.append!ushort(cast(ushort) de.extra.length);
    buffer.append!ushort(cast(ushort) de.comment.length);
    buffer.append!ushort(de.diskNumber);
    buffer.append!ushort(de.internalAttributes);
    buffer.append!uint  (de._externalAttributes);
    buffer.append!uint  (de.offset);

Then all of the constants disappear. Also, then std.zip isn't trying to
replace functionality that's already in std.bitmanip (like it's doing right
now). The sort of thing that std.zip is doing here is pretty much exactly
what read, peek, write, and append in std.bitmanip were written to handle.

> >> * Reversing the order of the four items in this file would
> >> bring the more
> >> important ones to the beginning. This would make for better
> >> documentation.
> >
> > That would be terrible for diffs and really wouldn't help much
> > with the documentation. There are already links to the
> > top-level symbols at the top of the documentation, and the
> > example at the top shows how the various types are related.
> > [...]
>
> I disagree here. First of all, documentation is made for humans
> and this should be more important than diffs. (But maybe there is
> some aspect of diffs, that I'm not aware of. For me, this would
> be a larger change once, and after that change, the diffs would
> again be pretty small.)
>
>  From my experience, the links at the top are not usefull for
> people looking the first time into the docs of some module. At
> that time, one often does not know, which item is the one they
> are looking for.
>
> For example. When I first looked at the documentation of std.zip,
> I was looking for a way to extract an existing archive. I
> expected that to be done by a function called unzip, but found no
> such function. Skimming the example at the top I thought, this
> was about creating zip archives (The comment in the first line
> tells that this is about reading zip files; I don't know exactly,
> why I overlooked this. Maybe, because it's small letters in light
> gray; maybe, because it's english, maybe I did not expect it
> there; I can't tell). Therefore I scrolled down. I found
> ZipException which is obviously of no use. After that I found
> CompressionMethod which is of little use either. Next there is
> ArchiveMember, which is more interesting, but yet does not show a
> way on how to get to the content of a zip file.
>
> Finally, if you do not overlook it (which is quite possible
> between all those members of ArchiveMember and ZipArchive),
> you'll find ZipArchive. But again you don't know how to use it.
> The constructor I was looking for is actually the second to last
> item, very well hidden.
>
> I went into detail here, because I think this is important. I had
> this experience over and over again with the documentation of
> Phobos. Comparing this to the documentation of PHP [1], where I
> never had any problems with, I conclude that there is space for
> improvement. ;-)

Reordering stuff makes it harder to track changes over time, and with a file
as small as std.zip, I have a very hard time believing that changing the
order is going to help much - especially since you need to be aware of all
of the types in std.zip when using it anyway. The main way to make it easier
for people to figure out where they need to know to start is by improving
the module-level documentation. For instance, std.zip could certainly use a
table similar to what modules like std.array and std.range have:

https://dlang.org/phobos/std_array.html
https://dlang.org/phobos/std_range.html

Having an overview of that explains what the pieces are and provides links
to them is going to work far better than simply reorganizing the file will
do. If the module-level documentation doesn't give enough information for
people to know what they should be looking for in the module for the
functionality they want, then the module-level documentation needs to be
improved, regardless of what order the symbols are in in the file. And if
the module-level documentation does its job properly, the order of the
symbols in the documentation should be irrelevant.

> >> * Support for ranges. For unzipping, a RandomAccessRange could
> >> be used, for
> >> zipping an OutputRange could be provided. Also, the
> >> ArchiveMembers could be
> >> accessed via an other RandomAccessRange. These ranges would
> >> (in some cases)
> >> allow for lazy evaluation, which might result in a speed gain,
> >> when not all
> >> elements of an archive need to be read (or kept in memory).
> >
> > Given that ZipArchive already sticks everything in an AA, I'm
> > not sure what you'd gain by trying to add a random access range
> > as well. And random-access would be incompatible with lazy
> > evaluation.
>
> This is probably only about the last range I mentioned above,
> isn't it? That's the one I myself are most unsure if it's going
> to be usefull. One could use AA.byValue etc instead and that's
> probably sufficient.
>
> I added support for ranges in the list, because it was explicitly
> mentioned in the vision 2016H2 [2]. I think, there are use cases,
> where laziness can be applied. Think of an archive with lots of
> files, where only some of them are going to be extracted. The
> underlying file-access-range can use fseek and the like to only
> load the part of the archive, that's gona be needed. zip's
> structure allowes to fill in the AA (almost) completely by
> scanning the directory at the end of the file. I write "almost",
> because the real data content is still missing. But that can be
> retrieved, when it's going to be used.
>
> [2] https://wiki.dlang.org/Vision/2016H2

You'd have to be careful with this, because right now, if you call
ZipArchive's directory, you get the same AA that it has internally, and
accessing any values in it does not cause any calculations to be done.
Having them be done lazily might be nice, but you also wouldn't want to end
up having stuff calculated every time you access one of the ArchiveMembers
in the AA. And making it so that ArchiveMember is somehow lazy but doesn't
redo any work means that it won't work when it's const. Fortunately,
functions like expandedData and compressedData are not const, so maybe some
of what's done there could be made lazy without breaking code, but it would
have to be done carefully.

The fact that std.zip is designed around filling in an AA, and it provides
direct access to that AA pretty much destroys most of the possible
range-based solutions - especially the lazy ones. If you can make some
aspect of how the internals work lazier without causing problems, then
that's fine, but I don't see much point in trying to provide a different
range API on top of what's there when it's basically just an AA.

I could easily see having a lazy, range-based solution which did something
like iterate through the files in the zip file in whatever order they're in
in the zip file, but that would require a very different design from what we
currently have with std.zip. Also, whether it would be better or worse would
largely depend on what the user code was trying to do.

Certainly, if you can come up with something clever that improves things
while not breaking existing code, then that's great, but the fact that the
code provides direct access to the AA (and as the primary way to access the
data no less) makes it pretty hard to change. Maybe it would make sense to
provide some sort of range-based solution beside what we currently have in
std.zip, but that's the sort of thing that should probably be put on
code.dlang first and battle-tested a bit before being put into Phobos, since
it's essentially designing a new set of functionality for accessing and
processing zip files rather than simply improving what's already there.

> > In order to actually have lazy evaluation, you'd have to have a
> > completely different design, because ZipArchive creates the AA
> > up front - and changing that would cause problems for existing
> > code.
>
> I think, it can be done with the current (public) design. I would
> add a new constructor that takes a range instead of void[]. All
> else would happen underneath.
>
> >> * Files larger than 4GB? I see no reason, why we should block
> >> them. All it
> >> would probably need is to move from uint to ulong.
> >>
> >>From what little I understand, that would involve being zip64
> >>rather than normal zip,
>
> Indeed, but zip64 is allready implemented. There is just still
> the limit on the filesize.

Just make sure that whatever you're doing doesn't result in a file being
treated as a zip file rather than a zip64 file when it can't legally be a
zip file.

- Jonathan M Davis