How to continue work on std.zip

Mon Oct 21 13:35:37 UTC 2019

On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis 
wrote:

First of all, thanks for pointing me to the ISO standard. I was 
not aware of this standard (the german wikipedia does not mention 
it and I must have overlooked it, when I skimmed through the 
english one). I agree, that we should stick to that standard 
which answers several of my questions.

Second, my goal is to improve Phobos. I'm pretty much aware, that 
this means to avoid breaking code at (almost) all costs. In case 
of std.zip the only "breaking" change will be to get rid of some 
constants (digiSignLength, diskNumber) that should have been 
private from the very beginning or are only usefull with 
multidisk support. And this is done with a full deprecation cycle.

>> * Add structs that directly map the data in the zip records. 
>> As Atila pointed
>> out, this should be done with serialisation instead of unions 
>> (what I planed
>> and what he convinced me not to do, because this results in 
>> endianess problems).
>> See below (*) for some suggestions, on how this might be done.
>
> What would be the point of having something so low-level? The 
> ZipMembers are already put in an AA in the ZipArchive.

Sorry, I think this was a missunderstanding. I don't want to 
change anything of the AA nor the members in that AA. For me the 
question is, how to get the data from this void[]-array into the 
AA. The current approach uses lot's of lines like "_diskNumber = 
getUshort(i + 4);" with magic constants (here 4), which is error 
prone. I'd prefere to get rid of these in favour of some structs 
to avoid adding bugs when fixing the existing issues.

>> * Reversing the order of the four items in this file would 
>> bring the more
>> important ones to the beginning. This would make for better 
>> documentation.
>
> That would be terrible for diffs and really wouldn't help much 
> with the documentation. There are already links to the 
> top-level symbols at the top of the documentation, and the 
> example at the top shows how the various types are related. 
> [...]

I disagree here. First of all, documentation is made for humans 
and this should be more important than diffs. (But maybe there is 
some aspect of diffs, that I'm not aware of. For me, this would 
be a larger change once, and after that change, the diffs would 
again be pretty small.)

 From my experience, the links at the top are not usefull for 
people looking the first time into the docs of some module. At 
that time, one often does not know, which item is the one they 
are looking for.

For example. When I first looked at the documentation of std.zip, 
I was looking for a way to extract an existing archive. I 
expected that to be done by a function called unzip, but found no 
such function. Skimming the example at the top I thought, this 
was about creating zip archives (The comment in the first line 
tells that this is about reading zip files; I don't know exactly, 
why I overlooked this. Maybe, because it's small letters in light 
gray; maybe, because it's english, maybe I did not expect it 
there; I can't tell). Therefore I scrolled down. I found 
ZipException which is obviously of no use. After that I found 
CompressionMethod which is of little use either. Next there is 
ArchiveMember, which is more interesting, but yet does not show a 
way on how to get to the content of a zip file.

Finally, if you do not overlook it (which is quite possible 
between all those members of ArchiveMember and ZipArchive), 
you'll find ZipArchive. But again you don't know how to use it. 
The constructor I was looking for is actually the second to last 
item, very well hidden.

I went into detail here, because I think this is important. I had 
this experience over and over again with the documentation of 
Phobos. Comparing this to the documentation of PHP [1], where I 
never had any problems with, I conclude that there is space for 
improvement. ;-)

But it's getting a little bit off topic. Maybe Phobos 
documentation should be discussed in a separate threat.

[1] https://www.php.net/manual/de/book.zip.php

>> * Using unittests for examples (maybe replacing the two 
>> examples at the top of the module).
>
> The example at the top of the module is useful and really 
> shouldn't be removed.

No, I don't want to remove them. Just moving them inside of a 
unittest to make sure, that documentation and implementation does 
not diverge too much.

>> * Rewriting some of the documentation for better ddoc 
>> conformation and clearer descriptions.
>
> I have no clue what you mean by better ddoc conformation.

Probably also a missunderstanding. What I mean is: As far as I 
know, all parameters of all functions should be listed; the 
return value should be mentioned and a "see also" section is 
recommended and stuff like this. Most of the functions do not 
have that information. For example: The documentation of 
"extractVersion" is "Read Only" which is not really useful, isn't 
it?

>> * Support for ranges. For unzipping, a RandomAccessRange could 
>> be used, for
>> zipping an OutputRange could be provided. Also, the 
>> ArchiveMembers could be
>> accessed via an other RandomAccessRange. These ranges would 
>> (in some cases)
>> allow for lazy evaluation, which might result in a speed gain, 
>> when not all
>> elements of an archive need to be read (or kept in memory).
>
> Given that ZipArchive already sticks everything in an AA, I'm 
> not sure what you'd gain by trying to add a random access range 
> as well. And random-access would be incompatible with lazy 
> evaluation.

This is probably only about the last range I mentioned above, 
isn't it? That's the one I myself are most unsure if it's going 
to be usefull. One could use AA.byValue etc instead and that's 
probably sufficient.

I added support for ranges in the list, because it was explicitly 
mentioned in the vision 2016H2 [2]. I think, there are use cases, 
where laziness can be applied. Think of an archive with lots of 
files, where only some of them are going to be extracted. The 
underlying file-access-range can use fseek and the like to only 
load the part of the archive, that's gona be needed. zip's 
structure allowes to fill in the AA (almost) completely by 
scanning the directory at the end of the file. I write "almost", 
because the real data content is still missing. But that can be 
retrieved, when it's going to be used.

[2] https://wiki.dlang.org/Vision/2016H2

> In order to actually have lazy evaluation, you'd have to have a 
> completely different design, because ZipArchive creates the AA 
> up front - and changing that would cause problems for existing 
> code.

I think, it can be done with the current (public) design. I would 
add a new constructor that takes a range instead of void[]. All 
else would happen underneath.

>> * Files larger than 4GB? I see no reason, why we should block
>> them. All it
>> would probably need is to move from uint to ulong.
>
>>From what little I understand, that would involve being zip64 
>>rather than normal zip,

Indeed, but zip64 is allready implemented. There is just still 
the limit on the filesize.