dmdz
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Mar 18 13:48:15 PDT 2010
On 03/18/2010 03:11 PM, Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> Reading the file header (e.g. first 512 bytes) and then matching
>> against archive signatures is, I think, a very nice touch. (I was only
>> thinking of matching by file name.) There is a mild complication - you
>> can't close and reopen the archive, so you need to pass those 512
>> bytes to the archiver along with the rest of the stream. This is
>> because the stream may not be rewindable, as is the case with pipes.
>
> The reasons for reading the file to determine the archive type are:
>
> 1. Files sometimes lose their extensions when being transferred around.
> I sometimes have this problem when downloading files from the internet -
> Windows will store it without an extension.
>
> 2. Sometimes I have to remove the extension when sending a file via
> email, as stupid email readers block certain email messages based on
> file attachment extensions.
>
> 3. People don't always put the right extension onto the file.
>
> 4. Passing an archive of one type to a reader for another type causes
> the reader to crash (yes, I know, readers should be more robust that
> way, but reality is reality).
Makes sense.
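To make that concrete, here is a rough sketch of how the signature matching could look in D (a sketch only - detectArchiveType and ArchiveType are names made up for illustration, and the magic numbers cover just a few common formats):

import std.stdio;

enum ArchiveType { unknown, zip, gzip, tar }

ArchiveType detectArchiveType(const(ubyte)[] header)
{
    // Match well-known magic numbers at the start of the buffer.
    if (header.length >= 4 && header[0 .. 4] == cast(const(ubyte)[]) "PK\x03\x04")
        return ArchiveType.zip;
    if (header.length >= 2 && header[0 .. 2] == cast(const(ubyte)[]) "\x1f\x8b")
        return ArchiveType.gzip;
    // The "ustar" magic sits at offset 257 of a tar header.
    if (header.length >= 262 && header[257 .. 262] == cast(const(ubyte)[]) "ustar")
        return ArchiveType.tar;
    return ArchiveType.unknown;
}

void main()
{
    ubyte[512] buf;
    // Read the first (up to) 512 bytes exactly once. With a pipe there is
    // no rewinding, so these bytes must be handed to the archiver together
    // with the remainder of stdin.
    auto header = stdin.rawRead(buf[]);
    writeln("detected: ", detectArchiveType(header));
}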
> Is it really necessary to support streaming archives?
It is not necessary, only vital.
> The reason I ask
> is we can nicely separate building/reading archives from file I/O. The
> archives can be entirely done in memory. Perhaps if an archive is being
> streamed, the program can simply accumulate it all in memory, then call
> the archive library functions.
This is completely nonscalable! 90% of all my archive manipulation
involves streaming, and I wouldn't dream of loading most of those files
into RAM. They are huge!
I'll paste from a script I'm working on right now:
if [[ ! -f $D/sentences.num.gz ]]; then
    echo '# Text to numeric...'
    ./txt2num.d $D/voc.txt \
        < <(pv $D/sentences.txt.gz | gunzip) \
        > >(gzip >$D/sentences.num.tmp.gz)
    mv $D/sentences.num.tmp.gz $D/sentences.num.gz
fi
That takes a good amount of time to run because the .gz involved is
2,180,367,456 bytes _after_ compression. Note how compression is
involved on both ends - decompressing on read and recompressing on write.
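The same bounded-memory pattern, sketched in D (assuming std.zlib's streaming Compress/UnCompress classes; transform() is a placeholder standing in for the txt2num step, and header-format details are elided):

import std.stdio, std.zlib;

// Placeholder for the actual per-chunk work (e.g. the text-to-numeric step).
const(void)[] transform(const(void)[] chunk) { return chunk; }

void main()
{
    auto inflate = new UnCompress();
    auto deflate = new Compress();

    // One fixed-size chunk at a time; memory use stays bounded no matter
    // how large the archive is.
    foreach (chunk; stdin.byChunk(64 * 1024))
    {
        auto plain  = inflate.uncompress(chunk);          // unzip on read
        auto packed = deflate.compress(transform(plain)); // zip on write
        stdout.rawWrite(cast(const(ubyte)[]) packed);
    }
    // Drain whatever the decompressor still buffers, then finish the stream.
    auto tail = transform(inflate.flush());
    if (tail.length)
        stdout.rawWrite(cast(const(ubyte)[]) deflate.compress(tail));
    stdout.rawWrite(cast(const(ubyte)[]) deflate.flush());
}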
It would be great if we all went to the utmost lengths to distance
ourselves from such nonscalable thinking. It's the root reason why the
wc sample program on digitalmars.com is _inappropriate_ and _damaging_
to the reputation of the language, and also the reason why the hash
table implementation performs so poorly on large data - i.e., exactly
when it matters. It's the kind of thinking that stems from "But I don't
have _one_ file larger than 1GB anywhere on my hard drive!", which you
have repeatedly offered as if it were a solid argument. Well, if you
don't have one, you'd better get some.
Nobody's going to give us a cookie if we process 50KB files 10 times
faster than Perl or Python. Where it does matter is large data, and I'd
be in a much better mood if I didn't feel my beard growing while waiting
on a program that uses hashes to build a large index file.
Andrei