[OT] shuffling lines in a stream
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Oct 10 07:37:05 PDT 2008
KennyTM~ wrote:
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>>> Andrei Alexandrescu wrote:
>>>>>> The high traffic enjoyed by "random k-sample of a file" confirms
>>>>>> that plenty of people around here have idle cycles ready to be
>>>>>> spent on fun tasks.
>>>>>>
>>>>>> In that vein, here's another problem. Say you want to randomly
>>>>>> shuffle lines in a stream. You don't know the size beforehand, and
>>>>>> one pass is preferable. The stream can be larger than the working
>>>>>> memory, but not an unbounded amount of times larger. Also lines in
>>>>>> the stream can be unbounded, but it is guaranteed that at least a
>>>>>> nonzero fraction of the file fits in working memory.
>>>>>>
>>>>>> You can use temporary files, but of course minimizing their total
>>>>>> size is recommended. Speed is of course important. Draft your
>>>>>> solution and explain why you think it's correct and advantageous.
>>>>> Well, my temporary file contains the whole stream. Because there
>>>>> is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load
>>>> the entire file in memory.
>>>
>>> This implies though that I cannot start output before I read all the
>>> lines from the stream. Therefore I must store those lines somewhere.
>>
>> Creating temporary files is fine as long as you never need to store
>> the entire input in memory.
>>
>> Andrei
>
> So HDD (or any file system) is not counted as memory? Then just store
> all temporary variables to HDD.
This takes us back to one lseek per line. You can create temporary
files. They behave as files usually do - they are better accessed in
sequential patterns and not too many times.
Andrei
More information about the Digitalmars-d
mailing list