[OT] shuffling lines in a stream
BLS
nanali at nospam-wanadoo.fr
Fri Oct 10 13:03:59 PDT 2008
Jason House wrote:
> Andrei Alexandrescu Wrote:
>
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>> Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file" confirms that
>>>> plenty of people around here have idle cycles ready to be spent on fun
>>>> tasks.
>>>>
>>>> In that vein, here's another problem. Say you want to randomly shuffle
>>>> lines in a stream. You don't know the size beforehand, and one pass is
>>>> preferable. The stream can be larger than the working memory, but not an
>>>> unbounded amount of times larger. Also lines in the stream can be
>>>> unbounded, but it is guaranteed that at least a nonzero fraction of the
>>>> file fits in working memory.
>>>>
>>>> You can use temporary files, but of course minimizing their total size
>>>> is recommended. Speed is of course important. Draft your solution and
>>>> explain why you think it's correct and advantageous.
>>> Well, my temporary file contains the whole stream, because there is a
>>> probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load the
>> entire file in memory.
>>
>> Andrei
>
> What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. Some library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
>
> I look forward to the next challenge ;)
Hi Jason,
I think (I may be wrong) that the normal distribution is the key to
optimizing speed. Let's say the stream Andrei is talking about is a
book. Each sentence in this book consists of N characters (the end of
line is a DOT). Now we read the first ~10 pages, character by
character. Based on this information we can say that each sentence has
an average size of 100 characters, and the normal distribution looks
similar to:
no graphic possible :(
so see Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution
Indeed, NOT the solution Andrei is looking for, because indexing is
still necessary. But since finding the EOL/DOT is probably faster when
we search a slice from position 90 to 110 (in sentence #1, i.e. +- 10
percent around our average sentence size), this could already be a win.
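In code that indexing step could look roughly like the following.
Completely untested, and the names are made up for illustration: `data`
stands for the (memory-mapped) stream contents, `avgLen` and `tol` for
the estimate taken from the first ~10 pages.

import std.string : indexOf;

size_t[] indexLines(const(char)[] data, size_t avgLen, size_t tol)
{
    size_t[] starts;            // offset of the first character of every line
    size_t pos = 0;
    while (pos < data.length)
    {
        starts ~= pos;

        // Look for the EOL in a window around pos + avgLen first, i.e. the
        // "slice from 90 to 110" above, instead of scanning from pos.
        size_t lo = pos + (avgLen > tol ? avgLen - tol : 0);
        size_t hi = pos + avgLen + tol;
        if (lo > data.length) lo = data.length;
        if (hi > data.length) hi = data.length;

        ptrdiff_t eol = indexOf(data[lo .. hi], '\n');
        if (eol >= 0)
            eol += cast(ptrdiff_t)(lo - pos);    // window hit: make it relative to pos
        else
            eol = indexOf(data[pos .. $], '\n'); // window missed: plain scan
        // NB: a line much shorter than avgLen - tol ends *before* the window,
        // so the window can hit the next line's EOL -- that is the weak spot
        // of this heuristic.

        pos = (eol < 0) ? data.length : pos + eol + 1;
    }
    return starts;
}

After that, Jason's second pass is just a shuffle of `starts`
(Fisher-Yates or similar) followed by writing out the corresponding
slices.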
Further: during the indexing process we can also adjust our first
(base) normal distribution, as well as our +- tolerance.
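For that adjustment a running mean/variance (Welford's method) would
do. Again only an untested sketch, all names made up:

// Running estimate of the line length distribution (Welford's method).
struct LengthEstimate
{
    double mean = 0;     // current average line length
    double m2   = 0;     // sum of squared deviations from the mean
    size_t count = 0;    // number of lines seen so far

    // call this for every line whose length we learn while indexing
    void add(size_t lineLen)
    {
        ++count;
        immutable delta = lineLen - mean;
        mean += delta / count;
        m2 += delta * (lineLen - mean);
    }

    // e.g. use two standard deviations as the +- tolerance of the window
    double tolerance() const
    {
        import std.math : sqrt;
        return count > 1 ? 2 * sqrt(m2 / (count - 1)) : mean / 10;
    }
}

Feed it the lines of the first ~10 pages to get the base estimate, then
keep calling add() for every further line found while indexing;
tolerance() then gives the +- size of the search window.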
Heck, it is late... I hope this is not complete bullshi*
Bjoern