shuffling lines in a stream

Russell Lewis webmaster at villagersonline.com
Mon Oct 13 12:02:01 PDT 2008


I've read most of the responses (not all) and haven't yet seen a
duplicate of my algorithm, so here it is.  You can debate whether the
cost of opening many little files is as bad as lseek or not...I don't know.

BEGIN CODE

void main()
{
  CreateAnEmptyTmpDirectoryAndChdirToIt();

  while(!feof(stdin))
  {
    char[] name = Random8CharacterFilename();

    if(!file_exists(name))
      WriteToFile(name, ReadALine(stdin));
    else
    {
      if(CoinFlip())
        WriteToFile("junk", ReadALine(stdin));
      else
      {
        rename(name, "junk");
        WriteToFile(name, ReadALine(stdin));
      }

      // shuffle "junk" to some non-duplicate name.  chains of
      // renames are allowed so as to prevent any biasing of
      // the odds towards lines earlier in the file.
      name = Random8CharacterFilename();
      while(file_exists(name))
      {
        if(CoinFlip())
        {
          rename(name, "tmp");
          rename("junk", name);
          rename("tmp", "junk")
        }
        name = Random8CharacterFilename();
      }
    }
  }

  char[][] names = GetSortedFilenames();
  foreach(name; names)
  {
    PrintFileContents(name);
    unlink(name);
  }

  DestroyTmpDir();
}

END CODE

Andrei Alexandrescu wrote:
> The high traffic enjoyed by "random k-sample of a file" confirms that 
> plenty of people around here have idle cycles ready to be spent on fun 
> tasks.
> 
> In that vein, here's another problem. Say you want to randomly shuffle 
> lines in a stream. You don't know the size beforehand, and one pass is 
> preferable. The stream can be larger than the working memory, but not an 
> unbounded amount of times larger. Also lines in the stream can be 
> unbounded, but it is guaranteed that at least a nonzero fraction of the 
> file fits in working memory.
> 
> You can use temporary files, but of course minimizing their total size 
> is recommended. Speed is of course important. Draft your solution and 
> explain why you think it's correct and advantageous.
> 
> In case you hate random stuff, just wait; the next problem won't have 
> any random element in it :o).
> 
> 
> Andrei



More information about the Digitalmars-d mailing list