My experience making an on-disk merge sort using ranges

Tue Feb 26 17:47:11 PST 2013

Greetings everyone,

First, let me apologize that this is quite a long post. I hope 
that it doesn't scare you off.

I wasn't exactly sure where to put this because I have the 
intention of learning more about D and its overall design as well 
as to let others know about some various difficulties I've been 
experiencing while using it for various purposes and point out 
potential fixes. As such I was torn between the D.Learn forum and 
the vastly more observed D forum. I hope you don't mind my 
eventual decision on the matter.

Let me say that I was inspired to try out the "Component 
Programming" aspect of D using its ranges because of the 
excellently done recent talk by Walter. As such, I'm mostly 
interested in perfecting this particular approach. Obviously, I 
have no doubt that I could program this in other ways and have no 
issues whatsoever.

--

So, with that said, let's get to my point:

My goal is to write an on-disk merge sort (nearly) exclusively 
using ranges. The part of the problem I'm highlighting in this 
post is the process of merging multiple sorted files together 
into one file (in this case, stdout). I claim that this code 
should be acceptable:

---

import std.stdio, std.file, std.algorithm, std.conv;

void main() {
	dirEntries("data", "*.out", SpanMode.shallow)
		.map!(e => File(e.name).byLine(KeepTerminator.yes)
			.map!(l => l.to!string())()
		)()
		.array()
		.nWayUnion()
		.copy(stdout.lockingTextWriter);
}

---

I was certainly surprised to find out that nWayUnion existed in 
std.algorithm which essentially does the work for me. However, I 
was more surprised to find out that this does not compile. So, 
who here thinks this _shouldn't_ work? Look at it pretty closely 
and, without investigating the error, justify to yourself why 
this shouldn't work before you continue on. I will explain how I 
perceive the problem.

I thought I was already way ahead of the game by using to!string 
on each line in byLine (because I'm aware of its oddities because 
it gives you the actual buffer it writes to). Honestly, I was a 
bit shocked to discover that such a simple approach won't work 
because moveFront doesn't work with MapResult. Keep in mind, 
there are two maps here, but the MapResult that causes a problem 
is the *inner* one (the one which is applied on each line of the 
file).

---

Okay, so with that out of the way, I'd like to know what the 
idiomatic way to solve this particular problem is. Here's what I 
did (I don't claim the ranges I created are robustly implemented):

https://gist.github.com/Zshazz/c488e70eee4fd352b789

The first thing I figured out was that if you turned the 
MapResult into an array (using .array()), then it would work as 
expected. However, this is obviously not an acceptable solution 
to my problem because I'm doing an on-disk merge sort to sort 
things that wouldn't work in RAM. So, finding this out, I 
realized that the (likely) problem with MapResult is that it's a 
value type and that prevents moveFront from being applicable to 
it for some reason (I'm unknowledgable to that reason 
unfortunately).

My second solution was to write a wrapper for it that turns it 
into a reference type (just made a quick class that forwards 
calls to MapResult which it holds a copy of) to test my theory. 
Sure enough, my wrapper worked as expected. However, I compared 
it to another program and found it to be much slower, so I 
assumed it might be because of the forwarding mechanism slowing 
it down.

My third solution was to write a reference-based map. I couldn't 
make a value type of map that would work because I didn't know 
what prevented it from playing nicely with moveFront. This 
approach was much faster and actually managed to match my 
hand-written merge code. I was pretty satisfied with that 
approach, but it bothered me that I was essentially duplicating 
the functionality provided by std.algorithm.

My final solution was born out of me spending more time looking 
at the reason why my refMap was faster. I discovered that when I 
implemented refMap, I cached the result of applying the function 
on front (which differs from the map implementation in 
std.algorithm). So, I decided to rewrite my original wrapper so 
that it caches the result. This provided me the performance 
benefits of my refMap, but I didn't feel guilty about duplicating 
existing functionality. So, my caching range is my favorite 
solution thus far. Still, I would have much rather have had my 
original attempt work and a caching range be invented soley to 
improve performance, not to just get it to work at all.

---

With all of that in mind, my question is this: Why doesn't 
MapResult work with moveFront? Also, if it cannot be made to work 
with moveFront as a value type, would it be a good idea to turn 
it into a reference type? Is there any way to make it transparent 
to the user so that they don't have to do this sort of 
investigation to solve what should be a fairly simple problem 
(merging multiple ranges together, in this case)?

Thank you for taking the time to read this,
Take care.