Multithreaded file IO?

Jerry Quinn jlquinn at optonline.net
Sat Sep 24 23:26:18 PDT 2011


Jonathan M Davis Wrote:

> On Saturday, September 24, 2011 01:05:52 Jerry Quinn wrote:
> > Jonathan M Davis Wrote:
> > > On Friday, September 23, 2011 23:01:17 Jerry Quinn wrote:
> > > 
> > > A direct rewrite would involve using shared and synchronized (either on
> > > the class or a synchronized block around the code that you want to
> > > lock). However, the more idiomatic way to do it would be to use
> > > std.concurrency and have the threads pass messages to each other using
> > > send and receive.
> > 
> > I'm trying the direct rewrite but having problems with shared and
> > synchronized.
> > 
> > class queue {
> >   File file;
> >   this(string infile) {
> >     file.open(infile);
> >   }
> >   synchronized void put(string s) {
> >     file.writeln(s);
> >   }
> > }
> > 
> > queue.d(10): Error: template std.stdio.File.writeln(S...) does not match any function template declaration
> > queue.d(10): Error: template std.stdio.File.writeln(S...) cannot deduce template function from argument types !()(string)
> > 
> > Remove the synchronized and it compiles fine with 2.055.
> 
> Technically, synchronized should go on the _class_ and not the function, but 
> I'm not sure if dmd is currently correctly implemented in that respect (since 
> if it is, it should actually be an error to stick synchronized on a function). 
> Regardless of that, however, unless you use shared (I don't know if you are), 
> each instance of queue is going to be local to its own thread. So, no mutexes 
> will be necessary, but you won't be able to have multiple threads writing to 
> the same File. It could get messy.

I get similar errors if I put synchronized on the class, or mark it shared.  My best guess is that a File struct cannot currently (2.055) be shared or accessed in a synchronized context.

If I make the class synchronized and use __gshared on the File then it compiles.  I don't know if it works, though.
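
Concretely, the version that compiles looks like this (untested beyond compiling; note that a __gshared member is effectively a static, per-class variable, so every instance would share the same File):

```d
import std.stdio;

synchronized class Queue
{
    // __gshared bypasses the shared type system entirely; the
    // synchronized methods are what actually serialize access.
    // Caveat: a __gshared member has one global copy per class,
    // not one per instance, so all Queues share this File.
    private __gshared File file;

    this(string outfile)
    {
        file = File(outfile, "w");
    }

    void put(string s)
    {
        file.writeln(s);
    }
}
```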


> You could use a synchronized block instead of synchronizing the class
> 
> void put(string s)
> {
>     synchronized(this)
>     {
>         file.writeln(s);
>     }
> }
> 
> and see if that works. But you still need the variable to be shared.

I'll try that.


> > > So, what you'd probably do is spawn 3 threads from the main thread. One
> > > would read the file and send the data to another thread. That second
> > > thread would process the data, then it would send it to the third
> > > thread, which would write it to disk.
> > 
> > I think that would become messy when you have multiple processing threads. 
> > The reader and writer would have to handshake with all the processors.
> 
> The reader thread would be free to read at whatever rate it can. And the 
> processing thread would be free to process at whatever rate it can. The 
> writer thread would be free to write at whatever rate it can. I/O contention 
> would be reduced (especially if the files being read and written are on 
> separate drives), since the reading and writing threads would be separate. I 
> don't see why it would be particularly messy. It's essentially what TDPL 
> suggests a file copy function should do, except that there's a thread in the 
> middle which does some processing of the data before sending it to the thread 
> doing the writing. Really, this is a classic example of the sort of situation 
> that std.concurrency's message passing is intended for.

If you only have one processing thread, this is true.  However, if you have, say, 16 processing threads, the code becomes messy.  The reader has to handshake with each processor and select which processor to send the next chunk of input data to.  The same dance has to happen when the processors are ready to write, since I want all the output ordered correctly.  Broadcast (which isn't present in std.concurrency) wouldn't work anyway, because only one processor should get each input chunk.

The problem is really a single-writer, multiple-reader problem.
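
Concretely, the kind of plumbing that ordering requires looks something like this (a sketch only, assuming std.concurrency's send/receive; the Stop sentinel, process function, and file name are made up): the reader tags each chunk with a sequence number and round-robins chunks to the workers (their mailboxes buffer, so no handshake is needed), and a single writer thread reorders by sequence number before writing:

```d
import std.concurrency;
import std.stdio;

struct Stop {}  // hypothetical shutdown sentinel

// Hypothetical per-line processing step.
string process(string line) { return ">> " ~ line; }

void workerFn(Tid writerTid)
{
    // Each worker drains its own mailbox at its own pace; no
    // handshaking with the reader is needed.
    for (bool running = true; running;)
        receive(
            (size_t seq, string line) { writerTid.send(seq, process(line)); },
            (Stop s) { running = false; }
        );
}

void writerFn(string outfile, size_t total)
{
    auto f = File(outfile, "w");
    string[size_t] pending;  // results that arrived out of order
    size_t next = 0;         // next sequence number to write
    while (next < total)
    {
        auto msg = receiveOnly!(size_t, string)();
        pending[msg[0]] = msg[1];
        // Flush whatever run of results is now in order.
        while (true)
        {
            auto p = next in pending;
            if (p is null) break;
            f.writeln(*p);
            pending.remove(next);
            ++next;
        }
    }
}

void main()
{
    auto lines = ["a", "b", "c", "d", "e", "f"];
    auto writerTid = spawn(&writerFn, "out.txt", lines.length);
    Tid[] workers;
    foreach (i; 0 .. 3)
        workers ~= spawn(&workerFn, writerTid);
    // Round-robin distribution: only one worker sees each chunk.
    foreach (seq, line; lines)
        workers[seq % workers.length].send(seq, line);
    foreach (w; workers)
        w.send(Stop());
}
```

It's still more machinery than a shared queue, but the reader never blocks on a particular worker, and the output file comes out in input order.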

> > std.parallelism actually looks the closest to what I want.  Not sure if I
> > can make it work easily though.
> 
> For std.parallelism to work, each iteration of a parallel foreach _must_ be 
> _completely_ separate from the others. They can't access the same data. So, 
> for instance, they could access separate elements in an array, but they can't 
> ever access the same element (unless none of them write to it), and something 
> like a file is not going to work at all, since then you'd be trying to write to 
> the file from multiple threads at once with no synchronization whatsoever. 
> std.parallelism is for when you do the same thing many times _in parallel_ 
> with each other, and your use case does not sound like that at all. I really 
> think that std.concurrency is what you should be using if you don't want to do 
> it the C/C++ way and use shared.

Well, I would have used the C++-style shared approach if I could get it to compile :-)

I actually am doing the same thing many times in parallel.  I have a single input file, but each line is processed independently by a worker thread.  The results are output to a single file in order.  Is map() what I want, perhaps?

With both parallel foreach and parallel map, it wasn't clear whether all processors need to finish a work item before more work items get handed out.  The issue is that each processor may take a different amount of time, and I want to keep all processors busy continuously.  My reading of the std.parallelism docs suggested that a set of inputs is passed to the processors and all must finish before the next set is handled.
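
For reference, this is the kind of usage I'm imagining, if taskPool.map works the way I hope (work units handed out dynamically as threads free up, results yielded in input order); the process function is made up:

```d
import std.parallelism;
import std.stdio;

// Hypothetical processing step with uneven cost per item.
int process(int x) { return x * x; }

void main()
{
    auto inputs = [1, 2, 3, 4, 5, 6, 7, 8];
    // bufSize = 4 results buffered ahead, workUnitSize = 1:
    // each thread grabs one item at a time, so a slow item only
    // delays itself, and results still come back in input order.
    foreach (r; taskPool.map!process(inputs, 4, 1))
        writeln(r);
}
```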

Thanks,
Jerry



More information about the Digitalmars-d-learn mailing list