Reading a structured binary file?

monarch_dodra monarchdodra at gmail.com
Sat Aug 3 14:29:01 PDT 2013


On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
> On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
> [...]
>> FWIW, I have to deal with big data files that can be a few GB. For
>> some data analysis software I wrote in C a while back, I did some
>> testing with caching and such. It turns out that on Win7-64 the
>> automatic caching done by the OS is really good, and any attempt to
>> speed things up actually slowed things down. No kidding: I have seen
>> more than 2GB of data being automatically cached. Of course, the
>> system RAM must be larger than the file size (by a factor of roughly
>> 2, if I remember my tests correctly, though the relationship may not
>> be linear; I did not actually change the RAM, just the size of the
>> data file), and the OS will keep the file cached only as long as no
>> concurrent applications require the RAM or the cache. I guess my
>> point is: if your target is Win7 and your files are more than 5x
>> smaller than the installed RAM, I would not bother trying to
>> optimize file access at all. I suppose *nix machines do a similarly
>> good job these days.
> [...]
>
> IIRC, Linux has been caching files (or disk blocks, rather) in memory
> since the days of Win95. Of course, memory in those days was much
> scarcer, but file sizes were smaller too. :) There's still a cost to
> copying the kernel buffers into userspace, though, which should not
> be disregarded. But if you use mmap, then you're essentially
> accessing that memory cache directly, which is as good as it gets.
>
> I don't know how well mmap works on Windows, though; IIRC it doesn't
> have the same semantics as POSIX, so you could accidentally run into
> performance issues by using it the wrong way on Windows.
>
>
> T
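
For reference, a minimal sketch of what T describes, done in D with
std.mmfile (the file name is just a placeholder):

import std.mmfile;
import std.stdio;

void main()
{
    // Map the whole file; reads go straight through the OS page cache,
    // with no extra copy into a userspace buffer.
    auto mmf = new MmFile("data.bin");
    auto data = cast(const(ubyte)[]) mmf[];
    writeln("mapped ", data.length, " bytes");
    // ... parse data in place ...
}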

I did some benchmarking a while back with user bioinfornatics. He had 
to do some pretty large file reads, preferably in very little time. 
Observations showed my algorithm was *much* faster under Windows than 
under Linux.

What we observed is that under Windows, as soon as you open a file 
for reading, the OS starts buffering the file in a parallel thread.

What we did was create two threads: the first did nothing but read 
the file into chunks of memory and pass them on to a worker thread; 
the worker thread did the parsing proper.

Doing this *halved* the Linux runtime, bringing it level with the 
"monothreaded" Windows runtime. Windows saw no change.

FYI, the full thread is here:
forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx@forum.dlang.org

