Making byLine faster: we should be able to delegate this

rumbu via Digitalmars-d digitalmars-d at puremagic.com
Mon Mar 23 14:13:16 PDT 2015


On Monday, 23 March 2015 at 19:25:08 UTC, Tobias Pankrath wrote:
>> I made the same test in C# using a 30MB plain ASCII text file. 
>> Compared to fastest method proposed by Andrei, results are not 
>> the best:
>>
>> D:
>> readText.representation.count!(c => c == '\n') - 428 ms
>> byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms
>>
>> C#:
>> File.ReadAllLines.Length - 216 ms;
>>
>> Win64, D 2.066.1, Optimizations were turned on in both cases.
>>
>> The .net code is clearly not performance oriented 
>> (http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26), 
>> I suspect that .net runtime is performing some optimizations 
>> under the hood.
>
> Does the C# version validate the input? Using std.file.read 
> instead of readText.representation halves the runtime on my 
> machine.

Source code is available at the link above. Since the C# version 
works internally with streams and UTF-16 chars, the pseudocode 
looks like this:

---
initialize a LIST with 16 items;
while (!eof)
{
   read 4096 bytes in a buffer;
   decode them to UTF-16 in a wchar[] buffer
   while (moredata in the buffer)
   {
     read from buffer until (\n or \r\n or \r);
     discard end of line;
     if (nomorespace in LIST)
        double its size.
     add the line to LIST.
   }
}
return number of items in the LIST.
---
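
For comparison, here is a rough D sketch of the same steps. It is not a translation of the actual .net source: the name countLinesViaList is made up for illustration, the input is assumed to be plain ASCII (so "decoding" is just widening each byte to a wchar), and the data is taken from memory instead of a stream so that the example stays self-contained.

```d
import std.algorithm : min;

/// Hypothetical helper mirroring the pseudocode above: walk the input in
/// fixed-size chunks, widen each byte to UTF-16, split on \n, \r\n or \r,
/// collect every line in a growing list and return the list length.
/// Assumption: the input is plain ASCII.
size_t countLinesViaList(const(char)[] text, size_t chunkSize = 4096)
{
    wstring[] lines;
    lines.reserve(16);          // start with room for 16 items
    wchar[] current;            // the line being built
    bool pendingCR = false;     // previous character was '\r'

    for (size_t pos = 0; pos < text.length; pos += chunkSize)
    {
        // "read 4096 bytes in a buffer"
        auto chunk = text[pos .. min(pos + chunkSize, text.length)];

        foreach (char b; chunk)
        {
            wchar c = b;        // ASCII only: widening stands in for decoding

            if (pendingCR)
            {
                pendingCR = false;
                if (c == '\n')
                    continue;   // second half of \r\n, line already stored
            }

            if (c == '\r' || c == '\n')
            {
                lines ~= current.idup;  // discard end of line, add the line
                current = null;
                pendingCR = (c == '\r');
            }
            else
            {
                current ~= c;
            }
        }
    }

    // A last line without a terminator still counts as a line.
    if (current.length > 0)
        lines ~= current.idup;

    return lines.length;
}
```

Every line is copied into its own wstring here, which is exactly the kind of work one would not expect from a fast line counter.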

This code is clearly not the best for this task, so I looked into 
the jitted code. As I suspected, the .net runtime seems to be 
smart enough to recognize this pattern, and it does the 
following:
- file is mapped into memory using CreateFileMapping
- does not perform any decoding, since \r and \n are ASCII
- does not create any list
- searches incrementally for \r, \r\n, \n using CompareStringA 
and LOCALE_INVARIANT and increments a counter at each end of line
- there is no temporary memory allocation since searching is 
performed directly on the mapping handle
- returns the count.
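
The same idea can be sketched in D. This only illustrates the steps listed above and says nothing about what the .net runtime really emits: the function names are made up, a plain byte loop stands in for CompareStringA, and std.mmfile.MmFile is used for the mapping.

```d
import std.mmfile : MmFile;

/// Hypothetical helper: counts line terminators (\r, \r\n or \n) directly
/// on the raw bytes. Nothing is decoded and no list of lines is built.
size_t countLineEndings(const(ubyte)[] data)
{
    size_t count = 0;
    for (size_t i = 0; i < data.length; ++i)
    {
        if (data[i] == '\n')
        {
            ++count;
        }
        else if (data[i] == '\r')
        {
            ++count;
            // \r\n is a single terminator, so skip the \n that follows.
            if (i + 1 < data.length && data[i + 1] == '\n')
                ++i;
        }
    }
    return count;
}

/// Hypothetical wrapper: maps the file read-only and scans the mapping
/// in place, so the content is never copied into a temporary buffer.
size_t countLineEndingsMapped(string path)
{
    auto mapped = new MmFile(path);
    scope (exit) destroy(mapped);   // unmap when done
    return countLineEndings(cast(const(ubyte)[]) mapped[]);
}
```

Note that this counts terminators, so a last line without an end of line is not counted, unlike ReadAllLines.Length.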


