howto count lines - fast

Wed May 31 12:57:15 PDT 2017

On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:
> On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via 
> Digitalmars-d-learn wrote:
>> After reading 
>> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was wondering how fast one can do a simple "wc -l" in D.
>> 
> size_t lineCount3(string filename)
> {
>     import std.mmfile;
>
>     auto f = new MmFile(filename);
>     auto data = cast(ubyte[]) f[];
>     size_t c;
>
>     foreach (i; 0 .. data.length)
>     {
>         if (data[i] == '\n') c++;
>     }
>     return c;
> }
>
> // real    0m0.242s
> // user    0m1.151s
> // sys     0m0.057s

You should try something more like
     auto data = cast(ulong[]) f[];

      foreach (i; 0 .. data.length/ulong.sizeof)

and then using bitfiedling to count the number of \n in the 
loaded ulong. This divides by 8 the number of load instructions. 
The counting of \n in the loaded word then only uses registers. 
It is also possible to use bit fiddling to detect and count the 
characters in that ulong. I don't know if it is really faster then

Here a function to detect if a given character is in an ulong

     auto detect(alias CHAR)(ulong t)
     {
       enum ulong u = CHAR;
       enum mask1 = u|(u<<8)|(u<<16)|(u<<24UL)|(u<<32UL);
       enum mask = (mask1<<32)|mask1;
       return ((t^mask) - 0x0101010101010101UL) & ~(t^mask) & 
0x8080808080808080UL;
     }

The returned value is 0 if the character is not in t. And the 
highest bit of each byte is set if it contained the character. If 
the CPU has a fast popcnt it should be easy to count.