D is for Data Science

Mon Nov 24 14:25:27 PST 2014

25-Nov-2014 00:34, weaselcat wrote:
> On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
>> Just browsing reddit and found this article posted about D.
>> Written by Andrew Pascoe of AdRoll.
>>
>> From the article:
>> "The D programming language has quickly become our language of choice
>> on the Data Science team for any task that requires efficiency, and is
>> now the keystone language for our critical infrastructure. Why?
>> Because D has a lot to offer."
>>
>> Article:
>> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
>>

Quoting the article:

> One of the best things we can do is minimize the amount of memory
> we’re allocating; we allocate a new char[] every time we read a line.

This is wrong. byLine reuses its buffer if it's mutable, which is the
case with char[]. I recommend that authors always double-check a
hypothesis before stating it in an article, especially one about
performance.

Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
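
To make the buffer-reuse point concrete, here is a minimal sketch. The
helper names (countChars, readLinesCopied) and the temp file name are
made up for illustration and are not from the article. The first loop
only looks at each line and needs no per-line copy; the second keeps
the lines, so it has to copy them with .idup, because the slice byLine
hands out is only valid until the next iteration.

```d
import std.path : buildPath;
import std.stdio : File, writeln;
static import std.file;

/// Sums line lengths without keeping any line around.
/// No per-line copy is made here: `line` is a slice of byLine's buffer.
size_t countChars(string path)
{
    size_t total;
    foreach (line; File(path).byLine())
        total += line.length;
    return total;
}

/// Keeps every line, so each one must be copied out of the shared buffer.
string[] readLinesCopied(string path)
{
    string[] lines;
    foreach (line; File(path).byLine())
        lines ~= line.idup; // copy: the buffer may be overwritten next iteration
    return lines;
}

void main()
{
    // Hypothetical scratch file, used only for this demonstration.
    auto path = buildPath(std.file.tempDir(), "byline_demo.txt");
    std.file.write(path, "alpha\nbeta\ngamma\n");
    scope (exit) std.file.remove(path);

    writeln(countChars(path));
    writeln(readLinesCopied(path));
}
```

Storing `line` itself instead of `line.idup` would leave several array
entries pointing into the same reused buffer, which is what the warning
linked above is about.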

>> Reddit:
>> http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
>>
>
> Why is File.byLine so slow?

This seems to have been mostly fixed some time ago. It's slower than
straight fgets, but it's not that bad.

Also, a nearly optimal solution using C's fgets with a growable buffer
is way simpler than the code outlined in the article. Or we can mmap
the file too.
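
A rough sketch of what I mean by fgets with a single growable buffer
follows. The function name eachLine, the starting size of 16 bytes and
the temp file are my own choices for illustration. It assumes lines
contain no embedded NUL bytes, since strlen is used to find how much
fgets read.

```d
import core.stdc.stdio : FILE, fclose, fgets, fopen;
import core.stdc.string : strlen;
import std.path : buildPath;
import std.stdio : writeln;
import std.string : toStringz;
static import std.file;

/// Calls `sink` once per line (terminator stripped). The slice passed to
/// `sink` points into one shared buffer and is valid only during the call.
void eachLine(string path, scope void delegate(const(char)[]) sink)
{
    FILE* fp = fopen(path.toStringz, "r");
    if (fp is null)
        throw new Exception("cannot open " ~ path);
    scope (exit) fclose(fp);

    auto buf = new char[](16); // deliberately small so growth is exercised
    size_t used = 0;

    while (fgets(buf.ptr + used, cast(int)(buf.length - used), fp) !is null)
    {
        used += strlen(buf.ptr + used);

        // Buffer full and no newline yet: double it and keep reading the
        // same line. Doubling amortizes the cost of growing.
        if (used + 1 == buf.length && buf[used - 1] != '\n')
        {
            buf.length *= 2;
            continue;
        }

        auto line = buf[0 .. used];
        if (line.length && line[$ - 1] == '\n')
            line = line[0 .. $ - 1];
        sink(line);
        used = 0;
    }

    // A final unterminated line that exactly filled the buffer.
    if (used > 0)
        sink(buf[0 .. used]);
}

void main()
{
    // Hypothetical scratch file, used only for this demonstration.
    auto path = buildPath(std.file.tempDir(), "fgets_demo.txt");
    std.file.write(path, "short\na line that is longer than sixteen bytes\n");
    scope (exit) std.file.remove(path);

    eachLine(path, (const(char)[] line) { writeln(line.length, ": ", line); });
}
```

One buffer, grown on demand, is all the state there is to maintain.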

> Having to work around the standard library
> defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split,
notably the GC allocation per line. It may also trigger a GC collection
after splitting many lines, maybe even many collections.

The easy way out is to use the standard _splitter_, which is lazy and
non-allocating. That is a _3-letter_ change, and it still uses a nice,
clean standard function.
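
For illustration, a small sketch of the difference. The tab-separated
sample line and the helper names are assumptions of mine; I am not
reproducing the article's exact loop. The eager version builds an array
of fields on the GC heap for every line, while the lazy one just walks
slices of the original line.

```d
import std.algorithm : splitter;
import std.array : split;
import std.stdio : writeln;

/// Eager: std.array.split allocates a fresh array of slices per call.
size_t countFieldsEager(const(char)[] line)
{
    auto fields = line.split('\t');
    return fields.length;
}

/// Lazy: splitter yields one slice of `line` at a time, no array is built.
size_t countFieldsLazy(const(char)[] line)
{
    size_t count;
    foreach (field; line.splitter('\t'))
        ++count;
    return count;
}

void main()
{
    // Hypothetical sample record.
    const(char)[] line = "2014-11-24\tclick\t42";

    writeln(countFieldsEager(line));
    writeln(countFieldsLazy(line));
}
```

Both functions see the same fields; only the lazy one avoids the
per-line allocation.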

The article was really disappointing for me because I expected to see
the single-line change outlined above fix 80% of the problem elegantly.
Instead I observe 100+ spooky lines that needlessly maintain 3 buffers
at the same time (how scientific) instead of growing a single one to
amortize the cost. And then a claim that it's nice to be able to improve
speed so easily.

-- 
Dmitry Olshansky