Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things

Russel Winder via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sat Dec 27 05:39:47 PST 2014


On Sat, 2014-12-27 at 06:21 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]
> I don't believe I agree that we need a perfect multi-dimensional 
> rectangular array library to serve as a backend before thinking 
> and doing much on data frames (although it will certainly be very 
> useful when ready).

Also, if there is a ready-made C or C++ library that can be made use
of, use it.

> First, it seems we do have matrices, even if lacking in complete 
> functionality for linear algebra, and the like.  There is a 
> chicken and egg aspect in the development of tools - it is rarely 
> the case that one kind of tool necessarily totally precedes 
> another, and there are often complementarities and dynamic effects 
> between different stages.  If one waits till one has everything one 
> needs done for one, one won't get much done.

In the end there is no point in a language/compiler/editor if it lacks
perceived support for the things that large numbers of people want to
do. C, C++, C#, F#, Java, Scala, Groovy, Python, R, Julia, Go,
all find themselves with a vocal audience doing things. The language
evolves with the libraries and "end user" applications. In the end it is
all about people doing things with a language and hyping it up.

> Secondly, much of the kind of thing Pandas is useful for is not 
> exactly rocket science from a quantitative perspective, but it's 
> just the kinds of thing that is very useful if you are thinking 
> about working with data sets of a decent size. The concepts seem 
> to me to fit very well with std.algorithm and std.range, and can 
> be thought of as just a way to bring out the power of the tools 
> we already have when working with data in the world as it is.  
> See here for an example of just how simple.  Remember Excel 
> pivottables?
> 
> http://pandas.pydata.org/pandas-docs/stable/groupby.html
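
To make concrete just how simple: a minimal pandas "group by" sketch in
the spirit of that page (the column names here are invented, purely for
illustration):

```python
# Split-apply-combine, pivot-table style: group rows by a key column,
# then aggregate each group. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy"],
    "return": [0.05, 0.03, -0.01, 0.02],
})

# Mean return per sector, in one line.
mean_by_sector = df.groupby("sector")["return"].mean()
```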

I recently discovered that a number of hedge funds trade solely on
moving-average-based algorithms. NumPy, SciPy and Pandas all have
variations on this basic algorithm.
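
For the record, the core of such a strategy is just a rolling mean; a
sketch in plain NumPy (window length and prices invented):

```python
# Simple moving average via convolution: each output element is the
# mean of `window` consecutive prices. The data is made up.
import numpy as np

prices = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
window = 3

sma = np.convolve(prices, np.ones(window) / window, mode="valid")
```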

Isn't "group by" standard in all languages anyway? Certainly Python,
Groovy, Scala, Haskell, … all have it.
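
In Python, for instance, it is in the standard library (with the caveat
that itertools.groupby only groups consecutive keys, so you sort first):

```python
# Standard-library "group by": sort, then group consecutive equal keys.
from itertools import groupby

words = ["banana", "apple", "cherry", "avocado", "cranberry"]
by_initial = {k: list(g)
              for k, g in groupby(sorted(words), key=lambda w: w[0])}
```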

> Thirdly, one of the reasons Pandas is popular is because it is 
> written in C/Cython and very fast.  It's significantly faster 
> than Julia.  One might hit roadblocks down the line when it comes 
> to the Global Interpreter Lock and difficulty of processing 
> larger sets quickly in Python, but at least this stage is fast 
> and easy.  So people do care about speed, but they also care 
> about the frictions being taken away, so that they can spend 
> their energies on addressing the problem at hand.  I.e. a dataframe 
> will be helpful, in my view.

Perceived to be fast, yes. In fact it is nothing like as fast as it
should be: NumPy, which underpins Pandas and provides all the data
structures and basic algorithms, is actually quite slow.
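
One way to see this for yourself: on small arrays NumPy's per-call
dispatch overhead dominates. A rough measurement harness (no numbers
claimed here; they vary by machine):

```python
# Compare summing a tiny array via NumPy against plain Python.
# Timings are machine-dependent, so nothing is asserted about which
# wins; the point is that the overhead is measurable at all.
import timeit
import numpy as np

small = np.arange(10)

t_numpy = timeit.timeit(lambda: small.sum(), number=10_000)
t_python = timeit.timeit(lambda: sum(range(10)), number=10_000)
```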

I have ranted many times about the GIL in Python, and on two occasions
spent 2 or 3 hours trying to convince Guido of the lunacy of a GIL-based
interpreter in 2014. Armin Rigo has an STM-based version in PyPy and
CPython and has shown it can work just fine. Guido though is I/O bound
rather than CPU bound in his work and doesn't see a need for anything
other than multiprocessing for accessing parallelism in Python. Sadly,
it can be shown that multiprocessing is slow and inefficient at what it
does and it needs replacing.
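
The CPU-bound case shows why this matters: CPython threads run the
interpreter one at a time. A minimal sketch (the work function is
invented):

```python
# CPython's GIL serialises CPU-bound threads: two threads running
# `busy` contend for the interpreter lock rather than using two cores,
# which is why multiprocessing is the usual escape hatch.
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # Pure-Python, CPU-bound work.
    total = 0
    for i in range(n):
        total += i * i
    return total

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(busy, [100_000, 100_000]))
```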

NumPy's approach to parallelism is nice as an abstraction, but doesn't
really "cut it" unless you know no better.

In principle this is fertile territory for a new language to take the
stage. Hence Julia. I fear D has missed the boat on this opportunity
now. On the other hand if some real data science people begin to do data
science with D and show that more can be done with less, and without
loss of functionality, then there is an opportunity for marketing and
possible traction in the market.  

> Processing of log data is a growing domain - partly from 
> internet, but also from the internet of things.  See below for 
> one company using D to process logs:
> 
> http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
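
The kernel of this sort of log processing is small; a toy version of
the filter-and-aggregate-over-log-lines pattern (the log format here is
hypothetical):

```python
# Count events per key from line-oriented logs: the same
# split/filter/aggregate pattern, just at a vastly smaller scale
# than the terabytes described in the articles.
from collections import Counter

log_lines = [
    "2014-11-17 click ad42",
    "2014-11-17 view  ad42",
    "2014-11-17 click ad17",
    "2014-11-18 click ad42",
]

clicks = Counter(line.split()[2]
                 for line in log_lines
                 if line.split()[1] == "click")
```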

This is worth hyping up: it should be front and centre on the dlang
pages, along with Facebook funding bug fixes. Having the tweets list is
great but too ephemeral; the "D is for Data Science" tweet will fade too
quickly.

> A poster on this forum is already using D as a library to call 
> from R (from Reddit), which brings home the point that it isn't 
> necessary for D to be able to do every part of the process for it 
> to be able to take over some of the heavy work.

Funny, isn't it, how every language must do everything: every new
language has to have a new build system and a new event loop. The
problem, though, is that C is the language of extension for R, Python,…
even though it is a language that should now only be used for working
"right on the metal", if at all.
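
That extension pattern in miniature, from the Python side (assuming a
Unix-like system where ctypes can find libm):

```python
# Call a function from an existing C shared library (libm) via ctypes,
# the no-compilation-needed route to C extension code.
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(2.0)
```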

> "[–]bachmeier 6 points 1 month ago
> 
> I call D shared libraries from R. I've put together a library 
> that offers similar functionality to Rcpp. I've got a 
> presentation showing its use on Linux. Both the presentation and 
> library code should be made available within the next couple of 
> days.
> 
> My library makes available the R API and anything in Gretl. You 
> can allocate and manipulate R objects in D, add R assert 
> statements in your D code, and so on. What I'm working on now is 
> calling into GSL for optimization.
> 
> These are all mature libraries - my code is just an interface. 
> It's generally easy to call any C library from D, and modern 
> Fortran, which provides C interoperability, is not too much 
> harder.
> "

But if all the libraries are C, C++ and Fortran, is there any value-add
role for D?

Lots of C++ systems embed Python or Lua for dynamic scripting
capability, and lots of Python and R systems call out to C. This is a
well-established milieu. Is there a good way for D to establish a
permanent foothold in it in an evolutionary way? Certainly it cannot be
a revolutionary one.

[…]

The Splunk stuff is just an example of using dataflow networks, rather
than SQL, for processing data. The "Big Data using JVM" community is
already on this road, cf. the various proprietary frameworks running
over Hadoop and Spark.

Dataflow frameworks are likely to be the next big thing. Java and
Groovy have established offerings; apart from Go, no other language
really does. If D could get a really good dataflow framework before
C++, Rust, etc., then that might be a route to traction.
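
For anyone unfamiliar with the idea: a dataflow network is a set of
independent stages connected by streams. In miniature, with Python
generators standing in for the concurrent processes a real framework
would use:

```python
# A three-stage pipeline: source -> transform -> filter. Each stage
# consumes the previous stage's stream; a real dataflow framework
# would run the stages concurrently over channels.
def source(n):
    yield from range(n)

def square(stream):
    for x in stream:
        yield x * x

def keep_even(stream):
    for x in stream:
        if x % 2 == 0:
            yield x

result = list(keep_even(square(source(6))))
```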

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder at ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel at winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder