Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things

Laeeth Isharc via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Dec 26 22:21:16 PST 2014


On Friday, 26 December 2014 at 21:31:00 UTC, aldanor wrote:
> On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood 
> wrote:
>> I've been playing with the python pandas app, which enables 
>> interactive manipulation of tables of data in their dataframe 
>> structure, which they say is similar to the structures used in 
>> R.
>>
>> It appears pandas has laid claim to being a faster version of 
>> R, but in doing so it is basically limited to what it can 
>> exploit by moving operations back and forth between Python and 
>> the underlying cython code.
>>
>> Has anyone written an example app in D that manipulates 
>> dataframe type structures?
>
> Pandas has numpy as "backend" which does a lot of heavy 
> lifting, so first things first -- imo D needs a fast and 
> flexible blas/lapack-compatible multi-dimensional rectangular 
> array library that could later serve as backend for pandas-like 
> libraries.

I don't believe I agree that we need a perfect multi-dimensional 
rectangular array library to serve as a backend before we think 
about and do much work on data frames (although such a library 
will certainly be very useful when ready).

First, it seems we do have matrices, even if they are lacking 
complete functionality for linear algebra and the like.  There 
is a chicken-and-egg aspect to the development of tools - it is 
rarely the case that one kind of tool totally precedes another, 
and there are often complementarities and dynamic effects 
between the different stages.  If one waits until everything one 
needs has been done for one, one won't get much done.

Secondly, much of what Pandas is useful for is not exactly 
rocket science from a quantitative perspective, but it is 
exactly the kind of thing that is very useful if you are working 
with data sets of a decent size.  The concepts seem to me to fit 
very well with std.algorithm and std.range, and a dataframe can 
be thought of as just a way to bring out the power of the tools 
we already have when working with data in the world as it is.  
See here for an example of just how simple these operations are.  
Remember Excel pivot tables?

http://pandas.pydata.org/pandas-docs/stable/groupby.html
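
To make that concrete, here is a minimal sketch of a pandas-style 
"group by a key, then aggregate" done with nothing but 
std.algorithm and std.range.  The Row struct and the numbers are 
invented for illustration, and it assumes a Phobos recent enough 
to have chunkBy:

import std.algorithm : chunkBy, map, sort, sum;
import std.range : walkLength;
import std.stdio : writefln;

// Hypothetical toy data - ticker/price pairs standing in for a dataframe.
struct Row { string ticker; double price; }

void main()
{
    auto rows = [Row("AAPL", 110.0), Row("MSFT", 47.0),
                 Row("AAPL", 112.5), Row("MSFT", 48.2)];

    // Roughly what pandas' df.groupby("ticker").mean() does:
    // sort on the key, chunk runs of equal keys, aggregate each chunk.
    foreach (group; rows.sort!((a, b) => a.ticker < b.ticker)
                        .chunkBy!((a, b) => a.ticker == b.ticker))
    {
        auto prices = group.map!(r => r.price);
        writefln("%s  mean price %.2f", group.front.ticker,
                 prices.sum / prices.walkLength);
    }
}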

Thirdly, one of the reasons Pandas is popular is that it is 
written in C/Cython and is very fast.  It's significantly faster 
than Julia.  One might hit roadblocks down the line when it 
comes to the Global Interpreter Lock and the difficulty of 
processing larger data sets quickly in Python, but at least this 
stage is fast and easy.  So people do care about speed, but they 
also care about the frictions being taken away, so that they can 
spend their energies on addressing the problem at hand.  I.e. a 
dataframe will be helpful, in my view.

Processing of log data is a growing domain - partly from the 
internet, but also from the internet of things.  See below for 
one company using D to process logs:

http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

A poster on this forum is already calling D as a shared library 
from R (quoted from Reddit below), which brings home the point 
that D doesn't need to be able to do every part of the process 
in order to take over some of the heavy work.

"[–]bachmeier 6 points 1 month ago

I call D shared libraries from R. I've put together a library 
that offers similar functionality to Rcpp. I've got a 
presentation showing its use on Linux. Both the presentation and 
library code should be made available within the next couple of 
days.

My library makes available the R API and anything in Gretl. You 
can allocate and manipulate R objects in D, add R assert 
statements in your D code, and so on. What I'm working on now is 
calling into GSL for optimization.

These are all mature libraries - my code is just an interface. 
It's generally easy to call any C library from D, and modern 
Fortran, which provides C interoperability, is not too much 
harder.
"

See here for just one use case in the internet of things.  They 
don't use D, but maybe they should have.  It shows an example 
where at least the log processing could easily be handled by 
what we already have, plus a few small additional data 
structures - even if people use outside libraries for the 
machine learning part.  (A rough sketch of that kind of log 
handling follows the quoted passage below.)

http://www.forbes.com/sites/danwoods/2014/11/04/how-splunk-caught-wall-streets-eye-by-taming-the-messy-world-of-iot-data/3/

"By using Splunk software, Hrebek said that his division’s leader 
product is able to offer customers a real-time view of operations 
on a train and to use machine learning to suggest optimal 
strategies for driving trains along various routes. Just shaving 
a small percentage off of fuel costs can mean huge savings for a 
railroad.

Why Doesn’t BI Work for the IoT?

In both of the use cases just mentioned, for years, existing 
business intelligence technology had been applied to the problem 
of making sense of the data with little success.

The problem is not that it is impossible to use traditional 
ETL technology and an RDBMS or, more commonly, spreadsheets to 
get something working so that some of the data becomes useful. It 
is just that the effort involved is great and the technical 
effort involved in maintaining such systems is massive. Hrebek 
compared using spreadsheets for IoT data to living in the ninth 
circle of hell in Dante’s Inferno, because the process is so 
tedious and error prone.

Machine data is different from flat files that are the paradigm 
for BI technology, which works in rows and columns. Also, machine 
data can be naturally organized into a time series, but this is 
not the default way that a spreadsheet or an RDBMS works.

Why Does Splunk Work for the IoT?

IoT data essentially looks exactly the same as the machine data 
from servers in a data center that Splunk Enterprise was 
initially created to handle. The software allows you to:

     Automatically parse fields
     Identify several different types of records as a related group
     Organize and store records by timestamp
     Create dashboards and analytics that are updated in real time

With each successive release, Splunk is making the process of 
parsing machine data as automatic and machine assisted as 
possible. Its software handles variations of IoT data by allowing 
a simple mapping of a field into a standard name. For example, 
the GPS coordinates of a train car might be recorded in six or 
seven different ways in various forms of machine data, but can be 
unified via Splunk Enterprise. Splunk software allows these 
mappings to be implemented and maintained with a minimum of 
effort.

The bottom line is that there is no way to avoid the 
imperfections that naturally occur in the real world. We are 
always going to have lots of trees and to have to deal with them 
both as individuals and as a forest, in a normalized aggregate 
form. The reason Splunk is making such inroads in IoT 
applications is that it can handle both the trees and the forest 
and turn the information from the real world into a clear view of 
what is happening that allows useful models of reality to be 
created. If you are building an IoT application, you must find a 
way to handle the messy nature of the real world."
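
On the ingestion side, that kind of timestamped, key=value 
machine data is not hard to deal with using the ranges we 
already have.  A rough sketch, assuming an invented line format 
of an ISO timestamp followed by key=value pairs (the field names 
and sample lines are made up):

import std.algorithm : map;
import std.array : assocArray, split;
import std.datetime : SysTime;
import std.stdio : writeln;
import std.typecons : tuple;

// One parsed log record: a timestamp plus free-form named fields.
struct Record
{
    SysTime time;
    string[string] fields;
}

Record parseLine(string line)
{
    auto parts = line.split(" ");
    auto fields = parts[1 .. $]
        .map!(p => p.split("="))
        .map!(kv => tuple(kv[0], kv[1]))
        .assocArray;
    return Record(SysTime.fromISOExtString(parts[0]), fields);
}

void main()
{
    // Invented sample lines standing in for real machine data.
    auto lines = [
        "2014-11-17T12:03:44 unit=loco42 fuel=81.3",
        "2014-11-17T12:03:45 unit=loco42 fuel=81.2",
    ];
    foreach (rec; lines.map!parseLine)
        writeln(rec.time, " ", rec.fields);
}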

Many more similar opportunities for D here: 
https://www.google.de/search?q=internet+of+things+massive+log+processing+growth&btnG=Search&oe=utf-8&gws_rd=cr


Laeeth.

