Data frames in D?

Fri Dec 26 13:12:26 PST 2014

On Fri, 2014-12-26 at 20:44 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]
> I agree with other posters that a D REPL and
> interactive/visualization data environment would be very cool,
> but unfortunately doesn't exist. Batch computing is more
> practical, but REPLs really hook new users. I see statistical
> computing as a huge opportunity for D adoption. (R is just
> super-ugly and slow, leaving Python + its various native-code
> cyborg appendages as the hot new stats environment).

REPLs are over-hyped and have become a fashion touchstone that few dare
argue against for fear of being denounced as un-hip. REPLs have their
place, but in the main are nowhere near as useful as people claim.
IPython Notebooks, on the other hand, strike a balance between
editor/execution environment and REPL that really has a lot going for
it.

Stats folk using R love R and hate Python. Stats folk using Python
love Python and hate R. In the end it's all about what you know and can
use to get the job done. To be frank (as in open rather than Jill), D
hasn't got the infrastructure to compete with either R or Python and so
is a non-starter in the data science arena.

> There are tons of ways of accomplishing the same thing in D, but
> as far as I know there isn't a "standard" at this point. A
> statically typed dataframe is, at minimum, just a range of
> structs -- even more minimally, a bare *array* of structs, or
> alternatively just a 2-D array in a thin wrapper that provides
> access via column labels rather than indexes. You can manipulate
> these ranges with functions from std.range and std.algorithm.
> Missing or N/A data is a common issue, and can be represented in
> a variety of ways, with integers being the most annoying since
> there is no built-in NaN value for ints (check out the Nullable
> template from std.typecons).
> 
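
To make the "range of structs" idea quoted above concrete, here is a
minimal sketch. The Row type, its fields, and meanAge are hypothetical
names chosen for illustration; only Nullable (std.typecons) and the
std.algorithm functions are real library pieces. It assumes a missing
int is stored as a null Nullable!int, as the quoted text suggests.

```d
import std.algorithm : filter, map, sum;
import std.array : array;
import std.stdio : writeln;
import std.typecons : Nullable;

// Hypothetical row type; the field names are illustrative only.
struct Row
{
    string name;
    double height;     // a floating-point column could use double.nan for N/A
    Nullable!int age;  // int has no NaN, so wrap it in Nullable
}

// Mean of the ages that are present; rows with a missing age are skipped.
double meanAge(Row[] rows)
{
    auto present = rows
        .filter!(r => !r.age.isNull)
        .map!(r => r.age.get)
        .array;
    if (present.length == 0)
        return double.nan;
    return cast(double) present.sum / present.length;
}

void main()
{
    // The "dataframe" is nothing more than an array of structs.
    auto frame = [
        Row("ann", 1.70, Nullable!int(31)),
        Row("bob", 1.82, Nullable!int.init), // age missing
        Row("cat", 1.65, Nullable!int(43)),
    ];
    writeln(meanAge(frame));
}
```

The column types are checked at compile time, which is the main thing
this form buys over a 2-D array of one element type.
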
> Supporting features like having *both* rows and columns
> accessible via labels rather than indexes requires a little bit
> more wrapping. We have a NamedMatrix class at my workplace for
> that purpose. It's easy to overload the index operator [] for
> access, * for matrix multiplication, etc.
> 
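
The NamedMatrix class mentioned above has not been published, so the
following is only a guess at the general shape of such a wrapper, not
that class. LabeledMatrix and all its members are hypothetical; the
only language feature relied on is D's opIndex/opIndexAssign operator
overloading with multiple index arguments.

```d
// Hypothetical thin wrapper: a 2-D array plus label-to-index lookups.
struct LabeledMatrix
{
    double[][] data;         // row-major storage
    size_t[string] rowIndex; // row label -> row number
    size_t[string] colIndex; // column label -> column number

    this(string[] rowLabels, string[] colLabels, double[][] values)
    {
        data = values;
        foreach (i, label; rowLabels) rowIndex[label] = i;
        foreach (j, label; colLabels) colIndex[label] = j;
    }

    // m["row", "col"] reads one cell by label.
    double opIndex(string row, string col) const
    {
        return data[rowIndex[row]][colIndex[col]];
    }

    // m["row", "col"] = x writes one cell by label.
    void opIndexAssign(double value, string row, string col)
    {
        data[rowIndex[row]][colIndex[col]] = value;
    }
}

void main()
{
    auto m = LabeledMatrix(["2013", "2014"], ["gdp", "cpi"],
                           [[1.0, 2.0], [3.0, 4.0]]);
    m["2014", "cpi"] = 4.5;
    assert(m["2013", "gdp"] == 1.0);
    assert(m["2014", "cpi"] == 4.5);
}
```

An unknown label fails at run time on the associative-array lookup;
matrix multiplication via opBinary!"*" is left out to keep the sketch
short.
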
> CSV loads can be done with std.csv; unfortunately there's no
> corresponding support in that module for *writing* CSV (I've
> rolled my own). At my workplace we also have a MysqlConnection
> class that provides one-liner loading from a SQL query into
> minimalist, range-of-structs dataframes.
> 
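
A minimal sketch of the CSV round trip described above: reading goes
through csvReader from std.csv, writing is hand-rolled because the
module offers no writer. The Trade struct, loadTrades and toCsv are
hypothetical names. The sketch assumes input without a header row and
fields that never contain commas, quotes or newlines, so the writer
does no quoting.

```d
import std.algorithm : map;
import std.array : array, join;
import std.csv : csvReader;
import std.format : format;

// Hypothetical record type; one struct field per CSV column, in order.
struct Trade
{
    string symbol;
    int quantity;
    double price;
}

// Parse header-less CSV text into an array of structs.
Trade[] loadTrades(string text)
{
    return csvReader!Trade(text).array;
}

// Naive writer: no quoting or escaping, one record per line.
string toCsv(Trade[] trades)
{
    return trades
        .map!(t => format("%s,%d,%s", t.symbol, t.quantity, t.price))
        .join("\n");
}

void main()
{
    auto trades = loadTrades("AAPL,100,10.5\nMSFT,50,20.25");
    assert(trades.length == 2);
    assert(trades[1].symbol == "MSFT");
    // Writing and re-reading gives back the same records.
    assert(loadTrades(toCsv(trades)) == trades);
}
```

A real writer would need to quote fields containing the separator, and
loading from SQL would need a database binding that is outside Phobos.
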
> Beyond that, it really depends on how you want to manipulate the
> dataframes. What specific things do you want to do? If you've got
> an idea, I could work up some sample code.
> 
> So yes, there are people doing it in The Real World.
> Unfortunately my colleagues don't have a nice, tidy,
> self-contained DataFrame module to share (yet). But having one
> would be a great thing for D. The bigger problem though is
> matching the huge 3rd-party stats libraries (like CRAN for R).
> "

Nor the whole Python/SciPy/Matplotlib thing.

> ----
> 
> Since we do have an interactive shell (the pastebin), and now 
> bindings and wrappers for hdf5 (key for large data sets) and 
> basic seeds for a matrix library, should we start to think about 
> what would be needed for a dataframe, and the best way to 
> approach it, starting very simply?
> 
> One doesn't need to have a comparable library to R for it to 
> start being useful in particular use cases.

Whilst I can do workshops for data science folk using Python, and can
argue why Python beats R in almost all the cases brought up so far,
there is no way I can even start to mention D.

> Pandas and Julia would be obvious potential sources of 
> inspiration (and it may be that one still uses them to call out 
> to D in some cases), but rather than trying to just port pandas 
> to D, it seems to make sense to ask how one should do it from 
> scratch to better suit D.

Pandas is just one of the "native code cyborg appendages" you were
railing about earlier. It happens to be "a big thing" in data science
and one of the reasons Python is running away with the market, reducing
R's market penetration and being dented only a little, in some places,
by Julia.

It's not about the language, it's about the total milieu. Whether or not
Python is a good language compared with D is irrelevant:
Python/SciPy/Matplotlib/Pandas/IPython is there and ready; D has no play
in the game.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder at ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel at winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder