Data frames in D?

Laeeth Isharc via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Dec 26 12:44:34 PST 2014


"
> I thought about it once but quickly abandoned the idea. The 
> primary reason was that D doesn't have REPL and is thus not 
> suitable for interactive data exploration.

The quick compile times could allow interactive data exploration.

I agree with other posters that a D REPL and
interactive/visualization data environment would be very cool,
but unfortunately neither exists. Batch computing is more
practical, but REPLs really hook new users. I see statistical
computing as a huge opportunity for D adoption. (R is just
super-ugly and slow, leaving Python + its various native-code
cyborg appendages as the hot new stats environment).

There are tons of ways of accomplishing the same thing in D, but
as far as I know there isn't a "standard" at this point. A
statically typed dataframe is, at minimum, just a range of
structs -- even more minimally, a bare *array* of structs, or
alternatively just a 2-D array in a thin wrapper that provides
access via column labels rather than indexes. You can manipulate
these ranges with functions from std.range and std.algorithm.
Missing or N/A data is a common issue, and can be represented in
a variety of ways, with integers being the most annoying since
there is no built-in NaN value for ints (check out the Nullable
template from std.typecons).

Supporting features like having *both* rows and columns
accessible via labels rather than indexes requires a little bit
more wrapping. We have a NamedMatrix class at my workplace for
that purpose. It's easy to overload the index operator [] for
access, * for matrix multiplication, etc.

CSV loads can be done with std.csv; unfortunately there's no
corresponding support in that module for *writing* CSV (I've
rolled my own). At my workplace we also have a MysqlConnection
class that provides one-liner loading from a SQL query into
minimalist, range-of-structs dataframes.

Beyond that, it really depends on how you want to manipulate the
dataframes. What specific things do you want to do? If you've got
an idea, I could work up some sample code.

So yes, there are people doing it in The Real World.
Unfortunately my colleagues don't have a nice, tidy,
self-contained DataFrame module to share (yet). But having one
would be a great thing for D. The bigger problem though is
matching the huge 3rd-party stats libraries (like CRAN for R).
"

----

Since we do have an interactive shell (the pastebin), and now 
bindings and wrappers for hdf5 (key for large data sets) and 
basic seeds for a matrix library, should we start to think about 
what would be needed for a dataframe, and the best way to 
approach it, starting very simply?
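
As a starting point, here is a minimal sketch of the range-of-structs
idea from the quoted post, using std.csv, std.algorithm and Nullable
from std.typecons. The RawRow and Row structs, the column names, and
the helpers toRow, loadRows and totalVolume are all hypothetical names
made up for illustration; this is one possible shape, not an agreed
design. Cells are read as text first so that an empty integer cell can
be mapped to a null value instead of failing conversion.

```d
import std.algorithm : filter, map, sum;
import std.array : array;
import std.conv : to;
import std.csv : csvReader;
import std.typecons : Nullable;

// Raw row as read from the CSV. Every field is kept as text so that
// an empty cell can be detected before conversion.
struct RawRow
{
    string symbol;
    string volume;
    string price;
}

// Typed row. volume is Nullable because int has no NaN to mark a gap.
struct Row
{
    string symbol;
    Nullable!int volume;
    double price;
}

// Convert one raw row, leaving volume null when the cell is empty.
Row toRow(RawRow r)
{
    Row row;
    row.symbol = r.symbol;
    if (r.volume.length)
        row.volume = r.volume.to!int;
    row.price = r.price.to!double;
    return row;
}

// Parse CSV text whose first line is the header symbol,volume,price.
Row[] loadRows(string text)
{
    return csvReader!RawRow(text, ["symbol", "volume", "price"])
        .map!toRow
        .array;
}

// Sum the volume column, skipping rows where it is missing.
long totalVolume(Row[] rows)
{
    return rows
        .filter!(r => !r.volume.isNull)
        .map!(r => cast(long) r.volume.get)
        .sum;
}
```

Given the text "symbol,volume,price\nAAA,100,10.5\nBBB,,20.0", loadRows
should yield two rows with the second row's volume left null, and the
result can be passed straight to anything in std.range or
std.algorithm.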

One doesn't need to have a comparable library to R for it to 
start being useful in particular use cases.

Pandas and Julia would be obvious potential sources of 
inspiration (and it may be that one still uses them to call out 
to D in some cases), but rather than trying to just port pandas 
to D, it seems to make sense to ask how one should do it from 
scratch to better suit D.
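
For the label-based access mentioned in the quoted post, a thin
wrapper over a 2-D array with an overloaded index operator might look
roughly like the sketch below. LabeledMatrix and its members are
hypothetical names for illustration only (this is not the NamedMatrix
class referred to above, which I have not seen), and the linear label
lookup is an assumption chosen for brevity rather than speed.

```d
import std.algorithm : countUntil;
import std.exception : enforce;

// Hypothetical thin wrapper: a dense 2-D array plus row and column labels.
struct LabeledMatrix
{
    string[] rowLabels;
    string[] colLabels;
    double[][] data; // data[row][col]

    // Translate a label into its position, throwing if it is absent.
    private static size_t indexOf(const(string)[] labels, string label)
    {
        auto i = labels.countUntil(label);
        enforce(i >= 0, "unknown label: " ~ label);
        return cast(size_t) i;
    }

    // m["row", "col"] reads a single cell by label.
    double opIndex(string row, string col) const
    {
        return data[indexOf(rowLabels, row)][indexOf(colLabels, col)];
    }

    // m["row", "col"] = x writes a single cell by label.
    void opIndexAssign(double value, string row, string col)
    {
        data[indexOf(rowLabels, row)][indexOf(colLabels, col)] = value;
    }

    // m["col"] copies a whole column into a new array.
    double[] opIndex(string col) const
    {
        immutable c = indexOf(colLabels, col);
        double[] column;
        foreach (r; data)
            column ~= r[c];
        return column;
    }
}
```

A statically typed alternative would keep the range of structs and
generate the column accessors at compile time, which is the kind of
thing that might suit D better than a direct port of pandas.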


Laeeth.

