dataframe implementations

Wed Nov 18 09:06:46 PST 2015

On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:
> I looked through the dataframe code and a couple of comments...
>
> I had thought perhaps an app could read in the header info and 
> type info from hdf5, and generate D struct definitions with 
> column headers as symbol names.  That would enable faster 
> processing than with the associative arrays, as well as support 
> the auto-completion that would be helpful in writing 
> expressions.

Yes - I think one will want a choice between this kind of 
approach and using associative arrays.  For some purposes it's 
not convenient to have to compile code every time you open an 
unfamiliar file; on the other hand, the performance hit from an 
AA will sometimes matter.
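
To make the trade-off concrete, here is a minimal sketch (not 
the actual dataframe code).  PriceColumns stands in for what a 
generator might emit from a file header; its name, the column 
names and the mean helper are all made up for illustration.

```d
import std.algorithm : sum;
import std.stdio : writeln;

// Hypothetical struct that a generator could emit from a file header.
// Column names become field names, so typos are caught at compile time.
struct PriceColumns
{
    string[] date;
    double[] close;
}

// Arithmetic mean of a column (NaN for an empty column).
double mean(double[] xs)
{
    return xs.sum / xs.length;
}

void main()
{
    // Compile-time approach: columns are ordinary typed fields.
    PriceColumns typed;
    typed.date = ["2015-11-16", "2015-11-17"];
    typed.close = [10.5, 11.5];
    writeln(mean(typed.close));

    // Run-time approach: columns are looked up by name in an
    // associative array, so no recompilation is needed for a new file,
    // but each access pays for a hash lookup.
    double[][string] byName;
    byName["close"] = [10.5, 11.5];
    if (auto col = "close" in byName)
        writeln(mean(*col));
    else
        writeln("no such column");
}
```

The struct version gives the compiler (and an editor) the column 
names; the AA version only finds out at run time whether a column 
exists.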

The situation at the moment is that I have very little time to 
work on a correct general solution for this problem myself (yet 
it's important for D that we do get to one).  I also lack the 
experience with D to do it very well very quickly.  I do have a 
couple of seasoned people from the community helping me with 
things, but dataframes won't be the first thing they look at, and 
it could be a while before we get to that.  If we implement 
something for our own needs, then I will open source it, as that 
is commercially sensible as well as the right thing to do.  But 
that could be a year away.

Vlad Levenfeld was also looking at this a bit.

> The csv type info for columns could be inferred, or else stated 
> in the reader call, as done as an option in julia.
>
> In both cases the column names would have to be valid symbol 
> names for this to work.  I believe Julia also expects this, or 
> else does some conversion on your column names to make them 
> valid symbols. I think the D csv processing would also need to 
> check if the
>
> The jupyter interactive environment supports python pandas and 
> Julia dataframe column names in the autocompletion, and so I 
> think the D debugging environment would need to provide similar 
> capability if it is to be considered as a fast-recompile 
> substitute for interactive dataframe exploration.

Well, we don't need to get there in a single bound - just being 
able to do this at all is already a big improvement, and I am 
already using D with jupyter to do things.
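
On the point about column names needing to be valid symbols, 
here is a rough sketch of the kind of conversion that would be 
needed.  toIdentifier is a hypothetical helper, not something in 
the dataframe code or in Phobos, and it makes no attempt to handle 
names that clash with D keywords or with each other.

```d
import std.ascii : isAlpha, isAlphaNum;
import std.stdio : writeln;

// Hypothetical helper: map an arbitrary column header to a string
// that is a valid D identifier (ASCII letters, digits, underscore).
string toIdentifier(string header)
{
    string result;
    // Iterate over code units; anything outside [A-Za-z0-9_]
    // (including each byte of a non-ASCII character) becomes '_'.
    foreach (char c; header)
        result ~= (isAlphaNum(c) || c == '_') ? c : '_';
    // An identifier may not be empty or start with a digit.
    if (result.length == 0 || !(isAlpha(result[0]) || result[0] == '_'))
        result = "_" ~ result;
    return result;
}

void main()
{
    writeln(toIdentifier("Adj Close"));
    writeln(toIdentifier("52wk high"));
}
```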

> It seems to me that your particular examples of stock data 
> would eventually need to handle missing data, as supported in 
> Julia dataframes and python pandas.  They both provide ways to 
> drop or fill missing values.  Did you want to support that?

Yes - we should do so eventually, and there's much more that 
could be done.  But a sensible basic implementation is a start, 
and we can refine after that.
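
As a very rough idea of what drop and fill could look like for a 
floating-point column, assuming missing values are represented as 
NaN (that representation, and the names dropMissing and 
fillMissing, are just assumptions for this sketch, not what the 
dataframe does today):

```d
import std.algorithm : filter, map;
import std.array : array;
import std.math : isNaN;
import std.stdio : writeln;

// Return a new column with the missing (NaN) entries removed.
double[] dropMissing(double[] column)
{
    return column.filter!(x => !x.isNaN).array;
}

// Return a new column with each missing (NaN) entry replaced by fill.
double[] fillMissing(double[] column, double fill)
{
    return column.map!(x => x.isNaN ? fill : x).array;
}

void main()
{
    double[] close = [10.5, double.nan, 11.5];
    writeln(dropMissing(close));
    writeln(fillMissing(close, 0.0));
}
```

NaN only works for floating-point columns; integer or string 
columns would need something else, such as a separate mask or 
std.typecons.Nullable.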

I wrote the dataframe in a couple of evenings, so I am sure it 
can be improved, and even rearchitected.  Pull requests are 
welcome, and maybe we should set up a Trello to organise ideas?  
Let me know if you are in.