[GSoC] Dataframes for D

Prateek Nayak lelouch.cpp at gmail.com
Sat Aug 10 04:10:31 UTC 2019


On Friday, 9 August 2019 at 08:08:39 UTC, bioinfornatics wrote:
> Dear D community,
>
> Thanks, Prateek Nayak, for your work.
> I currently work with pandas (Python, dataframes ...).
> There is one extra feature that I appreciate a lot, the
> IO tools part:
>
> * SQL
> method: read_sql and to_sql
> Description: these allow reading from and saving to a database.
> Combined with SQLAlchemy, these methods are awesome.
>
> * Parquet
> method: read_parquet and to_parquet
> Description: in Big Data environments, Parquet is a file format
> that is often used.
>
> These abilities make pandas and its DataFrame API a core library
> to have. Used like this, it standardizes the data structures
> used in our applications and at the same time offers a rich
> statistics API.
>
> Indeed, this is important for code maintainability. The
> FairData point is that an application is a set of input data +
> program features = result. Thus, treating data structures as the
> first component to think about when developing an application is
> important.
> The application is more robust and flexible, as we can handle
> multiple input data file formats.
>
> I hope to see such features in D.
>
>
> Best regards
>
> Source:
> - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
> - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
> - https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet

I was looking into Parquet, and it even came up in the Reddit post
I had linked to earlier on - smaller file size and better I/O
make it really good for industrial use.
A quick search on DUB didn't turn up a parser, so I'll probably
work on a library for reading and writing Parquet files.
I looked into Cap'n Proto too - it looks promising, but it's
missing from the pandas I/O section, which was disappointing.
Thanks for mentioning SQL. I will start working on these features
soon.
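
To make the SQL request concrete, here is a minimal sketch of the
read_sql / to_sql round trip, written in Python with only the
standard library's sqlite3 module. The helper names and the
dict-of-columns "frame" are assumptions for illustration; this is
neither pandas' implementation nor a planned Magpie API.

```python
import sqlite3

def to_sql(columns, table, conn):
    """Write a dict of equal-length columns to a new table (hypothetical helper).

    Table and column names are interpolated directly, so this sketch
    must only be used with trusted names.
    """
    names = list(columns)
    conn.execute(f"CREATE TABLE {table} ({', '.join(names)})")
    rows = zip(*(columns[name] for name in names))
    placeholders = ", ".join("?" for _ in names)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

def read_sql(query, conn):
    """Run a query and return the result as a dict of columns (hypothetical helper)."""
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

conn = sqlite3.connect(":memory:")
to_sql({"A": [10, 40, 50], "B": ["x", "ignore", "y"]}, "samples", conn)
frame = read_sql("SELECT A, B FROM samples WHERE A > 30 ORDER BY A", conn)
```

The point of the sketch is that the query result arrives already
column-oriented, which is the shape a dataframe wants.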

> Again, thank you so much for working on this!
> We will be excited to put Magpie through its paces in our lab, 
> but it is missing* a few key (really, basic IMO) features we 
> make heavy use of in pandas.
> * I have read the README and glanced at the code but not used
> Magpie yet, so if I am wrong about the points below, please
> correct me!
> Since you are soliciting ideas:
> 1. Selecting/indexing into data with boolean vectors. e.g:
> df[df.A > 30 && df.B != "ignore"]
> 1a. This really means returning a boolean vector for df.COL 
> <op> <operand>
> 1b. ...and being able to subset data by a bool vector
> 2. We make heavy use of "pivot" functionality.
> Kind regards

I was thinking of the same feature as 1 - a filter-like function
for DataFrame and Group - and am looking at possible ways to
implement it.
I'm really embarrassed to admit I never even thought about pivot.
It looks like a beautiful feature to have - I will definitely add
it to Magpie soon (possibly over the next couple of weeks - I'm a
bit tied down right now with the start of university classes, but
it will definitely come soon).
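
As a rough illustration of requests 1a, 1b and 2 from the quoted
message, the semantics can be sketched in plain Python over a
dict of columns. All function names here are hypothetical; this
is not Magpie's API, only a sketch of the behaviour being asked
for.

```python
from collections import defaultdict

def mask(column, predicate):
    """1a: evaluate a predicate per element, yielding a boolean vector."""
    return [predicate(value) for value in column]

def both(left, right):
    """Element-wise AND of two boolean vectors of equal length."""
    return [a and b for a, b in zip(left, right)]

def select(frame, keep):
    """1b: keep only the rows whose entry in the boolean vector is True."""
    return {name: [v for v, k in zip(col, keep) if k]
            for name, col in frame.items()}

def pivot(frame, index, columns, values):
    """2: one row per `index` value, one cell per `columns` value.

    Duplicate (index, columns) pairs overwrite earlier ones in this sketch.
    """
    table = defaultdict(dict)
    for i, c, v in zip(frame[index], frame[columns], frame[values]):
        table[i][c] = v
    return dict(table)

df = {
    "A": [10, 40, 50, 60],
    "B": ["x", "ignore", "y", "x"],
    "C": [1.0, 2.0, 3.0, 4.0],
}

# Equivalent in spirit to the quoted df[df.A > 30 && df.B != "ignore"]
keep = both(mask(df["A"], lambda a: a > 30),
            mask(df["B"], lambda b: b != "ignore"))
subset = select(df, keep)
wide = pivot(subset, index="B", columns="A", values="C")
```

Splitting the work into "build a boolean vector" and "subset by a
boolean vector" keeps the two steps independent, so masks can be
combined before any rows are copied.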

