[GSoC] Dataframes for D

Andre Pany andre at s-e-a-p.de
Wed May 29 18:26:46 UTC 2019


On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
> Hello everyone,
>
> I have begun work on my Google Summer of Code 2019 project 
> DataFrame for D.
>
> -----------------
> About the Project
> -----------------
>
> DataFrames have become a standard for handling and 
> manipulating data. They give a neat representation of the 
> data, along with access to it and the power to manipulate it 
> the way the user wants.
> This project aims to bring a native DataFrame to D, one which 
> brings with it:
>
> * A User Friendly API
> * Multi-indexing
> * Writing to CSV and parsing from CSV
> * Column binary operation in the form: df["Index1"] = 
> df["Index2"] + df["Index3"];
> * groupBy on an arbitrary number of columns
> * Data Aggregation
>
> Disclaimer: The entire structure was inspired by Pandas, the 
> most popular DataFrame library in Python, and hence most of 
> the usage will look very similar to Pandas.
>
> The main focus of this project is the user-friendliness of 
> the API, while also maintaining a fair amount of speed and 
> power.
> The preliminary road map can be viewed here -> 
> https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing
>
> The core developments can be seen here -> 
> https://github.com/Kriyszig/magpie
>
>
> -----------------------------
> Brief idea of what is to come
> -----------------------------
>
> This month
> ----------
> * Finish up with structure of DataFrame
> * Finish Terminal Output (What good is data which cannot be 
> seen)
> * Finish writing to CSV
> * Parsing DataFrame from CSV (Both single and multi-indexed)
> * Accessing Elements
> * Accessing Rows and Columns
> * Assignment of element, an entire row or column
> * Binary operation on rows and columns
>
> Next Month
> ----------
> * groupBy
> * join
> * Begin writing ops for aggregation
>
>
> -----------
> Speed Bumps
> -----------
>
> I am relatively new to D and come from a procedural C 
> background. Sometimes (most of the time) my code can start to 
> look more like C than D.
> However, I am adapting, thanks to my mentors Nicholas Wilson 
> and Ilya Yaroshenko. They have helped me a ton - whether it be 
> debugging errors or catching me falling back into my C habits, 
> they have always come to my rescue and I am grateful for their 
> support.
>
>
> -------------------------------------
> Addressing Suggestions from Community
> -------------------------------------
>
> This suggestion comes from Laeeth Isharc
> Source: 
> https://github.com/dlang/projects/issues/15#issuecomment-495831750
>
> Though this is not on my current road map, I would love to 
> pursue this idea. Adding an easy way to interoperate with 
> other libraries would be very beneficial.
> Although I haven't formally addressed this in the road map, I 
> would love to implement msgpack-based I/O as I continue to 
> develop the library. JSON I/O was also something I had in mind 
> to implement after the data aggregation part. (I had 
> prioritised JSON as I believed many more datasets are 
> available as JSON than in any other format.)
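Since the quoted design is explicitly modeled on Pandas, the column binary operation and groupBy features listed above can be illustrated with their pandas equivalents. This is a hedged Python sketch of the intended usage; magpie's actual D API may differ. The column and key names are made up for the example.

```python
import pandas as pd

# Toy frame standing in for a DataFrame parsed from CSV.
df = pd.DataFrame({
    "key":    ["a", "a", "b"],
    "Index2": [1, 2, 3],
    "Index3": [10, 20, 30],
})

# Column binary operation, analogous to the D form:
#   df["Index1"] = df["Index2"] + df["Index3"];
df["Index1"] = df["Index2"] + df["Index3"]

# groupBy on a column, followed by an aggregation.
totals = df.groupby("key")["Index1"].sum()

print(df["Index1"].tolist())      # [11, 22, 33]
print(totals["a"], totals["b"])   # 33 33
```

The same column-wise arithmetic and grouping are what the road map describes for D, with D's operator overloading providing the `df["Index1"] = df["Index2"] + df["Index3"];` syntax.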

The outlook looks really great. CSV as a starting point is good. 
I am not sure whether adding a second text format like JSON 
brings real value.
Text formats have two disadvantages. First, string-to-number 
conversion (and vice versa) slows down loading and saving files. 
If I remember correctly, saving a file as CSV took around 50 
seconds, while saving the same data as a binary file (Parquet) 
took 3 seconds.
The second issue is file size. A 580 MB CSV shrinks to around 
280 MB when saved as, e.g., a Parquet file.
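Both costs can be seen without Parquet itself. The sketch below (Python standard library only; the numbers are illustrative, not a benchmark of any real dataset) serializes the same doubles once as CSV text and once as packed binary: the text form needs a number-to-string conversion for every value, and each value takes however many characters its decimal representation needs, while the binary form is a fixed 8 bytes per double.

```python
import csv
import io
import struct

values = [i * 0.123456789 for i in range(10_000)]

# Text: every float is converted to a decimal string, plus separators.
text_buf = io.StringIO()
csv.writer(text_buf).writerow(values)
text_size = len(text_buf.getvalue().encode("utf-8"))

# Binary: every float is a fixed 8-byte IEEE 754 double, no conversion
# to text and none needed when reading back.
binary = struct.pack(f"{len(values)}d", *values)
binary_size = len(binary)

print(text_size, binary_size)  # the binary form is the smaller of the two
```

Real columnar formats like Parquet add compression and per-column encodings on top of this, which is where savings like 580 MB down to 280 MB come from.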

The file size isn't an issue on your local file system, but it 
is a big issue when storing these files in the cloud, e.g. on 
Amazon S3. The larger file size will cause longer transfer 
times.

Adding a packed binary format would be great, if possible.

Kind regards
Andre


More information about the Digitalmars-d mailing list