[GSoC] Dataframes for D
Andre Pany
andre at s-e-a-p.de
Wed May 29 18:26:46 UTC 2019
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
> Hello everyone,
>
> I have begun work on my Google Summer of Code 2019 project,
> DataFrame for D.
>
> -----------------
> About the Project
> -----------------
>
> DataFrames have become a standard for handling and
> manipulating data. They give a neat representation, access and
> the power to modify the data the way the user wants.
> This project aims at bringing a native DataFrame to D, one which
> brings with it:
>
> * A User Friendly API
> * Multi-Indexing
> * Writing to CSV and parsing from CSV
> * Column binary operation in the form: df["Index1"] =
> df["Index2"] + df["Index3"];
> * groupBy on an arbitrary number of columns
> * Data Aggregation
>
> Disclaimer: The entire structuring was inspired by Pandas, the
> most popular DataFrame library in Python and hence most of the
> usage will look very similar to the ones in Pandas.
>
> The main focus of this project is user-friendliness of the API
> while also maintaining a fair amount of speed and power.
> The preliminary road map can be viewed here ->
> https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing
>
> The core developments can be seen here ->
> https://github.com/Kriyszig/magpie
>
>
> -----------------------------
> Brief idea of what is to come
> -----------------------------
>
> This month
> ----------
> * Finish up with structure of DataFrame
> * Finish Terminal Output (What good is data which cannot be
> seen)
> * Finish writing to CSV
> * Parsing DataFrame from CSV (Both single and multi-indexed)
> * Accessing Elements
> * Accessing Rows and Columns
> * Assignment of element, an entire row or column
> * Binary operation on rows and columns
>
> Next Month
> ----------
> * groupBy
> * join
> * Begin writing ops for aggregation
>
>
> -----------
> Speed Bumps
> -----------
>
> I am relatively new to D and hail from functional C background.
> Sometimes (most of the time) my code can start to look more C
> than D.
> However, I am adapting, thanks to my mentors Nicholas Wilson and
> Ilya Yaroshenko. They have helped me a ton - whether it be with
> debugging errors or me falling back to my functional C past,
> they have always come to my rescue and I am grateful for their
> support.
>
>
> -------------------------------------
> Addressing Suggestions from Community
> -------------------------------------
>
> This suggestion comes from Laeeth Isharc
> Source:
> https://github.com/dlang/projects/issues/15#issuecomment-495831750
>
> Though this is not on my current road map, I would love to
> pursue this idea. Adding an easy way to interoperate with
> other libraries would be very beneficial.
> Although I haven't formally addressed this in the road map, I
> would love to implement msgpack-based I/O as I continue to
> develop the library. JSON I/O was also something I had in mind
> to implement after the data aggregation part. (I had prioritised
> JSON as I believed there were many more datasets available as
> JSON than in any other format)
The outlook is really great. CSV is a good starting point.
I am not sure whether adding a second text format like JSON
brings real value.
Text formats have two disadvantages. First, converting strings to
numbers and vice versa slows down loading and saving files. If I
remember correctly, saving a file as CSV took around 50 seconds,
while saving the same data as a binary file (Parquet) took 3
seconds. The second issue is file size: data that takes 580 MB as
CSV takes 280 MB when saved as, e.g., a Parquet file.
File size isn't an issue on your local file system, but it is a
big issue when storing these files in the cloud, e.g. on Amazon
S3, where larger files mean longer transfer times.
Adding a packed binary format would be great, if possible.
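
To illustrate the two points above (conversion cost and file size),
here is a rough sketch in Python, used only because it is short; it
is not part of the DataFrame project. The helper names and the
binary layout (an 8-byte count followed by raw 64-bit floats) are
made up for this example. It is not Parquet, which additionally
applies column encodings and compression.

```python
import csv
import io
import struct


def encode_csv(values):
    """Write floats as a single CSV column of decimal text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for v in values:
        # repr() gives the shortest text that round-trips the float.
        writer.writerow([repr(v)])
    return buf.getvalue().encode("utf-8")


def decode_csv(data):
    """Parse the CSV text back; every cell needs a string->float conversion."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    return [float(row[0]) for row in reader]


def encode_binary(values):
    """Hypothetical packed layout: uint64 count, then little-endian float64s."""
    return struct.pack(f"<Q{len(values)}d", len(values), *values)


def decode_binary(data):
    """Read the count, then copy the floats out without any text parsing."""
    (count,) = struct.unpack_from("<Q", data, 0)
    return list(struct.unpack_from(f"<{count}d", data, 8))


values = [i / 7 for i in range(1, 10001)]
csv_bytes = encode_csv(values)
bin_bytes = encode_binary(values)

print("CSV bytes:   ", len(csv_bytes))
print("binary bytes:", len(bin_bytes))
```

The packed form always costs exactly 8 bytes per value plus the
header, while the text form needs up to 17 significant digits per
value plus a line terminator, and it has to format and parse every
number on the way out and in.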
Kind regards
Andre