[GSoC] Dataframes for D

Laeeth Isharc laeeth at kaleidic.io
Wed May 29 22:31:57 UTC 2019


On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
> On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
>> Hello everyone,
>>
>> I have begun work on my Google Summer of Code 2019 project, 
>> DataFrame for D.
>>
>> -----------------
>> About the Project
>> -----------------
>>
>> DataFrames have become a standard tool for handling and 
>> manipulating data. They give a neat representation of the 
>> data, convenient access, and the power to reshape it the way 
>> the user wants.
>> This project aims to bring a native DataFrame to D, one which 
>> comes with:
>>
>> * A User Friendly API
>> * Multi-indexing
>> * Writing to CSV and parsing from CSV
>> * Column binary operations in the form: df["Index1"] = 
>> df["Index2"] + df["Index3"]; (sketched after this list)
>> * groupBy on an arbitrary number of columns
>> * Data Aggregation
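>>
>> As a rough sketch of the intended surface (every name here is 
>> hypothetical; the final magpie API may well differ):
>>
>> import magpie.dataframe : DataFrame;  // hypothetical import path
>>
>> void main()
>> {
>>     DataFrame!(double, double, double) df; // hypothetical: three double columns
>>     df.from_csv("prices.csv");             // parsing, as planned below
>>     df["Index1"] = df["Index2"] + df["Index3"]; // the column binary op above
>>     df.to_csv("out.csv");                  // writing back out
>> }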
>>
>> Disclaimer: The entire structure was inspired by Pandas, the 
>> most popular DataFrame library in Python, and hence most of 
>> the usage will look very similar to Pandas.
>>
>> The main focus of this project is the user-friendliness of the 
>> API while also maintaining a fair amount of speed and power.
>> The preliminary road map can be viewed here -> 
>> https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing
>>
>> The core developments can be seen here -> 
>> https://github.com/Kriyszig/magpie
>>
>>
>> -----------------------------
>> Brief idea of what is to come
>> -----------------------------
>>
>> This month
>> ----------
>> * Finish up the structure of DataFrame
>> * Finish terminal output (what good is data that cannot be 
>> seen?)
>> * Finish writing to CSV
>> * Parsing a DataFrame from CSV, both single- and multi-indexed 
>> (see the std.csv sketch after this list)
>> * Accessing elements
>> * Accessing rows and columns
>> * Assignment of an element, an entire row, or a column
>> * Binary operations on rows and columns
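>>
>> For the CSV parsing piece, here is a minimal sketch of the 
>> building block Phobos already offers (std.csv is in the 
>> standard library; the dataframe layer around it is assumed):
>>
>> import std.csv;
>> import std.stdio : writeln;
>> import std.typecons : Tuple;
>>
>> void main()
>> {
>>     // std.csv handles quoting and per-field type conversion.
>>     auto text = "foo,1.5\nbar,2.5";
>>     foreach (record; csvReader!(Tuple!(string, double))(text))
>>         writeln(record[0], " -> ", record[1]);
>> }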
>>
>> Next Month
>> ----------
>> * groupBy
>> * join
>> * Begin writing ops for aggregation
>>
>>
>> -----------
>> Speed Bumps
>> -----------
>>
>> I am relatively new to D and come from a C background. 
>> Sometimes (most of the time) my code can start to look more 
>> like C than D.
>> However, I am adapting, thanks to my mentors Nicholas Wilson 
>> and Ilya Yaroshenko. They have helped me a ton. Whether it be 
>> debugging errors or catching me falling back into my old C 
>> habits, they have always come to my rescue, and I am grateful 
>> for their support.
>>
>>
>> -------------------------------------
>> Addressing Suggestions from Community
>> -------------------------------------
>>
>> This suggestion comes from Laeeth Isharc
>> Source: 
>> https://github.com/dlang/projects/issues/15#issuecomment-495831750
>>
>> Though this is not formally on my current road map, I would 
>> love to pursue this idea: adding an easy way to interoperate 
>> with other libraries would be very beneficial. I would love to 
>> implement msgpack-based I/O as I continue to develop the 
>> library, and JSON I/O was something I had in mind to implement 
>> after the data aggregation part. (I had prioritised JSON as I 
>> believed many more datasets are available as JSON than in any 
>> other format.)
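>>
>> As a minimal sketch of what msgpack-based I/O could build on, 
>> using the existing msgpack-d dub package (the Row struct is 
>> purely illustrative):
>>
>> import msgpack;   // dub package "msgpack-d"
>>
>> struct Row
>> {
>>     string label;
>>     double[] values;
>> }
>>
>> void main()
>> {
>>     auto row = Row("a", [1.0, 2.0, 3.0]);
>>     ubyte[] bytes = pack(row);     // serialise to msgpack bytes
>>     Row back = bytes.unpack!Row;   // and deserialise them again
>>     assert(back == row);
>> }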
>
> The outlook looks really great. CSV as a starting point is 
> good. I am not sure whether adding a second text format like 
> JSON brings real value.
> Text formats have two disadvantages. First, converting strings 
> to numbers and vice versa slows down loading and saving: if I 
> remember correctly, saving a file as CSV took around 50 
> seconds, while saving the same data as a binary (Parquet) file 
> took 3 seconds. The second issue is file size: a 580 MB CSV 
> shrinks to around 280 MB when saved as e.g. a Parquet file.
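>
> To illustrate the conversion cost, a minimal sketch (Phobos 
> only; the absolute numbers are machine-dependent, it just shows 
> the text formatting dominating):
>
> import std.conv : to;
> import std.datetime.stopwatch : StopWatch;
> import std.stdio : writeln;
>
> void main()
> {
>     auto xs = new double[1_000_000];
>     foreach (i, ref x; xs) x = i * 0.5;
>
>     StopWatch sw;
>     sw.start();
>     size_t chars;
>     foreach (x; xs) chars += x.to!string.length; // text: format every value
>     writeln("text:   ", sw.peek, " (", chars, " chars)");
>
>     sw.reset();
>     auto raw = cast(const(ubyte)[]) xs;          // binary: reinterpret the bytes
>     writeln("binary: ", sw.peek, " (", raw.length, " bytes)");
> }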
>
> The file size isn't an issue on your local file system, but it 
> is a big issue when storing these files in the cloud, e.g. on 
> Amazon S3, where larger files mean longer transfer times.
>
> Adding a packed binary format would be great, if possible.
>
> Kind regards
> Andre

Interoperability with pandas would be important in our use case, 
and I think probably for quite a few others.  So yes, I agree 
that it's not ideal to use JSON, but lots of things are not 
ideal.  And I think people use JSON, msgpack and HDF5 for 
interop with pandas.  CSV is more complicated in practice than 
one might initially think.  Finally, of course, there is Excel 
interop.

It's not my cup of tea, but gzipped JSON is quite compact...
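
For what it's worth, a minimal sketch of gzipping JSON text with 
Phobos alone (std.zlib's Compress class with a gzip header; the 
function name is mine):

import std.zlib : Compress, HeaderFormat;

const(void)[] gzipJson(string json)
{
    // A gzip header (rather than the default zlib one) keeps the
    // output readable by ordinary gunzip and by pandas.
    auto c = new Compress(9, HeaderFormat.gzip);
    auto data = c.compress(json);
    return data ~ c.flush();
}

void main()
{
    auto gz = gzipJson(`{"a":[1,2,3],"b":[4,5,6]}`);
    assert(gz.length > 0);
}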

I have a little deserialiser that streams msgpack into our own 
Variable type (which can store a primitive or a variant).  It's 
not long and I could share it.

I don't think the initial dataframe necessarily needs to have 
all this stuff in it from day one.

It's worth reusing an existing JSON implementation.  Since you 
work with Ilya: asdf isn't bad and is quite fast, though the 
error messages leave something to be desired.
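
For reference, asdf's round trip looks roughly like this 
(adapted from the asdf README):

import asdf;   // dub package "asdf"

struct Simple
{
    string name;
    double level;
}

void main()
{
    auto o = Simple("asdf", 10);
    string data = `{"name":"asdf","level":10}`;
    assert(o.serializeToJson == data);
    assert(data.deserialize!Simple == o);
}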

You might look at John Colvin's repo from his talk on @nogc map, 
filter, fold etc.  He hasn't done chunkBy yet.
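
For reference, the (GC-using) Phobos chunkBy already covers 
groupBy-style aggregation over sorted data; a minimal sketch:

import std.algorithm.iteration : chunkBy, map, sum;
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.typecons : tuple;

void main()
{
    // chunkBy groups runs of adjacent equal keys, so sort first.
    auto rows = [tuple("a", 1), tuple("b", 2), tuple("a", 3)];
    rows.sort!((x, y) => x[0] < y[0]);
    foreach (group; rows.chunkBy!(r => r[0]))
        writeln(group[0], ": ", group[1].map!(r => r[1]).sum);
    // prints  a: 4  then  b: 2
}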




