[GSoC] Dataframes for D

Thu May 30 04:26:02 UTC 2019

On Wednesday, 29 May 2019 at 22:31:57 UTC, Laeeth Isharc wrote:
> On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
>> On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
>>> [...]
>>
>> The outlook looks really great. CSV as a starting point is 
>> good. I am not sure whether adding a second text format like 
>> JSON brings real value.
>> Text formats have two disadvantages. First, converting strings 
>> to numbers and vice versa slows down loading and saving files. 
>> If I remember correctly, saving a file as CSV took around 50 
>> seconds, while saving the same data as a binary file (Parquet) 
>> took 3 seconds.
>> The second issue is file size. A 580 MB CSV file shrinks to 
>> 280 MB when saved as, e.g., a Parquet file.
>>
>> The file size isn't an issue on your local file system, but it 
>> is a big issue when storing these files in the cloud, e.g. on 
>> Amazon S3. Larger files mean longer transfer times.
>>
>> Adding a packed binary format would be great, if possible.
>>
>> Kind regards
>> Andre
>
> Interoperability with pandas would be important in our use 
> case, and I think probably for quite a few others.  So yes, I 
> agree that it's not ideal to use JSON, but lots of things are 
> not ideal.  And I think people use JSON, msgpack and hdf5 for 
> interop with pandas.  CSV is more complicated in practice than 
> one might initially think.  Finally, of course, there is Excel 
> interop.
>
> It's not my cup of tea but gzipped JSON is quite compact...
>
> I have a small streaming deserialiser from msgpack to our own 
> Variable type (which can store a primitive or a variant).  It's 
> not long and I could share that.
>
> I don't think the initial dataframe needs to have all this 
> stuff in it from day one necessarily.
>
> It's worth reusing another JSON implementation.  Since you work 
> with Ilya: asdf isn't bad and is quite fast, though the error 
> messages leave something to be desired.
>
> You might look at John Colvin's repo from his talk on nogc map, 
> filter, fold, etc.  He hasn't done chunkBy yet.

Interop is important, and I see many binary formats being 
suggested as a way to achieve it.
Pandas I/O tools cover some of the popular formats: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Apache Arrow is new to me; I hadn't heard of it before. I'll look 
into it, although I couldn't immediately find a way to integrate 
it with D.

I would love to hear which binary format the community would like 
to see added for interop in the future.
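
As a rough illustration of the size argument made earlier in the 
thread, here is a small Python sketch (standard library only) that 
encodes one column of doubles as CSV text, as packed binary, and as 
gzipped JSON, and reports the byte counts. The function name 
`encoded_sizes` and the sample column are my own assumptions for 
illustration. This is not Parquet and not a benchmark, and real data 
will behave differently.

```python
import csv
import gzip
import io
import json
import struct


def encoded_sizes(values):
    """Return the encoded size in bytes of one float column per format.

    Hypothetical helper for illustration only.
    """
    # CSV: one value per row, written as decimal text.
    text = io.StringIO()
    writer = csv.writer(text)
    for v in values:
        writer.writerow([repr(v)])
    csv_bytes = text.getvalue().encode("utf-8")

    # Packed binary: exactly 8 bytes per value (little-endian double).
    packed = struct.pack("<%dd" % len(values), *values)

    # Gzipped JSON: still text underneath, but compressed.
    json_gz = gzip.compress(json.dumps(values).encode("utf-8"))

    return {
        "csv": len(csv_bytes),
        "binary": len(packed),
        "json_gz": len(json_gz),
    }


# Assumed sample data: 10,000 doubles, most with long decimal expansions.
values = [i / 7.0 for i in range(10000)]
sizes = encoded_sizes(values)
for name, size in sizes.items():
    print(name, size)
```

The packed form needs no string-to-number conversion on load, which 
is the speed point Andre raised. How well gzipped JSON does depends 
heavily on how repetitive the data is, so it is worth measuring on 
your own files.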