[GSoC] Dataframes for D
Prateek Nayak
lelouch.cpp at gmail.com
Thu May 30 04:26:02 UTC 2019
On Wednesday, 29 May 2019 at 22:31:57 UTC, Laeeth Isharc wrote:
> On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
>> On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
>>> [...]
>>
>> The outlook looks really great. CSV as a starting point is
>> good. I am not sure whether adding a second text format like
>> JSON brings real value.
>> Text formats have two disadvantages. First, converting strings
>> to numbers and vice versa slows down loading and saving files.
>> If I remember correctly, saving a file as CSV needed around 50
>> seconds, while saving the same data as a binary file (Parquet)
>> took 3 seconds. The second issue is file size: a CSV of 580 MB
>> shrinks to 280 MB when saved as e.g. a Parquet file.
>>
>> The file size isn't an issue on your local file system, but it
>> is a big issue when storing these files in the cloud, e.g.
>> Amazon S3. The file size will cause longer transfer times.
>>
>> Adding a packed binary format would be great, if possible.
>>
>> Kind regards
>> Andre
>
> Interoperability with pandas would be important in our use
> case, and I think probably for quite a few others. So yes I
> agree that it's not ideal to use JSON but lots of things are
> not ideal. And I think people use JSON, msgpack and hdf5 for
> interop with pandas. CSV is more complicated in practice than
> one might initially think. Finally of course Excel interop.
>
> It's not my cup of tea but gzipped JSON is quite compact...
>
> I have a little streaming msgpack to our own Variable type
> deserialiser (it can store a primitive or a variant). It's not
> long and I could share that.
>
> I don't think the initial dataframe needs to have all this
> stuff in it from day one necessarily.
>
> It's worth reusing another JSON implementation. Since you work
> with Ilya - asdf isn't bad and quite fast though the error
> messages leave something to be desired.
>
> You might see John Colvin's repo from his talk on nogc map,
> filter, fold, etc. He didn't do chunkBy yet.
Interop is important, and I see many binary formats being
suggested as a way of achieving it.
Pandas I/O tools cover some of the popular formats:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
Apache Arrow is something I hadn't heard of before. I'll look
into it, but so far I couldn't find any way to integrate it with
D right away.
I would love to hear which binary format the community would
like added as a way of interop in the future.
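
To make the text-versus-binary trade-off quoted above a bit more
concrete, here is a minimal Python sketch using only the standard
library. It is not Parquet (which also adds columnar encoding and
compression) and it is not taken from any dataframe
implementation. The function names and the sample data are
hypothetical, chosen only for illustration. It serializes the same
column of doubles once as CSV text and once as packed 8-byte
binary values, so the size difference and the string-to-number
round trip are visible.

```python
import array
import csv
import io
import random


def encode_csv(values):
    """Serialize floats as one-column CSV text; return the UTF-8 bytes."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for v in values:
        # repr() gives the shortest string that round-trips the float exactly.
        writer.writerow([repr(v)])
    return buf.getvalue().encode("utf-8")


def decode_csv(data):
    """Parse the CSV bytes back into floats (string -> number conversion)."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    return [float(row[0]) for row in reader]


def encode_binary(values):
    """Pack floats as raw C doubles in native byte order, no text involved."""
    return array.array("d", values).tobytes()


def decode_binary(data):
    """Unpack raw C doubles back into a list of floats."""
    out = array.array("d")
    out.frombytes(data)
    return out.tolist()


# Hypothetical sample column: 10,000 pseudo-random doubles.
random.seed(0)
values = [random.random() * 1e6 for _ in range(10_000)]

csv_bytes = encode_csv(values)
bin_bytes = encode_binary(values)

print("CSV bytes:   ", len(csv_bytes))
print("binary bytes:", len(bin_bytes))
```

The packed form stores each double in the platform's C double
size (8 bytes on common platforms), while the text form needs
one character per digit plus a line terminator, and every value
has to be parsed again on load. Real-world ratios depend on the
data and on compression, so the numbers in the quoted message
should not be expected to match this toy case.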