[GSoC] Dataframes for D

Prateek Nayak lelouch.cpp at gmail.com
Tue Jun 18 05:10:02 UTC 2019


On Tuesday, 11 June 2019 at 04:35:22 UTC, Prateek Nayak wrote:
> On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
>> On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
>>> On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
>>>> [snip]
>>>>
>>>> Im not familiar with data frames but this sounds like array 
>>>> of structs
>>>
>>> I think I was thinking of it more like a struct of arrays, 
>>> but I think an array of structs may also work (see my 
>>> responses to Prateek)...
>>
>> Due to the popularity of heterogeneous DataFrames, we decided 
>> to take care of it in the early stages of development before 
>> it's too late.
>>
>> The heterogeneous DataFrame is now live at: 
>> https://github.com/Kriyszig/magpie/tree/experimental
>>
>> Some parts are still under development but the goals in the 
>> road maps will be reached on time.
>>
>> ---------------------------------
>> Summing up the first week of GSoC
>> ---------------------------------
>>
>> * Base and file I/O ops were built for homogeneous DataFrame
>> * Based on the type of data the community has worked with, it 
>> seemed evident homogeneous DataFrames weren't gonna cut it. So 
>> a rebuild was initiated over the weekend to allow for 
>> heterogeneous data.
>> * The API was overhauled to allow for Heterogeneous DataFrames.
>> * New parser that can parse selective columns.
>>
>> The code will land in master once it's cleaned up and is 
>> deemed stable.
>>
>> ----------------------------------------
>> Things that will be dealt with this week
>> ----------------------------------------
>>
>> This week will be for:
>>
>> * Improving Parser
>> * Overhaul code structure (in Experimental)
>> * Adding setters for data and index in DataFrame
>> * Adding functions to create a multi-indexed DataFrame, the 
>> same way one can do in Python.
>> * Adding Documentation and examples
>> * Index Operations
>> * Retrieve rows and columns
>>
>> The last one will set in motion the implementation of Column 
>> Binary ops of the form:
>> df["Index1"] = df["Index2"] + df["Index3"];
>>
>> Meanwhile, if you guys have any more suggestions, please feel 
>> free to contact me - you can use this thread, open an issue on 
>> GitHub, reach out to me on Slack (Prateek Nayak) or you can 
>> email me directly (lelouch.cpp at gmail.com)
>
> I have decided to make weekly updates for the community to know 
> the progress of the project.
>
> -------------------------------
> Summing up last week's progress
> -------------------------------
>
> * Brought Heterogeneous DataFrames to the same point as the 
> previous Homogeneous DataFrame development.
> * Assignment ops - both direct and indexed.
> * Retrieve an entire column using index operation.
> * Retrieve an entire row using index operation.
> * Small redesign here and there to reduce code size.
>
> The index op for rows and columns will return the value as an 
> Axis structure.
> Binary Ops on Axis structure will basically translate to Column 
> and Row binary operation on DataFrame.
>
> ---------------------------------------
> Tasks that will be dealt with this week
> ---------------------------------------
>
> * Column and row binary operation
> * There are a few places where O(n) operations can be converted 
> to O(log n) operations - these few optimisations will be done.
> * Updating the Documentation with the developments of the week.
>
>
> So far no major roadblocks have been encountered.
>
> P.S.
> Found this interesting Reddit post regarding file formats on 
> r/datasets: 
> https://www.reddit.com/r/datasets/comments/bymru3/are_there_any_data_formats_for_storing_text_worth/

-------------
Weekly Update
-------------

This week, development was a bit slower compared to the last 
couple of weeks - I had to attend college for a couple of days 
and it took more time than I would have liked.
That said, all of last week's goals were achieved.

-----------------------
What happened last week
-----------------------

* Redesigned Axis - the structure that returns the values during 
column binary operations.
* Added binary operations for Axis [these are equivalent to binary 
operations on DataFrame].
* Tested that the binary operations work correctly on DataFrames.
* Fixed a couple of tiny bugs here and there.
* Added more ways to build an index.
* Added a HashMap-like implementation to check for duplicates in 
the index.
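
As a rough illustration of the duplicate check in the last item, here is a minimal sketch that uses D's built-in associative array as a set. The function name hasDuplicates is hypothetical and this is not the code in magpie; it only shows the general idea of a hash-based lookup replacing a pairwise comparison of index labels.

```d
/// Returns true if any index label occurs more than once.
/// Hypothetical helper for illustration, not part of magpie's API.
bool hasDuplicates(string[] labels)
{
    bool[string] seen;          // built-in associative array used as a set
    foreach (label; labels)
    {
        if (label in seen)
            return true;        // second occurrence of this label
        seen[label] = true;
    }
    return false;
}

unittest
{
    assert(!hasDuplicates(["a", "b", "c"]));
    assert(hasDuplicates(["a", "b", "a"]));
    assert(!hasDuplicates([]));
}
```

Each lookup and insertion is a hash operation, so the whole check is roughly linear in the number of labels on average.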

-------------------
Goals for this week
-------------------
* Work on apply, which applies a function to the values in a 
row/column or the entire DataFrame.
* Add some built-in operations [operations like mean and median 
that are essential].
* Optimize parts of DataFrame.
* Add helpers that will eventually trigger the development of 
groupBy.
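
To make the first two goals concrete, here is a small sketch of what apply, mean and median could look like on a single column stored as a plain array. The names applyColumn, mean and median and their signatures are assumptions made for this example; the eventual DataFrame API may look quite different.

```d
import std.algorithm : map, sort, sum;
import std.array : array;

/// Applies fn to every value of a column and returns the results.
/// Hypothetical helper, not magpie's apply.
auto applyColumn(alias fn, T)(const(T)[] column)
{
    return column.map!fn.array;
}

/// Arithmetic mean of a non-empty column.
double mean(const(double)[] column)
{
    assert(column.length > 0, "empty column");
    return column.sum / column.length;
}

/// Median of a non-empty column; sorts a copy, the input is untouched.
double median(const(double)[] column)
{
    assert(column.length > 0, "empty column");
    auto sorted = column.dup;
    sort(sorted);
    immutable mid = sorted.length / 2;
    return sorted.length % 2 != 0
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2;
}

unittest
{
    assert(applyColumn!(x => x * 2)([1, 2, 3]) == [2, 4, 6]);
    assert(mean([1.0, 2.0, 3.0, 4.0]) == 2.5);
    assert(median([3.0, 1.0, 2.0]) == 2.0);
    assert(median([4.0, 1.0, 3.0, 2.0]) == 2.5);
}
```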

----------
Roadblocks
----------
There were a few moments where I ran into trouble while returning 
the Axis structure because of its variable return type. My mentors 
helped me a lot when I got stuck with the implementation. There 
were also a couple of bugs which were simple but still took a 
while to solve.
Other than that, things went really smoothly.
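
For readers wondering what a variable return type looks like in practice, here is a minimal sketch of one way it can arise: an element-wise binary operation whose result type depends on both operand types. The struct below is a simplified, hypothetical stand-in for Axis and is not the implementation in magpie.

```d
/// Simplified stand-in for an Axis: one row or column of values of type T.
/// Hypothetical type for illustration only.
struct Axis(T)
{
    T[] data;

    /// Element-wise binary operation. The element type of the result is
    /// whatever `T op U` yields, so the return type varies per call.
    auto opBinary(string op, U)(Axis!U rhs) const
    {
        assert(data.length == rhs.data.length, "length mismatch");
        alias R = typeof(mixin("T.init " ~ op ~ " U.init"));
        auto result = new R[data.length];
        foreach (i; 0 .. data.length)
            result[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return Axis!R(result);
    }
}

unittest
{
    auto ints = Axis!int([1, 2, 3]);
    auto dbls = Axis!double([0.5, 1.5, 2.5]);

    auto total = ints + dbls;   // int + double yields double
    static assert(is(typeof(total) == Axis!double));
    assert(total.data == [1.5, 3.5, 5.5]);

    auto prod = ints * ints;    // int * int stays int
    static assert(is(typeof(prod) == Axis!int));
    assert(prod.data == [1, 4, 9]);
}
```

Using auto as the return type lets the compiler work out the concrete Axis instantiation for each combination of operands.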

---------------------
Community Suggestions
---------------------

Petar Kirov [ZombineDev] suggested using static arrays as a way 
to declare a DataFrame.
Soon it will be possible to declare a DataFrame as:

DataFrame!(int[5], double[10]) df;

[Implemented in BinaryOps branch - Will land in master with 
BinaryOps implementation soon]
Thank you for the suggestion.
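
For the curious, here is a minimal sketch of how a template could accept static array types like the declaration above. It assumes that int[5] stands for five columns of int, which is an assumption made for this example. ColumnTypes, Types and columns are hypothetical names, and the struct is a stand-in rather than the DataFrame from the BinaryOps branch.

```d
import std.meta : AliasSeq, Repeat, staticMap;

/// Expands static array types into repeated element types,
/// e.g. (int[2], double[1]) becomes (int, int, double).
/// Hypothetical helper for illustration only.
template ColumnTypes(Specs...)
{
    static if (Specs.length == 0)
        alias ColumnTypes = AliasSeq!();
    else static if (is(Specs[0] == E[n], E, size_t n))
        alias ColumnTypes = AliasSeq!(Repeat!(n, E),
                                      ColumnTypes!(Specs[1 .. $]));
    else
        static assert(0, "expected a static array type such as int[5]");
}

private alias ArrayOf(T) = T[];

/// Stand-in DataFrame: one dynamic array per expanded column type.
struct DataFrame(Specs...)
{
    alias Types = ColumnTypes!Specs;
    alias Storage = staticMap!(ArrayOf, Types);
    Storage columns;
}

unittest
{
    alias DF = DataFrame!(int[5], double[10]);
    static assert(DF.Types.length == 15);
    static assert(is(DF.Types[0] == int));
    static assert(is(DF.Types[4] == int));
    static assert(is(DF.Types[5] == double));

    DF df;
    df.columns[5] ~= 1.5;       // first double column
    assert(df.columns[5] == [1.5]);
}
```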

