[GSoC] Dataframes for D
Prateek Nayak
lelouch.cpp at gmail.com
Tue Jun 11 04:35:22 UTC 2019
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
> On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
>> On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
>>> [snip]
>>>
>>> Im not familiar with data frames but this sounds like array
>>> of structs
>>
>> I think I was thinking of it more like a struct of arrays, but
>> I think an array of structs may also work (see my responses to
>> Prateek)...
>
> Due to the popularity of heterogeneous DataFrames, we decided
> to take care of it in the early stages of development before it's
> too late.
>
> The heterogeneous DataFrame is now live at:
> https://github.com/Kriyszig/magpie/tree/experimental
>
> Some parts are still under development but the goals in the
> road maps will be reached on time.
>
> ---------------------------------
> Summing up the first week of GSoC
> ---------------------------------
>
> * Base and file I/O ops were built for homogeneous DataFrame
> * Based on the type of data the community has worked with, it
> seemed evident homogeneous DataFrames weren't gonna cut it. So
> a rebuild was initiated over the weekend to allow for
> heterogeneous data.
> * The API was overhauled to allow for Heterogeneous DataFrames.
> * New parser that can parse selective columns.
>
> The code will land in master once it's cleaned up and is deemed
> stable.
>
> ----------------------------------------
> Things that will be dealt with this week
> ----------------------------------------
>
> This week will be for:
>
> * Improving Parser
> * Overhaul code structure (in Experimental)
> * Adding setters for data and index in DataFrame
> * Adding functions to create a multi-indexed DataFrame, the
> same way one can do in Python.
> * Adding Documentation and examples
> * Index Operations
> * Retrieve rows and columns
>
> The last one will set in motion the implementation of Column
> Binary ops of the form:
> df["Index1"] = df["Index2"] + df["Index3"];
>
> Meanwhile, if you guys have any more suggestions, please feel free
> to contact me - you can use this thread, open an issue on
> GitHub, reach out to me on Slack (Prateek Nayak) or you can
> email me directly (lelouch.cpp at gmail.com)

I have decided to post weekly updates so the community can
follow the progress of the project.

-------------------------------
Summing up last week's progress
-------------------------------

* Brought heterogeneous DataFrames to the same point as the
previous homogeneous DataFrame development.
* Assignment ops - both direct and indexed.
* Retrieval of an entire column using the index operation.
* Retrieval of an entire row using the index operation.
* Small redesigns here and there to reduce code size.

The index op for rows and columns returns its result as an
Axis structure. Binary ops on Axis structures will translate
to column and row binary operations on the DataFrame.
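
To illustrate the idea, here is a minimal sketch of how element-wise
binary ops on such a structure could look in D. The Axis struct below
is a hypothetical stand-in, not the actual magpie type: it assumes a
single element type per axis and uses plain operator overloading via
opBinary.

```d
// Hypothetical stand-in for the Axis structure mentioned above.
// Assumption: all elements share one type T; the real magpie type
// may be laid out quite differently.
struct Axis(T)
{
    T[] data;

    // Element-wise +, -, * and / between two axes of equal length.
    Axis opBinary(string op)(Axis rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        assert(data.length == rhs.data.length, "axis length mismatch");
        auto result = new T[](data.length);
        foreach (i; 0 .. data.length)
            result[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return Axis(result);
    }
}

void main()
{
    auto a = Axis!double([1.0, 2.0, 3.0]);
    auto b = Axis!double([10.0, 20.0, 30.0]);

    // The sum is computed element by element.
    auto c = a + b;
    assert(c.data == [11.0, 22.0, 33.0]);
}
```

With something like this in place, a column expression of the form
df["Index1"] = df["Index2"] + df["Index3"] reduces to one Axis
operation followed by an indexed assignment.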

---------------------------------------
Tasks that will be dealt with this week
---------------------------------------

* Column and row binary operations.
* There are a few places where O(n) operations can be converted
to O(log n) operations - these optimisations will be made.
* Updating the documentation with the developments of the week.
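
The post does not say which operations are meant by the O(n) to
O(log n) item. One common case is looking up a label in an index, so
here is a rough sketch of that, assuming the index labels are kept
sorted. The findSorted helper is hypothetical; assumeSorted and
lowerBound come from Phobos (std.range).

```d
import std.range : assumeSorted;

// Hypothetical helper: returns the position of `key` in a sorted
// index, or -1 if it is absent. Assumption: `sortedIndex` is already
// sorted in ascending order.
ptrdiff_t findSorted(const(string)[] sortedIndex, string key)
{
    // lowerBound does a binary search and yields all elements < key,
    // so its length is the position where `key` would sit.
    auto lower = sortedIndex.assumeSorted.lowerBound(key);
    immutable pos = lower.length;

    if (pos < sortedIndex.length && sortedIndex[pos] == key)
        return cast(ptrdiff_t) pos;
    return -1;
}

void main()
{
    auto index = ["Index1", "Index2", "Index3"];
    assert(findSorted(index, "Index2") == 1);
    assert(findSorted(index, "Missing") == -1);
}
```

A linear scan would compare against every label in the worst case,
while the binary search halves the candidate range at each step.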

So far no major roadblocks have been encountered.

P.S.
Found this interesting Reddit post regarding file formats on
r/datasets:
https://www.reddit.com/r/datasets/comments/bymru3/are_there_any_data_formats_for_storing_text_worth/