Pandas example of groupby
Laeeth Isharc via Digitalmars-d
digitalmars-d at puremagic.com
Mon Jan 26 13:05:04 PST 2015
Thank you for the thoughtful reply.
I meant lost in terms of the extra processing time and memory
consumption needed to achieve the fairly common use case of a
pandas-style groupby or pivot table (i.e. what you get if you sort
the data by group and then run D's groupBy).
If you first sort the data and then run groupBy, it seems likely
that in some cases this will be more expensive than a straight
pandas-style groupby, and my question was how much more
expensive.
If you needed to sort the data anyway, because you were producing
a report, this doesn't matter. But if you didn't need it sorted
because you just wanted to produce statistics on each group (eg
median, inter quartile difference) then it might.
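As a rough sketch of what I mean by statistics per group without sorting the whole dataset first (Python rather than D, and the name group_medians is just something I made up for illustration): one pass collects the values per key in a hash table, and only each group's own values are ordered when taking the median.

```python
from collections import defaultdict
from statistics import median

def group_medians(pairs):
    """Median per group from unsorted (key, value) pairs.

    Hypothetical helper: one pass to bucket values by key, then a
    median per bucket. The dataset as a whole is never sorted.
    """
    values = defaultdict(list)
    for key, value in pairs:
        values[key].append(value)
    return {key: median(vs) for key, vs in values.items()}

# Unsorted input: the keys are interleaved.
pairs = [("x", 3), ("y", 10), ("x", 1), ("y", 20), ("x", 2)]
medians = group_medians(pairs)
```

This assumes the per-group values fit in memory, which is exactly the limitation under discussion.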
I hadn't thought clearly before about in memory vs not.
I guess for pandas style groupby, having to sort a dataset
externally just to use groupby is a comparative disaster because
it seems to involve much more I/O than is needed. If I am not
mistaken, this might be one reason to have also at some point a
separate function that does pandas groupby efficiently for data
that may not be held in memory.
Can you explain why for hashgroupby the entire dataset needs to
fit in memory? Writing this naively, one could have an
associative array of lists of the relevant items, keyed by group.
You go through each line once, add its index to the list for that
group, then at the end sort the group keys and return the indices
(or whatever the slice equivalent is) of the lines matching each
group. But I may be missing some obvious points.
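A minimal sketch of that naive approach, in Python for brevity (hash_group_indices is a made-up name, and this is not a proposal for the actual D API): only the keys and the integer indices are kept in the table, not the lines themselves.

```python
from collections import defaultdict

def hash_group_indices(rows, key):
    """Map each group key to the indices of the rows in that group.

    Hypothetical helper: a single pass over rows, storing only
    indices. The group keys are sorted at the end; rows are not.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[key(row)].append(i)
    # Dicts keep insertion order, so this yields keys in sorted order.
    return {k: groups[k] for k in sorted(groups)}

rows = [("b", 1.0), ("a", 2.0), ("b", 3.0), ("a", 4.0), ("c", 5.0)]
index_by_group = hash_group_indices(rows, key=lambda r: r[0])
```

Here rows is an in-memory list purely for the sake of the example; the same loop would work over any source that can be read once in order and later fetched by index.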
I.e. my imagined use case has large data for each line, but the
hash table and lists would fit in memory.
If you are thinking of a case where the data per line is not that
large, but there are so many lines that it becomes an external
sort, then I see your point. However, I suggest that the use case
I describe - whilst not the only one - is important enough at
least to briefly consider.
> The big disadvantage, however, is that if your data is not sorted,
> then the groupings returned won't be quite what you'd expect, since
> the predicate is evaluated only over consecutive runs of elements,
> not globally over the entire data set.
Well - it is a completely different function...
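To make the difference concrete with a Python analogue (itertools.groupby also groups only consecutive runs; I am using it as a stand-in here, not claiming it matches D's groupBy in every detail):

```python
from itertools import groupby

data = ["a", "a", "b", "a", "c", "c"]

# On unsorted data, each consecutive run becomes its own group,
# so the key "a" appears twice.
runs = [(k, len(list(g))) for k, g in groupby(data)]

# Sorting first gives one group per distinct key.
sorted_runs = [(k, len(list(g))) for k, g in groupby(sorted(data))]
```

With this input, runs has four entries (with "a" split in two) while sorted_runs has three, one per key, which is the pandas-style result.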
I really do believe that pandas style dataframe type processing
could be an interesting area for D, and most of what is needed to
construct this is already here.
> As far as I know, the current groupBy docs explain quite clearly
> what it does. If you find it still inadequate or unclear, please
> file bugs against it so that we can look into improving the docs.
Sorry - I hadn't thought to check! Will do so now.
Laeeth.