Pandas example of groupby

Laeeth Isharc via Digitalmars-d digitalmars-d at puremagic.com
Mon Jan 26 13:05:04 PST 2015


Thank you for the thoughtful reply.

By "lost" I meant the extra processing time and memory 
consumption needed to achieve the fairly common use case of a 
pivot-table or pandas-style groupby (i.e. what you get if you sort 
the data by group and then run D's groupBy).

If you first sort the data and then run groupBy, it seems likely 
that in some cases this will be more expensive than a straight 
pandas-style groupby, and my question was how much more 
expensive.

If you needed to sort the data anyway, because you were producing 
a report, this doesn't matter.  But if you didn't need it sorted 
because you just wanted to produce statistics on each group (e.g. 
median, interquartile range) then it might.
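
To illustrate that second case, here is a minimal sketch in Python 
(not D; the function name and the sample rows are made up for 
illustration).  It computes per-group statistics in one pass over 
unsorted input, with no global sort of the dataset.  Note that it 
does hold each group's values in memory.

```python
from collections import defaultdict
from statistics import median, quantiles

def group_stats(rows):
    """Map each group key to (median, interquartile range).

    rows is any iterable of (key, value) pairs in arbitrary order.
    """
    values = defaultdict(list)
    for key, value in rows:
        # One pass over the input; no global sort by key.
        values[key].append(value)

    stats = {}
    for key, vals in values.items():
        # quantiles(n=4) returns the three quartile cut points.
        q1, _, q3 = quantiles(vals, n=4)
        stats[key] = (median(vals), q3 - q1)
    return stats

# Hypothetical, deliberately unsorted sample data.
rows = [("b", 10), ("a", 1), ("b", 30), ("a", 5),
        ("a", 3), ("b", 20), ("a", 7), ("b", 40)]
stats = group_stats(rows)
print(stats)
```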

I hadn't thought clearly before about the in-memory vs external 
case.

I guess for a pandas-style groupby, having to sort a dataset 
externally just to use groupBy is a comparative disaster because 
it seems to involve much more I/O than is needed.  If I am not 
mistaken, this might be one reason to have, at some point, a 
separate function that does a pandas-style groupby efficiently for 
data that may not be held in memory.

Can you explain why the entire dataset needs to fit in memory for 
hashgroupby?  Written naively, one could keep an associative 
array, keyed by group, of lists of the relevant items.  You go 
through each line once and add its index to the list for that 
group; at the end you sort the group keys and return the indices 
(or whatever the slice equivalent is) of the lines matching each 
group.  But I may be missing some obvious points.

I.e. my imagined use case has large data for each line, but the 
hash table and lists would fit in memory.
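
A rough sketch of the naive scheme described above, written in 
Python for brevity rather than D.  The function name 
hash_group_indices and the sample rows are hypothetical; this is 
not an existing library function.  Only the group keys and integer 
indices are retained, so each record can be discarded as soon as 
its key has been read.

```python
from collections import defaultdict

def hash_group_indices(records, key):
    """Return (group_key, [record indices]) pairs, sorted by key.

    records may be a lazy iterator (e.g. lines streamed from a
    file); only keys and indices are kept in memory.
    """
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        # Single pass: remember where each group's records are.
        groups[key(rec)].append(i)
    # Sort only the group keys, not the data itself.
    return sorted(groups.items())

# Hypothetical unsorted input, consumed as a stream.
rows = [("b", 10), ("a", 1), ("b", 20), ("a", 3), ("c", 7)]
result = hash_group_indices(iter(rows), key=lambda r: r[0])
print(result)  # [('a', [1, 3]), ('b', [0, 2]), ('c', [4])]
```

The cost is memory proportional to the number of lines (one index 
each) plus the number of distinct groups, which is why it only 
helps when the per-line data is large relative to its key.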

If you are thinking of a case where the data per line is not that 
large, but there are so many lines that it becomes an external 
sort, then I see your point.  However, I suggest that the use case 
I describe - whilst not the only one - is important enough at 
least to briefly consider.

> The big disadvantage, however, is that if your data is not 
> sorted, then the groupings returned won't be quite what you'd 
> expect, since the predicate is evaluated only over consecutive 
> runs of elements, not globally over the entire data set.

Well - it is a completely different function...
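
To make the difference concrete, here is a small sketch using 
Python's itertools.groupby, which also groups only consecutive 
runs and so stands in here for the behaviour described in the 
quote.  It is not D code, and the sample data is made up.

```python
from itertools import groupby

data = ["a", "b", "a", "a", "b"]

# Consecutive-run grouping on unsorted input: a key can appear
# more than once, once per run.
runs = [(k, len(list(g))) for k, g in groupby(data)]
print(runs)  # [('a', 1), ('b', 1), ('a', 2), ('b', 1)]

# Sorting first gives one group per key, which is the
# pandas-style result, at the cost of the sort.
sorted_runs = [(k, len(list(g))) for k, g in groupby(sorted(data))]
print(sorted_runs)  # [('a', 3), ('b', 2)]
```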

I really do believe that pandas style dataframe type processing 
could be an interesting area for D, and most of what is needed to 
construct this is already here.


> As far as I know, the current groupBy docs explain quite 
> clearly what it does.  If you find it still inadequate or 
> unclear, please file bugs against it so that we can look into 
> improving the docs.


Sorry - I hadn't thought to check!  Will do so now.


Laeeth.

