Pandas example of groupby

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Mon Jan 26 12:31:03 PST 2015


On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote:
[...]
> My question is how much is lost by doing it in two steps (sort,
> groupby) rather than one.
[...]

I'm not sure I understand what you mean by "lost"... I assume you're not
talking about data loss, since that's not applicable either way. In any
case, as with all things, there's a tradeoff.

Building a hash table of groups, as some people demand, has the
advantage that the grouping predicate is applied globally, and no
sorting (O(n log n)) is needed. The disadvantage is that the entire data
set has to fit in memory, and you cannot process it lazily. Before
you've seen all of the data, you cannot know whether there are more
distinct groups to come, or whether currently known groups have more
members. Once the data size gets too large, you run into trouble. Plus,
some people are averse to gratuitous GC use in Phobos algorithms.
Perhaps some of this is misplaced, but that's the perception.
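To make the tradeoff concrete, here is a minimal sketch of the hash-table
approach in Python (used here only because it is compact and relates to the
pandas framing of this thread; `group_globally` is a hypothetical helper,
not anything in Phobos or pandas). It is eager and must hold every group
in memory, but the grouping is global regardless of input order:

```python
from collections import defaultdict

def group_globally(items, key):
    """Eagerly group items by key(item), regardless of input order.

    Requires the whole input to fit in memory, but applies the
    grouping predicate globally across the entire data set.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

# Even with the 1s split apart, they all land in one group.
data = [1, 1, 2, 2, 1]
print(group_globally(data, key=lambda x: x))
# {1: [1, 1, 1], 2: [2, 2]}
```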

The current implementation has the advantage that it requires very
little memory to work, and can process data lazily. It only needs to see
a tiny portion of the entire data set to do its work. Each group
returned is also lazy, so it doesn't need to store the whole group in
its entirety. It can handle 10GB long groups in a 100GB data set
streamed over the network with very little memory, whereas such a
monstrous thing would be infeasibly resource-hungry if you need to
allocate an on-disk hashtable (very inefficient!), iterate over the
entire dataset, and then be I/O bound as you iterate over each group
loading group members from disk. The big disadvantage, however, is that
if your data is not sorted, then the groupings returned won't be quite
what you'd expect, since the predicate is evaluated only over
consecutive runs of elements, not globally over the entire data set.
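Python's `itertools.groupby` happens to have the same consecutive-run
semantics as the current Phobos groupBy, so it makes a compact illustration
of that last point. Note how the trailing 1 ends up in a separate group
because it is not adjacent to the other 1s:

```python
from itertools import groupby

# itertools.groupby, like D's groupBy, only groups *consecutive*
# elements with equal keys; it never buffers the whole input.
data = [1, 1, 2, 2, 1]
runs = [(k, list(g)) for k, g in groupby(data)]
print(runs)
# [(1, [1, 1]), (2, [2, 2]), (1, [1])]
```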

Which one is better? That depends on your specific application and its
needs. For some applications, there is simply no way around the fact
that you have a large dataset with randomly-ordered elements, so the
sorting has to be done *somewhere*. For other applications, you *can*
stream your data in sorted order -- or perhaps they don't care whether
groupings are global -- so you can take advantage of the fact that
the current groupBy is extremely cheap. It requires minimal memory to do
its work, and it can do so inside a pipeline without requiring anything
more than an input range.
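The two-step pattern the original question asked about (sort, then group)
can be sketched in the same Python terms as a hedged illustration: once
equal keys are adjacent, the cheap consecutive-run grouping coincides with
global grouping, at the cost of an O(n log n) sort up front.

```python
from itertools import groupby

data = [3, 1, 2, 1, 3, 1]
# Sorting first makes equal keys adjacent, so the consecutive-run
# grouping that follows matches the global grouping.
grouped = {k: list(g) for k, g in groupby(sorted(data))}
print(grouped)
# {1: [1, 1, 1], 2: [2], 3: [3, 3]}
```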

Basically, std.algorithm is (currently) primarily focused on linear
algorithms.  While there *are* some sorting algorithms and sublinear
algorithms that take advantage of sorted data, it is still primarily
focused on linear processing. There is no good abstraction as yet for
more complex processing like multidimensional or graph traversals. As a
result, it is best suited for stream-based processing.

Generally speaking, most application code is primarily concerned with
linear processing (copy this text from here to there, scan this text for
a search string, render this line of text to the
screen, draw this line around the window, move this item along in the
processing queue, etc.). Non-linear computations generally take place
only in dedicated computing modules of limited scope in the application,
where one tends to write dedicated algorithms rather than reuse stock
generic algorithms.


> In any case, the documentation should be very clear on what groupby
> does, and how the user can do what he might be expecting to achieve,
> coming from a different framework.
[...]

As far as I know, the current groupBy docs explain quite clearly what it
does.  If you find it still inadequate or unclear, please file bugs
against it so that we can look into improving the docs.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge

