Profiling graph library

Joseph Rushton Wakeling joseph.wakeling at webdrake.net
Tue Jul 30 06:16:42 PDT 2013


Hello all,

I thought I'd share some profiling results from the graph/network library that
I've been working on -- see:
https://github.com/WebDrake/Dgraph
http://braingam.es/2013/07/complex-networks-in-d/
http://braingam.es/2013/07/complex-networks-in-d-part-2-building-the-network/

I'd like to get others' thoughts and insight on these -- obviously I have some
ideas of my own about how to interpret them and move forward, but it's always
good to have feedback (and there are several things I'd like to make sure I have
my facts straight on before making any more "public" comment).

The basic data structure, which is taken from the igraph library, consists of a
pair of arrays of head and tail vertices for edges, together with indices that
sort the edges according to head or tail vertex, and running sums of the number
of edges whose head/tail ID is less than a given value:
https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L33-L38

Taken together, this offers a very memory-efficient structure that should also be
able to minimize the total number of allocations required to build a graph.
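
In rough outline the structure amounts to something like the following (the field
names here are illustrative rather than the exact ones used in the code):

    struct GraphData
    {
        size_t[] head;       // head[e] = head vertex of edge e
        size_t[] tail;       // tail[e] = tail vertex of edge e
        size_t[] indexHead;  // edge IDs sorted by head vertex
        size_t[] indexTail;  // edge IDs sorted by tail vertex
        size_t[] sumHead;    // sumHead[v] = number of edges with head ID < v
        size_t[] sumTail;    // sumTail[v] = number of edges with tail ID < v
    }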

When I started working on the code, the initial goal was to have something that
gave a very simple and obvious representation of the algorithms at work, which
would be easy to understand.  So, for example, I used std.algorithm.map quite
freely to give an "easy" representation of how a vertex's neighbours are
calculated:
https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L260-L286
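
The idea, in outline, is something like this (a simplified sketch written against
the struct sketched above; the real Dgraph code differs in detail):

    import std.algorithm : map;     // module-level imports
    import std.range : chain, iota;

    // Inside the GraphData struct sketched above: the neighbours of v are
    // the tails of edges whose head is v, chained with the heads of edges
    // whose tail is v.
    auto neighbours(in size_t v) const
    {
        return chain(
            iota(sumHead[v], sumHead[v + 1]).map!(a => tail[indexHead[a]]),
            iota(sumTail[v], sumTail[v + 1]).map!(a => head[indexTail[a]])
        );
    }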

It turns out this results in a fairly major speed hit that stems from a variety
of factors.  The following is the result of profiling a simple function that
just iterates over all the neighbours of each vertex in turn, and sums the
neighbour IDs.  The code is attached in neighbours.d.
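
In outline the benchmark does something like this (only a sketch -- the attached
neighbours.d is the authoritative version, and names like vertexCount are
illustrative):

    // Sum the IDs of every neighbour of every vertex; the return value
    // exists mainly to stop the compiler optimising the loops away.
    size_t checkNeighbours(G)(ref G g)
    {
        size_t total = 0;
        foreach (v; 0 .. g.vertexCount)
            foreach (n; g.neighbours(v))
                total += n;
        return total;
    }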

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.20      0.24     0.24 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda46TmZ10__lambda46MFNbNfmZxm
 18.40      0.47     0.23
_D2gc3gcx2GC12mallocNoSyncMFmkPmZPv
 15.60      0.67     0.20 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda48TmZ10__lambda48MFNbNfmZxm
  9.60      0.79     0.12   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  8.00      0.89     0.10 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result
  8.00      0.99     0.10                             _D2gc3gcx2GC6mallocMFmkPmZPv
  4.80      1.05     0.06                             _D2gc6gcbits6GCBits3setMFmZv
  4.80      1.11     0.06                             _D2gc3gcx3Gcx11fullcollectMFZm
  4.00      1.16     0.05                             _D2gc6gcbits6GCBits4testMFmZm
  2.40      1.19     0.03
_D4core4sync5mutex5Mutex6unlockMFNeZv
  1.60      1.21     0.02                             _D2gc6gcbits6GCBits5allocMFmZv
  1.20      1.22     0.02
_D3std4conv15__T6toImplTiTkZ6toImplFNaNfkZi15__dgliteral2501MFNaNfZC6object9Throwable
  1.20      1.24     0.02                             gc_malloc
  0.40      1.24     0.01
_D4core4sync5mutex5Mutex12MonitorProxy11__xopEqualsFKxS4core4sync5mutex5Mutex12MonitorProxyKxS4core4sync5mutex5Mutex12MonitorProxyZb
  0.40      1.25     0.01
_D4core4sync5mutex5Mutex4lockMFNeZv
  0.40      1.25     0.01                             gc_clrAttr
---------------------------------------------------------------------------------

What's striking about this is that (i) the various range/map-based functions carry
a substantial hit in and of themselves, and (ii) there's also a fairly nasty hit
from the GC.  (If I run this with a directed graph, which avoids the use of
std.range.chain, there's a little bit of saving, but only a very little.)

As an experiment, I wrote some alternative code that provides a cache for all
the neighbours of all the vertices.  This takes an overall memory hit but means
that one can avoid having to recalculate the list of neighbours so long as the
graph itself doesn't change:
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L42-L53
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L358-L379
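
The gist of the cache is roughly as follows (a minimal sketch under the
assumptions of the earlier struct; the actual cache branch stores things rather
differently):

    // Inside the same struct: neighbourCache[v] is filled on demand and
    // cacheValid[v] is cleared whenever the graph is mutated.
    size_t[][] neighbourCache;
    bool[] cacheValid;

    size_t[] cachedNeighbours(in size_t v)
    {
        if (!cacheValid[v])
        {
            neighbourCache[v] = null;
            foreach (n; neighbours(v))   // the range-based version above
                neighbourCache[v] ~= n;
            cacheValid[v] = true;
        }
        return neighbourCache[v];
    }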

The speed benefits are immediate:

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 64.29      0.14     0.14 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMFNaNbNfymZAm
 33.33      0.21     0.07   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFNaNbNfKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  2.38      0.21     0.01
_D6dgraph5graph13__T5GraphVb0Z5Graph13incidentEdgesMFNaNbNfymZAm
---------------------------------------------------------------------------------

In fact there's a double benefit here: (i) it's more efficient to write the
neighbour values into an array than to use a map, and (ii) because we're caching
them, we only have to calculate the neighbours once.  The extra memory required
may be bearable given that the overall constraints on network size are probably
going to be down to something other than RAM.

However, I think it might be worth avoiding this if possible.  So I'd like to be
sure that I understand -- first, are map, chain, etc. likely to become more
efficient in terms of speed and memory use?  I recall that there has been some
concern over unnecessary allocations, and I'm not sure if std.algorithm.map or
std.range.chain are victims here.  Alternatively, there might be some benefit in
replacing the maps, chains, etc. with custom data structures that provide
_exactly_ what is wanted.
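
For instance, one could imagine a small hand-written input range that walks the
two index slices for a vertex directly, rather than composing iota, map and chain
(again only a sketch, under the same assumptions as above):

    struct NeighbourRange
    {
        // slices of the two sorted edge-index arrays covering this vertex,
        // plus the vertex arrays in which to look up the neighbour IDs
        const(size_t)[] idxA, idxB;
        const(size_t)[] vertA, vertB;

        @property bool empty() const
        {
            return idxA.length == 0 && idxB.length == 0;
        }

        @property size_t front() const
        {
            return idxA.length ? vertA[idxA[0]] : vertB[idxB[0]];
        }

        void popFront()
        {
            if (idxA.length) idxA = idxA[1 .. $];
            else             idxB = idxB[1 .. $];
        }
    }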

Finally, neither the map approach nor the cache seems as fast in benchmarking as
the simple ad-hoc approach I might use for a graph in simulation, such as,

    size_t[][] neighbours;

My instinct is that the latter approach will start to choke when you try to build
a really big graph, because of the many, many reallocations required to build the
structure.  It will probably also become unwieldy once you want to allow arbitrary
collections of edge and vertex properties (weight, colour, etc.), because it
doesn't allow for a very elegant way to store and retrieve those properties.
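
That is, building the structure goes roughly like this (a sketch; numVertices,
heads and tails are illustrative names), with an append per edge end and hence
potentially many reallocations as the per-vertex arrays grow:

    size_t[][] neighbours;
    neighbours.length = numVertices;
    foreach (e; 0 .. heads.length)
    {
        neighbours[heads[e]] ~= tails[e];   // each append may reallocate
        neighbours[tails[e]] ~= heads[e];   // undirected: record both ends
    }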

Suffice it to say that there is an interesting range of compromises to make
depending on one's needs, and it's quite likely there's a benefit in providing
different graph data structures depending on what scale of graph (and what kind
of graph properties) you want to work with.

Anyway, although this is a very limited bit of profiling work, I'd be very
grateful for anyone's thoughts or insight on the issues raised here.

Thanks & best wishes,

     -- Joe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: neighbours.d
Type: text/x-dsrc
Size: 1418 bytes
Desc: not available
URL: <http://lists.puremagic.com/pipermail/digitalmars-d-learn/attachments/20130730/04bf04be/attachment.d>

