D graph library

H. S. Teoh hsteoh at quickfur.ath.cx
Wed May 8 09:12:15 PDT 2013


On Wed, May 08, 2013 at 02:34:32PM +0200, Joseph Rushton Wakeling wrote:
> On 05/06/2013 07:09 PM, H. S. Teoh wrote:
> > So what I envision as far as graph libraries in D are concerned is
> > something like the generic range API, where there is a fixed
> > convention as to what constitutes a graph/digraph/etc., and the
> > library itself is a template library that is able to handle any data
> > type that conforms to the requirements.
> 
> So, let's have a bit of a think about what you really, really want in
> a graph interface.
> 
> Something along the lines of the following:
> 
> 	nodes   /* @property that returns a RandomAccessRange to the collection
>                    of nodes.  You could also call this "vertices" or
>                    "vertexes"; "nodes" is shorter.  Should be O(1). */
> 
>         edges   /* Returns a RandomAccessRange to the collection of edges, i.e.
>                    a list of node pairs.  Should be O(1). */
> 
> Note that these properties should be O(1) to return, but obviously
> traversing them is O(N) [O(V)] or O(E) respectively, where N [V] is
> the number of nodes [vertices] and E is the number of edges.

Hmm. Is it necessary to provide random access ranges? Some graph
algorithms need random access, but others may not.


> Note also that both the above automatically define the count of
> nodes/edges, but we could of course add courtesy functions to give
> them directly.

Something along the lines of std.range's hasLength, I guess? Makes
sense.
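Such a trait could be sketched like this (the names hasNodeCount and
nodeCount are made up for illustration, not an agreed API):

```d
// Sketch of a hasLength-style trait for graphs; hasNodeCount and
// nodeCount are hypothetical names, not part of any agreed API.
enum hasNodeCount(G) = is(typeof(G.init.nodeCount) : size_t);

struct AdjacencyList
{
    size_t[][] adj;   // adj[i] holds the neighbours of node i
    @property size_t nodeCount() const { return adj.length; }
}

struct ImplicitGraph {}   // nodes generated on demand, no stored count

static assert( hasNodeCount!AdjacencyList);
static assert(!hasNodeCount!ImplicitGraph);
```

Algorithms could then `static if` on the trait to pick an O(1) count or
fall back to walking the node range.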


> Finally, a reasonable question is whether they should return ranges of
> actual nodes or edges, or ranges of node/edge IDs.

I think the distinction is mostly superfluous. One could always wrap IDs
in a struct:

	struct NodeID {
		int id;
		MyGraph* g;

		auto neighbors() { return g.nodesImpl[id].neighbors(); }
		... // other wrappers around node methods
	}

So I'd say just return nodes and edges. They can be actual nodes and
edges, or IDs equipped with the requisite methods, or references to
node/edge objects (e.g. class instances -- a reference is essentially a
kind of ID that just happens to be unique across all objects and
directly mappable to a memory address), etc.

Another thought I have is whether we should require a method that
enumerates all edges. The number of edges in a graph can grow very
quickly w.r.t. the number of nodes, and some graphs may be so complex
that the edges can't all be stored in memory at once, but only computed
per-node. In such cases, requiring edge enumeration may be
counterproductive. I'm inclined towards leaving out edge enumeration
unless some specific algorithms require it (even then, we could just
make it a requirement for those particular algorithms). I admit my
knowledge of graph algorithms is rather limited, but AFAIK none of them
require edge enumeration on the entire graph.
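As a concrete case, here is a hypothetical implicit graph -- an unstored
W-by-H grid whose neighbours are computed per query, so no global edge
list ever needs to exist (GridGraph and its members are illustrative
names only):

```d
import std.algorithm : filter, map;
import std.range : only, zip;

// Hypothetical implicit graph: a width-by-height grid whose edges are
// computed per node on demand, never enumerated globally.
struct GridGraph
{
    int width, height;   // node n sits at (n % width, n / width)

    auto neighbors(int n) const
    {
        immutable x = n % width, y = n / width;
        // Candidate neighbours left/right/up/down, each paired with a
        // flag saying whether it is inside the grid.
        return zip(only(n - 1, n + 1, n - width, n + width),
                   only(x > 0, x < width - 1, y > 0, y < height - 1))
               .filter!(t => t[1])
               .map!(t => t[0]);
    }
}
```

For a 3x3 grid, GridGraph(3, 3).neighbors(4) (the centre node) yields 3,
5, 1, 7, while the corner node 0 yields only 1 and 3 -- all without ever
materializing the edge set.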


> You'd also need to define various characteristics such as isDirected,
> isMultiGraph, isMixedGraph, isSimpleGraph, isWeighted, etc.  Ideally
> most of these would be compile-time constraints but might also be
> implemented as runtime queries (e.g. isBipartiteGraph could be a
> compile-time constraint but equally one could at runtime query a graph
> without this constraint and prove that it could indeed be divided into
> two disjoint sets which satisfy the conditions of bipartitivity
> [bipartiteness?]).

Does it make sense to have a common interface between, say, directed and
non-directed graphs? That is, are there algorithms that can operate on
both, or do most of the interesting algorithms only work on one or the
other? I'm just wondering whether it's worth the trouble of compile-time
(or runtime) checks when we could simply have two unrelated graph
types.

Also, I'm not sure about the usefulness of isBipartiteGraph, unless it
entails some specific graph method (say, one that returns nodes grouped
by which partition they lie in) that takes advantage of this fact. OTOH,
maybe it does serve the purpose of letting the user promise that the
input graph will be bipartite, so if it isn't and the algorithm blows
up, it's not our fault. :-P


> Then for nodes/vertices you'd want to define a number of properties:
> 
>        neighbours   /* OK, OK, neighbors.  Purely in order to save keystroke
>                        time for programmers.  Returns a RandomAccessRange to
>                        the collection of nodes that the given node links to.
>                        For directed graphs you might want to add a template
>                        qualifier of "out" (default) or "in" to allow getting
>                        neighbours that this node points to, or neighbours that
>                        point to this node.  Should be O(1) [but maybe needs to
>                        be O(d)?] */

Personally, I'd shorten it to ngbrs, but most people here would
disagree. :-P


>        edges        /* Returns RandomAccessRange of edges that connect to this
>                        node.  For a directed graph you might want again a
>                        template qualifier of "out" (default), "in" or "all".
>                        Should be O(1) [but maybe needs to be O(d)?] */

I'm not so sure about providing both in and out neighbours / edges.  In
some cases this may be very inefficient. I think it may be best simply
to adopt the convention that a directed graph will always return out
neighbours, and undirected graphs will always return all neighbours.  We
can always provide an optional method for inverting a graph where this
is useful / necessary.


> Again, these should be ideally be O(1) to return but are obviously
> O(d) to traverse where d is the degree of the node.  Note that degree
> (and in/out degree for directed graphs) can be implicitly calculated
> from the above, but we could add courtesy functions to this extent.
> 
> It should also be possible to define arbitrary collections of node
> properties, such as colour, type, textual labels, numerical properties
> etc., and to query if a node has them (preferably at compile-time).

Hmm. I'm not sure about requiring such attributes. I'd go for a more
generic approach, such as if an algorithm requires some attribute (say
weight), then it could take a compile-time parameter that maps .weight
to whatever actual attribute is in the graph node / edge. For example:

	auto findShortestPath(alias weight, G)(G graph, G.node start,
		G.node end) { ... }

	struct Graph {
		struct Node { ... }
		alias node = Node;
		struct Edge {
			double length;
			@property double euclideanDistance();
			...
		}
	}

	Graph g;
	Graph.Node x, y;
	auto pathByLength = findShortestPath!"length"(g, x, y);
	auto pathByDist = findShortestPath!"euclideanDistance"(g, x, y);

This allows the same graph to be evaluated multiple ways, as this
example shows, without needing to create wrappers or proxy objects.

I'm tentatively considering defining the node type within the graph
itself (G.node above), as a way of reducing the number of compile-time
parameters, but this may not be necessary, and may be limiting in some
cases. Have to think about it a bit more.
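As a proof of concept for the mapping itself, a compile-time string can
be resolved to a concrete member with __traits(getMember, ...); both
totalCost and this Edge type are made-up names for illustration:

```d
// Hypothetical demonstration of mapping a compile-time string onto a
// concrete edge attribute; totalCost and Edge are illustrative only.
struct Edge
{
    double length;
    // Stand-in derived attribute, just to show properties work too.
    @property double euclideanDistance() const { return length * 2; }
}

double totalCost(string weight, R)(R edges)
{
    double sum = 0;
    foreach (e; edges)
        sum += __traits(getMember, e, weight);   // resolved at compile time
    return sum;
}
```

With this, totalCost!"length"([Edge(1), Edge(3)]) sums the stored
lengths, while totalCost!"euclideanDistance" sums the computed property,
with no wrapper or proxy object in sight.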


> You'd also want to define functions on node [node ID?] pairs, such as
> a function that returns the edge [edge ID?] linking two nodes if they
> _are_ linked, and null otherwise.

I would evaluate the graph algorithms we want to implement before adding
something like this. I think it's best to keep the API as minimal as
possible. Phobos ranges, for example, while very powerful, suffer a bit
from having too many possible parameters, leading to complicated
interactions that increase code complexity (hence Jonathan's push to
simplify the range API as much as possible).


> Similarly for edges you might want to define various properties:
> 
>       head, tail     /* the origin and destination nodes respectively */
> 
>       isDirected     /* is this a directed link? */
> 
>       isLoop         /* does this edge connect a node to itself? */
> 
>       isWeighted     /* does this edge have an associated weight? */
> 
>       weight         /* ... if it does ... */
> 
> And again, you'd want to be able to define arbitrary collections of
> edge properties besides "weight", and be able to query (preferably at
> compile-time) if an edge has them.

I think it's better if each algorithm specifies the set of properties it
expects to be present, and uses compile-time parameters to map them onto
concrete graph properties.

Also, I'm not sure about isWeighted... I would expect most graph
algorithms either to not care about weights (in which case the entire
graph can be weightless) or to require weights on all edges (in which
case isWeighted is redundant). Or am I ignorant of algorithms that
process weights that exist on some edges but not others?


> OK, those are some VERY provisional thoughts (on a very tentative
> proposal:-) But I'd be happy to receive feedback, particularly on some
> of the choices (e.g.  should we consider it essential for the ranges
> in all circumstances to be RandomAccessRanges or will it help to make
> some of them simpler?  Should the ranges be of actual node/edge
> entities or node/edge IDs? etc.)

I think it would help discussion if we have a list (even if tentative)
of algorithms we'd like to implement, along with what each algorithm
expects from the graph (weights, colors, directedness, etc.). Then we
can evaluate which properties/functions/etc. are really necessary, and
which are just icing on the cake that can be omitted with no real
drawbacks. I think the more minimal the API the better -- it makes the
resulting module easier to understand and use, and makes it more likely
we'll actually finish implementing it. :-P

Anyway, some further thoughts: we could have graph adaptors in the same
vein as the range constructors in std.range, which allow one to wrap a
graph in another form -- e.g., a transposeGraph(G) that returns a graph
with the sense of its edges reversed, etc.
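A minimal sketch of such an adaptor (the member names nodeCount and
neighbors are assumptions, and the linear scan per query is purely
illustrative -- a real adaptor would build a reverse index once):

```d
import std.algorithm : canFind, filter;
import std.range : iota;

struct AdjacencyList
{
    size_t[][] adj;   // adj[i] holds the out-neighbours of node i
    @property size_t nodeCount() { return adj.length; }
    auto neighbors(size_t n) { return adj[n]; }
}

// Wraps a directed graph so that out-neighbours of the wrapper are the
// in-neighbours of the wrapped graph.
struct Transposed(G)
{
    G g;
    @property auto nodeCount() { return g.nodeCount; }
    auto neighbors(size_t n)
    {
        auto gg = g;   // local copy for the closure below
        // Naive O(V * E) scan: every node m that points at n.
        return iota(g.nodeCount).filter!(m => gg.neighbors(m).canFind(n));
    }
}

auto transposeGraph(G)(G g) { return Transposed!G(g); }
```

For adj = [[1, 2], [2], []], transposeGraph yields 0 and 1 as the
neighbours of node 2, i.e. exactly the nodes that point at it in the
original.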


T

-- 
Computers shouldn't beep through the keyhole.

