D graph library

H. S. Teoh hsteoh at quickfur.ath.cx
Fri May 10 10:52:30 PDT 2013


On Wed, May 08, 2013 at 07:25:26PM +0200, Joseph Rushton Wakeling wrote:
> On 05/08/2013 06:12 PM, H. S. Teoh wrote:
> > Hmm. Is it necessary to provide a random access range? For some
> > graph algorithms that's necessary, but for others, it may not be.
> 
> Agree that it's debatable.  It's probably desirable but not necessary.

I suppose we can make it an optional thing, such that algorithm A may
ask for only an InputRange (and a graph with bidirectional range of
nodes will still work) but algorithm B may ask for more, say a
ForwardRange or RandomAccessRange. Each algorithm will require the
minimum necessary to perform efficiently (though of course it can be
special-cased to take advantage of additional features of a higher
range, if that is available in a given instantiation).

In fact, now that I think of it... most of the features you listed can
arguably be optional: only a few algorithms (none that I know of,
actually) require enumeration of edges, some algorithms may only require
an input range of nodes, each of which gives an input range of edges,
etc. Would it make sense to define a set of graph properties (e.g., via
templates à la std.range.is{Input,Forward,Bidirectional,...}Range, or
hasLength, hasSwappableElements, etc.), and have each algorithm require
whatever minimal subset of them is necessary to do the work? Because it
makes little sense to require, say, a graph that returns a total count
of nodes, if a given algorithm never needs to use that information
anyway. This way, a structure for which the total number of nodes is
expensive to compute can still be used with that algorithm.

IOW, we can have things like hasRandomAccessNodes, hasNodeCount,
hasEdgeCount, canEnumerateEdges, etc., and each algorithm will use a
subset of these in their signature constraints.


> >> Note also that both the above automatically define the count of
> >> nodes/edges, but we could of course add courtesy functions to give
> >> them directly.
> > 
> > Something along the lines of std.range's .hasLength, I guess? Makes
> > sense.
> 
> If nodes() and edges() return RandomAccessRanges then they will
> automatically have the .length property.  So you could add a courtesy
> function like nodecount() or edgecount() that in this case would just
> wrap nodes.length but for other cases might calculate the node or edge
> count in a different way.

Makes sense.
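
(Sketch of that idea, with hypothetical graph API names: a free
nodecount() could use .length when the range has it, and fall back to
walking the range otherwise.)

	import std.range : hasLength, walkLength;

	// O(1) if nodes has .length, an O(n) walk otherwise.
	auto nodecount(G)(G graph)
	{
		static if (hasLength!(typeof(graph.nodes)))
			return graph.nodes.length;
		else
			return graph.nodes.walkLength;
	}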


> >> Finally, a reasonable question is whether they should return ranges
> >> of actual nodes or edges, or ranges of node/edge IDs.
> > 
> > I think the distinction is mostly superfluous. One could always wrap
> > IDs in a struct
> 
> Well, I was considering whether making it a range of IDs might be more
> flexible with respect to underlying data structures.  I may be falling
> into "premature optimization" here. :-)

Hmm. I'm undecided here until I see some actual algorithms in which this
choice makes a difference. I don't want to fall into premature
generalization here. ;-)


> > Another thought I have is whether we should require a method that
> > enumerates all edges. The number of edges in a graph can grow very
> > quickly w.r.t. the number of nodes, and some graphs may be so
> > complex that the edges can't all be stored in memory at once, but
> > only computed per-node. In such cases, requiring edge enumeration
> > may be counterproductive. I'm inclined towards leaving out edge
> > enumeration unless some specific algorithms require it (even then,
> > we could just make it a requirement for those particular
> > algorithms). I admit my knowledge of graph algorithms is rather
> > limited, but AFAIK none of them require edge enumeration on the
> > entire graph.
> 
> Well, if we accept that we're going to be offering the user a range
> that lists all edges, the underlying code might handle that in many
> different ways -- just reading from an array, caching input from a
> database, or whatever else is appropriate.  That shouldn't make it
> difficult to get the overall number of edges -- that's always going to
> be a stored number of one kind or another, whether it's an array
> length or a record of the number of fields in a database, or whatever.

Not if the graph is computed on the fly -- e.g. chess analysis
applications, where the complete graph is infeasible to compute.


> > Does it make sense to have a common interface between, say, directed
> > and non-directed graphs? That is, are there algorithms that can
> > operate on both, or do most of the interesting algorithms only work
> > on one or the other? I'm just wondering if it's worth the trouble of
> > doing compile-time (or runtime) checks if we could just have two
> > unrelated graph types.
> 
> Reasonable question.  As far as I can tell most graph libraries offer
> a broadly common interface even though the underlying data structure
> may vary quite a bit.
> 
> The practical implementation of those checks might well be simply
> reading an immutable boolean variable in the graph type.

In that case, I wonder if it's even necessary to unify the two graph
types. Why not just have two separate types and have the type system
sort out compatibility with algorithms for us?
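
E.g. (a hypothetical sketch), with two unrelated types the constraint
is just the parameter type itself:

	struct DirectedGraph { ... }
	struct UndirectedGraph { ... }

	// Only meaningful for digraphs; passing an UndirectedGraph
	// simply fails to compile.
	auto topologicalSort(DirectedGraph g)
	{
		...
	}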


> > Also, I'm not sure about the usefulness of isBipartiteGraph, unless
> > it entails some specific graph method (say a method that returns
> > nodes grouped by which partition they lie in) that takes advantage of
> > this fact. OTOH, maybe it does serve the purpose of letting the user
> > promise that the input graph will be bipartite, so if it isn't and
> > the algorithm blows up, it's not our fault. :-P
> 
> Yes, that would be the purpose.  The bipartite graph might also be
> just a different data type.

Are there any algorithms that need to do something different depending
on whether the input graph is bipartite or not? Just wondering if it's
worth the trouble of introducing an additional field (it may well be
necessary -- I admit my knowledge of graph algorithms is rather spotty).


> > Personally, I'd shorten it to ngbrs, but most people here would
> > disagree. :-P
> 
> I don't like those kinds of abbreviations in general, I like English.
> :-P

Thought so. :)


> > I'm not so sure about providing both in and out neighbours / edges.
> > In some cases this may be very inefficient. I think it may be best
> > simply to adopt the convention that a directed graph will always
> > return out neighbours, and undirected graphs will always return all
> > neighbours.  We can always provide an optional method for inverting
> > a graph where this is useful / necessary.
> 
> Maybe, but in some cases you want to be able to quickly get the
> in-degree or whatever.  Inverting the graph is going to be an
> expensive method that will require a lot of calculation (and maybe
> also a lot more memory, if you wind up storing both versions).
> 
> I'd much rather have the option to get in-links and say, "Hey, in some
> cases this may be really expensive" than to have no option at all by
> design.

Right. Which brings me back to my idea that perhaps these graph
properties should be optional, or perhaps we should have some kind of
hierarchy of graph types (à la input range, forward range, bidirectional
range, etc.), so that individual algorithms can choose which features
are mandatory and which are optional.

(I really dislike algorithms that are artificially restricted just
because the author decided that feature X is required, no matter if the
algorithm doesn't even use it.)


On Fri, May 10, 2013 at 02:04:28PM +0200, Joseph Rushton Wakeling wrote:
> On 05/08/2013 06:12 PM, H. S. Teoh wrote:
> > On Wed, May 08, 2013 at 02:34:32PM +0200, Joseph Rushton Wakeling wrote:
> >> Finally, a reasonable question is whether they should return ranges
> >> of actual nodes or edges, or ranges of node/edge IDs.
> > 
> > I think the distinction is mostly superfluous. One could always wrap
> > IDs in a struct:
> 
> I'm being daft.  Of _course_ the ranges should be of nodes or edges,
> because you want to be able to do things like
> 
> 	foreach(n; g.nodes) {   // g is the graph ... :-)
> 		foreach(m; n.neighbours) {
> 			...
> 		}
> 	}

True. Though another way of designing the API would be:

	foreach (n; g.nodes) {
		foreach (m; g.neighbours(n)) {
			...
		}
	}

which would work with IDs. And arguably, this may be more efficient in
some cases.


> >> It should also be possible to define arbitrary collections of node
> >> properties, such as colour, type, textual labels, numerical
> >> properties etc., and to query if a node has them (preferably at
> >> compile-time).
> > 
> > Hmm. I'm not sure about requiring such attributes. I'd go for a more
> > generic approach, such as if an algorithm requires some attribute
> > (say weight), then it could take a compile-time parameter that maps
> > .weight to whatever actual attribute is in the graph node / edge.
> > For example:
> > 
> > 	auto findShortestPath(alias weight,G)(G graph, G.node start,
> > 		G.node end) { ... }
> > 
> > 	struct Graph {
> > 		struct Node { ... }
> > 		alias node = Node;
> > 		struct Edge {
> > 			double length;
> > 			@property double euclideanDistance();
> > 			...
> > 		}
> > 	}
> > 
> > 	Graph g;
> > 	Graph.Node x, y;
> > 	auto pathByLength = findShortestPath!"length"(g, x, y);
> > 	auto pathByDist = findShortestPath!"euclideanDistance"(g, x, y);
> > 
> > This allows the same graph to be evaluated multiple ways, as this
> > example shows, without needing to create wrappers or proxy objects.
> 
> I don't think what I was proposing was at odds with what you describe
> here.  For sure it's helpful to be able to map attributes as you
> describe here.  What I mean is simply it's helpful to be able to
> define arbitrary properties (and property names) that will be
> associated with edges, not that edges should be _obliged_ to have a
> certain set of associated parameters.

I'm kinda torn about that one. Should the graph structure itself keep
track of these annotations (which is basically what these properties
are)? Or should it be the algorithm that tracks them? I can imagine
cases for which it's useful to do something like this:

	auto myAlgo(G)(G graph) if (isGraph!G) {
		struct Annotations {
			float weight;
			float length;
			Color color;
		}
		Annotations[typeof(G.init.nodes.front)] annotations; // keyed by node
		...
		foreach (n; graph.nodes) {
			...
			annotations[n].weight = 1.0;
			...
		}
		while (...) {
			auto n = graph.nodes[someId];
			if (annotations[n].weight == 1.0) {
				...
			}
		}
	}

The advantage of not storing these annotations in G is that it allows
myAlgo to apply to a far wider variety of graph-like things. Say very
large graphs that are computed on-the-fly (e.g. chess positions), for
which it would be infeasible to store arbitrary data with every node or
edge.


> > I'm tentatively considering defining the node type within the graph
> > itself (G.node above), as a way of reducing the number of
> > compile-time parameters, but this may not be necessary, and may be
> > limiting in some cases. Have to think about it a bit more.
> 
> Can you expand on this?

I was just thinking it would be nice if we could write:

	auto pathLength(G)(G graph, G.NodeType start, G.NodeType end) {
		...
	}

instead of:

	auto pathLength(G,N)(G graph, N start, N end)
		if (is(N == typeof(G.nodes.front)))
	{
		...
	}

Or worse:

	auto pathLength(G,N)(G graph, typeof(G.nodes.front) start, typeof(G.nodes.front) end)
	{
		...
	}

The first example is the most self-documenting and pleasant to read,
IMO. But now that I think about it, perhaps the best way is just:

	template NodeType(G) if (isGraph!G)
	{
		alias NodeType = typeof(G.init.nodes.front);
	}

	auto pathLength(G)(G graph, NodeType!G start, NodeType!G end)
	{
		...
	}

OK, so I retract what I said earlier. :)


[...]
> > Also, I'm not sure about isWeighted... I would expect most graph
> > algorithms to either not care about weights (in which case the
> > entire graph can be weightless) or require weights on all edges (in
> > which case isWeighted is redundant). Or am I ignorant of algorithms
> > that process weights that only exist in some edges but not others?
> 
> Consider path length -- e.g. a path v1 -> v2 -> v3 has length 2 for an
> unweighted graph, length w_{1, 2} + w_{2, 3} for a weighted graph.
> That in turn has implications for shortest-path-length calculations,
> and the optimal algorithm to calculate shortest paths thus depends on
> whether you're dealing with the weighted or unweighted case.

Ah, I see.
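
(I.e., roughly -- assuming a hypothetical isWeightedGraph trait and a
g.weight(a, b) edge-weight accessor:)

	// Path length of v1 -> v2 -> v3:
	//   unweighted: hop count             => 2
	//   weighted:   w(v1,v2) + w(v2,v3)
	auto pathLen(G, N)(G g, N[] path)
	{
		static if (isWeightedGraph!G)
		{
			double len = 0;
			foreach (i; 1 .. path.length)
				len += g.weight(path[i-1], path[i]);
			return len;
		}
		else
			return path.length - 1;
	}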


> All of this in turn impacts on any metrics or algorithms that depend
> on distances between nodes.  For example, today I'm going to be
> implementing [*] an algorithm to calculate betweenness centrality:
> http://www.cs.ucc.ie/~rb4/resources/Brandes.pdf  This does need to be modified,
> albeit trivially, for the weighted case.
>
> [* Inside some custom-written code for a very particular application,
> I'm afraid.  This is why I want to create a general-purpose graph
> library, to stop me having to do things like this :-P ]

Right. For maximum generality, I'm thinking something along these lines:

	auto findPathLength(G)(G graph, NodeType!G start, NodeType!G end)
	{
		... // weightless version
	}

	auto findPathLength(string weightField, G)(G graph, NodeType!G start, NodeType!G end)
	{
		// weighted version
		...
		auto w = __traits(getMember, node, weightField);
		...
	}

So basically, the user specifies which property should be treated as
weight, and that automatically selects the weighted version of the
algorithm. If omitted, the unweighted version is selected instead. This
lets us do things like:

	void main() {
		FlowGraph g;
		...
		auto numHops = findPathLength(g, n1, n2);
		auto totalFlow = findPathLength!"flowRate"(g, n1, n2);
		auto totalDistance = findPathLength!"distance"(g, n1, n2);
		...
	}


> > I think it would help discussion if we have a list (even if
> > tentative) of algorithms we'd like to implement, along with what
> > each algorithm expects from the graph (weights, colors,
> > directedness, etc.). Then we can evaluate which
> > properties/functions/etc. are really necessary, and which are just
> > icing on the cake that can be omitted with no real drawbacks. I
> > think the more minimal the API the better -- it makes the resulting
> > module easier to understand and use, and makes it more likely we'll
> > actually finish implementing it. :-P
> 
> Agree about minimal API, but I think what I've described previously is
> pretty minimal ... :-)
> 
> As for algorithms, igraph has a very substantial list:
> http://igraph.sourceforge.net/doc/html/igraph-Structural.html

Thanks! I'll take a look when I have time.


> We could probably "scratch our own itch" to a degree (ha!) in terms of
> choosing what to implement, but stuff like shortest path calculation,
> centrality measures and so on seem like good starting points.
[...]

Yeah, shortest path is probably the most basic class of algorithms to
implement. Though there *are* quite a variety of them -- Dijkstra, A*,
etc.. I've been looking into A* recently, since for some applications
the graph is simply too large for Dijkstra to be feasible.


T

-- 
It won't be covered in the book. The source code has to be useful for
something, after all. -- Larry Wall

