# Serializing a computationally expensive graph

I’d like to recount an experience where I think I came up with a good solution using Scala. The implementation uses classical tail recursive approach, because that seems most elegant to me. However, I’m open to criticism about whether the code might be better (by some definition of better) using some library. I don’t really see how to do this with `fold` or `reduce`.

What is the problem to solve?

I have a graph (graph in the mathematical sense). The graph consists of a finite set of vertices and a finite set of directed, labeled edges. However, I don’t yet know the graph. Rather I have a starting vertex, and a function which when given a vertex will return a set of edges, each having a label and a destination vertex.

So in Scala I have a starting vertex `v0` of type `V`, and I have a function `edges` of type `V=>Seq[(L,V)]`. So `V` is the type of the vertex and `L` is type type of the label.

Now, since the `edges` function is computationally difficult to compute (for my application), and because I’d like to easily be able to iterate through all the vertices, I’d like to generate a data structure which helps me do that. I.e., a data structure which contains all the vertex/edge information which the `edges` function is able to compute.

Furthermore, I would like to use an array with integer indices in the end so I can access any vertex in constant time.

So my data structure should contain an array, `vs` of vertices so that `vs(n)` is vertex `n`. And an array, `es`, of edge sequences so that `es(n)` is a sequence of all edges with origin `n`.

My final code is given below. Several observations about the experience of writing this function.

1. I had to think a couple of days how to factor the problem. What made it doable (for me) was forgetting the specific types of in the graph, and the specific edge generation function, and just think about it as an abstract graph.
2. I’m using parameterized types, not make the function reusable, but simply to make it more understandable.
3. Internal to the function there is a tail recursive helper function which uses `Map` to associate integers with vertices, and vertices with integers. This is easier for debugging, but they `Map[Int,X]` objects are converted to `Vector[X]` in the final return value
4. I use a single tail recursive helper function to achieve a loop-within-a-loop iteration. The outer loop recurs on vertices; the inner loop recurs on edges of that vertex.
5. the graph is neither traversed in BFS nor DFS order, but rather in random order. Each time we simply traverse into the vertex which is in one `Map` but not in the other. So finally all vertices are in both `Map`s, as they are finitely many such vertices. This code is concise, but sadly not efficient.
``````  def traceGraph[V, L](v0: V, edges: V => Seq[(L, V)]): (Vector[V], Vector[Seq[(L, Int)]]) = {
val s0: Seq[(L, Int)] = Seq()

@tailrec
def recur(v: V,
i: Int,
es: List[(L, V)],
intToV: Map[Int, V],
vToInt: Map[V, Int],
m: Map[Int, Seq[(L, Int)]]): (Map[Int,V], Map[Int,Seq[(L, Int)]]) = {
es match {
case (label, v1) :: lvs if !vToInt.contains(v1) =>
recur(v, i + 1, es, intToV + (i -> v1), vToInt + (v1 -> i), m)
case (label, v1) :: lvs =>
val currentEdges: Seq[(L, Int)] = m.getOrElse(vToInt(v), s0)
recur(v, i, lvs, intToV, vToInt, m + (vToInt(v) -> conj(currentEdges, label -> vToInt(v1))))
case Nil =>
// TODO, certainly, there is a way to compute v2 without a linear search.
//  we want to find a vertex which we have learned exists, but whose
//  edges have not yet been computed.
intToV.keys.find(k => !m.contains(k)).map(intToV) match {
case Some(v2) => recur(v2, i, edges(v2).toList, intToV, vToInt, m)
case None => (intToV, m)
}
}
}

val (vertexMap,edgeMap) = recur(v0, 1, edges(v0).toList, Map(0 -> v0), Map(v0 -> 0), Map())
(Vector.tabulate(vertexMap.size)(vertexMap),
Vector.tabulate(edgeMap.size)(edgeMap))
}
``````

One optimization which can be made pretty easily is to remove the linear search

``````intToV.keys.find(k => !m.contains(k)).map(intToV)
``````

by simply traversing the vertices in increasing order, first vertex 0, then vertex 1, and so on.
We can detect when the vertex number is larger than the largest vertex by comparing to the variable `i` in the recursive function, which is alway (coincidentally) equal to the number of vertices already visited.

This optimisation saves a linear search, but on the other hand is a pretty cryptic termination condition.

Just to clarify, given vertices `A` and `B`, `(A, B, Label1)` is distinct and desirable from `(B, A, Label2)`? The assumption comes from your having mentioned the “…directed labeled edges…”.

And secondly, are there any requirements (working memory constraints) for the final target result be produced in a single pass? Or are two passes acceptable? IOW, can an interim result of `Map[(V, V), L]` be produced, and then the two `Arrays` specified as your desired result be derived/produced from that `Map`?

Yes, the edges are directed and labeled. Thus from A to B is different from B to A, and labeled X is different than labeled Y. On the other hand self loops are allowed. And if two parallel edges have the same label, that is indeed something I didn’t define. I.e., are two edges from A to B labeled L considered the same edge or different edges? Yes, I haven’t specified that. I need to think about it. For the particular application I’m using the function in, they are considered the same edge.

Here’s an optimization, which I think is more cryptic, but eliminates the redundant linear search.

``````  def traceGraph[V, L](v0: V, edges: V => Seq[(L, V)]): (Vector[V], Vector[Seq[(L, Int)]]) = {
val s0: Seq[(L, Int)] = Seq()
// The local function, recur, traverses the graph generated by the function, edges.
// The traversal is in numerical order.  The initial vertex is 0, the next
// vertex generated is 1.  Once the edges of vertex 0 are generated, several
// vertices may have been generated, but vertex 1 is traversed next, then
// 2, 3, ...
// The variable, nextAvailable, keeps track of the next available
//    vertex number.  When the edges function returns a new v which
//    has not yet been numbered, then it gets assigned nextAvailable.
// nextAvailable gets incremented (via recursion) whenever a new vertex
//    is encountered, and associated with the current value of nextAvailable.
// The recursion terminates when the index of v is == to nextAvailable
//    because that means no new vertices have been encountered, thus
//    all vertices have been visited.
@tailrec
def recur(v: V,
nextAvailable: Int,
es: List[(L, V)],
intToV: Map[Int, V],
vToInt: Map[V, Int],
m: Map[Int, Seq[(L, Int)]]): (Map[Int,V], Map[Int,Seq[(L, Int)]]) = {

es match {
case (label, v1) :: lvs if !vToInt.contains(v1) =>
// if we are seeing v1 for the first time, then register it
//   in intToV and in vToInt.
recur(v, nextAvailable + 1, es, intToV + (nextAvailable -> v1), vToInt + (v1 -> nextAvailable), m)
case (label, v1) :: lvs =>
val currentEdges: Seq[(L, Int)] = m.getOrElse(vToInt(v), s0)

recur(v, nextAvailable, lvs, intToV, vToInt,
m + (vToInt(v) -> conj(currentEdges, label -> vToInt(v1))))
case Nil =>
val next = vToInt(v)+1
// The termination condition occurs if next == nextAvailable, because
//   this means now new vertices have been encountered which have not
//   yet been treated.
if (next < nextAvailable) {
val v2 = intToV(next)
recur(v2, nextAvailable, edges(v2).toList, intToV, vToInt, m)
}
else
(intToV, m)
}
}

val (vertexMap,edgeMap) = recur(v0, 1, edges(v0).toList, Map(0 -> v0), Map(v0 -> 0), Map())
(Vector.tabulate(vertexMap.size)(vertexMap),
Vector.tabulate(edgeMap.size)(edgeMap))
}
``````

This uses function `conj` defined here.

I’m not sure what the memory requirements are for the construction phase. I suppose the entire process (for which this function is only a part) might be extremely memory intensive. However, the final result represents a finite-state-machine (DFA). The transitions enumerate transition conditions and destination states. So the fact that destination states are integers, which also happen to be indexes into the array of that actual state, means the DFA can be simulated efficiently.

1 Like

That is a good point, it is probably the case that the final arrays can be created directly without needing the intermediate `Map`s. It could be a good idea in the future to refactor to create those vectors directly. Before doing this I’d want a good test suite to verify that the translation into terser code hasn’t introduced a bug.

1 Like

Agreed. My preference on problems like this is to first ensure proper function (sacrificing efficiency). And then optimize for efficiency (against which the tests validating the proper function will increase confidence that optimizations don’t break test assumptions). And optimizing for efficiency might end up including a complete re-implementation of the interface.

Hence, my asking about the `Map[(V, V), L]`. The recursive function to produce that interim `Map` is substantially simpler such that it likely can even be accomplished with the standard library (already deeply validated) iterators. Same for the functions to produce the two desired `Array`s.

It sounds like you have already invested quite a bit in the original solution pattern. So, it might not be useful to re-think and refactor to this simpler case. However, if you end up on enough “oops, I gotta touch this, and fix that, etc.” tangents (which seem inevitable when I attempt to combine functionality with efficiency in a single pass), then you might consider a quick (less than 30m) run at this simpler two stage approach to see if the complexity reduction helps you get past your block.

1 Like

BTW @ chaotic3quilibrium, I finally refactored the code to eliminate two of the `Map` intermediate variables. The reverse mapping from object to integer I didn’t manage to eliminate.

Funny thing. I started composing a post to explain why such a refactoring was difficult. In trying to explain why it was difficult, I realised that it wasn’t so difficult after all. The new version of the code is indeed simpler than before.

After reviewing the code again, what was the reason(s) you chose to define the function parameter with `edges: V => Seq[(L, V)]` as opposed to `edges: V => Set[(L, V)]`? Technically, a duplicate could be inserted into the `Seq` that is prevented at the type level by specifying a `Set`? Am I missing some fundamental requirement which would prioritize ordering (Seq) over uniqueness (Set)?

1 Like

Good point. I think you are right. It is probably defined this way simply because that’s the form the data was in when I needed the function in the first place. Thanks for pointing that out.

1 Like

I’m confused between some of your requirements specification and then your implementation code. To simplify the discussion, I decided to take a stab at the problem. And then refactor until it converged on your requirements.

I also had the agenda to move it in the direction of making it more amenable to multi-threading such that you can gain a performance advantage. So, it is now moved in that direction (by focusing on a specific shape to the recursion).

However, I didn’t want to invest much more time without validating what I have already assumed and completed.

Here’s my code:

``````def traceGraph[V, L](
vStart: V
, fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Map[(V, V), L]) = {
@tailrec
def recursive(
remainingVs: List[V] = List(vStart)
, visitedVs: Set[V] = Set()
, accumulator: Map[(V, V), L] = Map()
): (Vector[V], Map[(V, V), L]) =
case Some(currentV) =>
val newAccumulatorEntries =
fVtoVAndLs(currentV).map {
case (l, v) => ((currentV, v), l)
}.toMap
val newRemainingVs: Set[V] =
newAccumulatorEntries.map {
case ((_, v), _) => v
}.toSet - visitedVs
recursive(
remainingVs.tail ++ newRemainingVs.toList
, visitedVs + currentV
, accumulator + newAccumulatorEntries
)
case None =>
(
visitedVs
.toVector
//.sorted  //If ordering is available on V, this will ensure deterministic repeatability
, accumulator
)
}

recursive()
}``````

• Is L ensured to be unique across all edges?
• IOW, can `Map[(V, V), L]` be losslessly inversed as `Map[L, (V, V)]`?

This seems essential given your final target of `Vector[Seq[(L, Int)]]` of which I am still not quite sure how to losslessly reproduce.

many edges might have the same `L`.

The graph basically represents a finite automaton. and the L values represent transition criteria between states.

However, for any given vertex of the graph, there is a maximum of one edge labeled by a particular value of type L.

1 Like

Okay. Below is an updated version which is now exactly matched to your return tuple pair value.

My goal was to only use Scala collections library iterators and entirely avoid writing any iterators of my own. There is likely a trivial way to eliminate the `recursive` function with an iterator. I just didn’t didn’t dwell too hard on it as I suspect this is also the place where you could engage a multi-threaded refactoring to enable performing the `fVtoVAndLs` function in parallel since it is driven off the “Streamable” `List[V]`.

This was very enjoyable. Tysvm for the exploratory interaction.

``````def traceGraph[V, L](
vStart: V
, fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Vector[Set[(L, Int)]]) = {
@tailrec
def recursive(
remainingVs: List[V] = List(vStart)
, visitedVs: Set[V] = Set()
, accumulator: Map[V, Set[(L, V)]] = Map()
): Map[V, Set[(L, V)]] =
case Some(currentV) =>
val newAccumulatorEntries =
fVtoVAndLs(currentV)
val newRemainingVs: Set[V] =
newAccumulatorEntries.map(_._2) - visitedVs
recursive(
remainingVs.tail ++ newRemainingVs.toList
, visitedVs + currentV
, accumulator + (currentV, newAccumulatorEntries)
)
case None =>
accumulator
}

val (vs, lAndVss) =
recursive()
.toList
.unzip
val vsIndexed =
vs.zipWithIndex
val indexByV =
vsIndexed.toMap
(
vsIndexed.map(_._1).toVector
, lAndVss.map (
_.map {
case (l, v) => (l, indexByV(v))
}
).toVector
)
}``````

what’s the motivation for multithreading? In my application the huge amount of time and memory is that being consumed by the `edges` function, which you have called `fVtoVAndLs`. In my case this function is performing an extremely expensive computation. The function is more or less the traversal function for the graph, ie. given a vertex, it computes how to traverse to neighboring vertices–given a vertex, the function will tell which are the neighbor vertices and how to get to them. I need to convert this virtual graph into an explicit graph so I can reason about it. E.g., to characterize the paths from initial state to final state, or merge different paths with complicated logic into fewer paths with simpler logic.

Given your graph is large enough, and given your computation is intense enough, and given you have access to serious multi-processor hardware, wouldn’t you be interested in figuring out ways to maximize the parallelism so as to reduce the amount of time spent in `traceGraph`?

If time isn’t a constraint now, could it become so in the future?

In my own related work (which was around AI/ML for the game Go/Baduk/Weiqi years ago), when I crawled graphs similar to this, I was constantly looking for ways to move parallel computation/multi-threading, especially since I was using ANNs (Artificial Neural Nets) and EAs (Evolutionary Algorithms).

1 Like

BTW, with just a bit more thinking, I suspect you will want to change the return value of your function to `Vector[Map[L, Int]]`.

Originally the change was from `Vector[Seq[(L, Int)]]` to `Vector[Set[(L, Int)]]`. However, this is still not quite tuned to the domain requirements you specified where an `V` will have a unique set of `L`s. IOW, using `Set[(L, Int)]`, it is possible for the `Set` to contain an `L` more than once (where the tuple uniqueness is in the variation of the `Int`).

By representing it as `Vector[Map[L, Int]]`, it is now matched exactly to the domain requirements leaving the person following you to need to make far fewer implementation assumptions. And this is important given they may need to refactor it for something like…multi-threading, LOL!

Here’s the code with the change to `Vector[Map[L, Int]]`:

``````def traceGraph[V, L](
vStart: V
, fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Vector[Map[L, Int]]) = {
@tailrec
def recursive(
remainingVs: List[V] = List(vStart)
, visitedVs: Set[V] = Set()
, accumulator: Map[V, Map[L, V]] = Map()
): Map[V, Map[L, V]] =
case Some(currentV) =>
val newAccumulatorEntries =
fVtoVAndLs(currentV)
val newRemainingVs: Set[V] =
newAccumulatorEntries.map(_._2) - visitedVs
recursive(
remainingVs.tail ++ newRemainingVs.toList
, visitedVs + currentV
, accumulator + (currentV, newAccumulatorEntries)
)
case None =>
accumulator
}

val (vs, lAndVss) =
recursive()
.toList
.unzip
val vsIndexed =
vs.zipWithIndex
val indexByV =
vsIndexed.toMap
(
vsIndexed.map(_._1).toVector
, lAndVss.map (
_.map {
case (l, v) => (l, indexByV(v))
}
).toVector
)
}``````