Serializing a computationally expensive graph

jimka · May 12, 2021, 8:55am

I’d like to recount an experience where I think I came up with a good solution using Scala. The implementation uses classical tail recursive approach, because that seems most elegant to me. However, I’m open to criticism about whether the code might be better (by some definition of better) using some library. I don’t really see how to do this with fold or reduce.

What is the problem to solve?

I have a graph (graph in the mathematical sense). The graph consists of a finite set of vertices and a finite set of directed, labeled edges. However, I don’t yet know the graph. Rather I have a starting vertex, and a function which when given a vertex will return a set of edges, each having a label and a destination vertex.

So in Scala I have a starting vertex v0 of type V, and I have a function edges of type V=>Seq[(L,V)]. So V is the type of the vertex and L is type type of the label.

Now, since the edges function is computationally difficult to compute (for my application), and because I’d like to easily be able to iterate through all the vertices, I’d like to generate a data structure which helps me do that. I.e., a data structure which contains all the vertex/edge information which the edges function is able to compute.

Furthermore, I would like to use an array with integer indices in the end so I can access any vertex in constant time.

So my data structure should contain an array, vs of vertices so that vs(n) is vertex n. And an array, es, of edge sequences so that es(n) is a sequence of all edges with origin n.

My final code is given below. Several observations about the experience of writing this function.

I had to think a couple of days how to factor the problem. What made it doable (for me) was forgetting the specific types of in the graph, and the specific edge generation function, and just think about it as an abstract graph.
I’m using parameterized types, not make the function reusable, but simply to make it more understandable.
Internal to the function there is a tail recursive helper function which uses Map to associate integers with vertices, and vertices with integers. This is easier for debugging, but they Map[Int,X] objects are converted to Vector[X] in the final return value
I use a single tail recursive helper function to achieve a loop-within-a-loop iteration. The outer loop recurs on vertices; the inner loop recurs on edges of that vertex.
the graph is neither traversed in BFS nor DFS order, but rather in random order. Each time we simply traverse into the vertex which is in one Map but not in the other. So finally all vertices are in both Maps, as they are finitely many such vertices. This code is concise, but sadly not efficient.

  def traceGraph[V, L](v0: V, edges: V => Seq[(L, V)]): (Vector[V], Vector[Seq[(L, Int)]]) = {
    val s0: Seq[(L, Int)] = Seq()

    @tailrec
    def recur(v: V,
              i: Int,
              es: List[(L, V)],
              intToV: Map[Int, V],
              vToInt: Map[V, Int],
              m: Map[Int, Seq[(L, Int)]]): (Map[Int,V], Map[Int,Seq[(L, Int)]]) = {
      es match {
        case (label, v1) :: lvs if !vToInt.contains(v1) =>
          recur(v, i + 1, es, intToV + (i -> v1), vToInt + (v1 -> i), m)
        case (label, v1) :: lvs =>
          val currentEdges: Seq[(L, Int)] = m.getOrElse(vToInt(v), s0)
          recur(v, i, lvs, intToV, vToInt, m + (vToInt(v) -> conj(currentEdges, label -> vToInt(v1))))
        case Nil =>
          // TODO, certainly, there is a way to compute v2 without a linear search.
          //  we want to find a vertex which we have learned exists, but whose
          //  edges have not yet been computed.
          intToV.keys.find(k => !m.contains(k)).map(intToV) match {
            case Some(v2) => recur(v2, i, edges(v2).toList, intToV, vToInt, m)
            case None => (intToV, m)
          }
      }
    }

    val (vertexMap,edgeMap) = recur(v0, 1, edges(v0).toList, Map(0 -> v0), Map(v0 -> 0), Map())
    (Vector.tabulate(vertexMap.size)(vertexMap),
      Vector.tabulate(edgeMap.size)(edgeMap))
  }

jimka · May 12, 2021, 10:40am

One optimization which can be made pretty easily is to remove the linear search

intToV.keys.find(k => !m.contains(k)).map(intToV)

by simply traversing the vertices in increasing order, first vertex 0, then vertex 1, and so on.
We can detect when the vertex number is larger than the largest vertex by comparing to the variable i in the recursive function, which is alway (coincidentally) equal to the number of vertices already visited.

This optimisation saves a linear search, but on the other hand is a pretty cryptic termination condition.

chaotic3quilibrium · May 12, 2021, 2:06pm

Just to clarify, given vertices A and B, (A, B, Label1) is distinct and desirable from (B, A, Label2)? The assumption comes from your having mentioned the “…directed labeled edges…”.

And secondly, are there any requirements (working memory constraints) for the final target result be produced in a single pass? Or are two passes acceptable? IOW, can an interim result of Map[(V, V), L] be produced, and then the two Arrays specified as your desired result be derived/produced from that Map?

jimka · May 13, 2021, 9:19am

Yes, the edges are directed and labeled. Thus from A to B is different from B to A, and labeled X is different than labeled Y. On the other hand self loops are allowed. And if two parallel edges have the same label, that is indeed something I didn’t define. I.e., are two edges from A to B labeled L considered the same edge or different edges? Yes, I haven’t specified that. I need to think about it. For the particular application I’m using the function in, they are considered the same edge.

jimka · May 13, 2021, 9:21am

Here’s an optimization, which I think is more cryptic, but eliminates the redundant linear search.

  def traceGraph[V, L](v0: V, edges: V => Seq[(L, V)]): (Vector[V], Vector[Seq[(L, Int)]]) = {
    val s0: Seq[(L, Int)] = Seq()
    // The local function, recur, traverses the graph generated by the function, edges.
    // The traversal is in numerical order.  The initial vertex is 0, the next
    // vertex generated is 1.  Once the edges of vertex 0 are generated, several
    // vertices may have been generated, but vertex 1 is traversed next, then
    // 2, 3, ...
    // The variable, nextAvailable, keeps track of the next available
    //    vertex number.  When the edges function returns a new v which
    //    has not yet been numbered, then it gets assigned nextAvailable.
    // nextAvailable gets incremented (via recursion) whenever a new vertex
    //    is encountered, and associated with the current value of nextAvailable.
    // The recursion terminates when the index of v is == to nextAvailable
    //    because that means no new vertices have been encountered, thus
    //    all vertices have been visited.
    @tailrec
    def recur(v: V,
              nextAvailable: Int,
              es: List[(L, V)],
              intToV: Map[Int, V],
              vToInt: Map[V, Int],
              m: Map[Int, Seq[(L, Int)]]): (Map[Int,V], Map[Int,Seq[(L, Int)]]) = {

      es match {
        case (label, v1) :: lvs if !vToInt.contains(v1) =>
          // if we are seeing v1 for the first time, then register it
          //   in intToV and in vToInt.
          recur(v, nextAvailable + 1, es, intToV + (nextAvailable -> v1), vToInt + (v1 -> nextAvailable), m)
        case (label, v1) :: lvs =>
          val currentEdges: Seq[(L, Int)] = m.getOrElse(vToInt(v), s0)

          recur(v, nextAvailable, lvs, intToV, vToInt,
                m + (vToInt(v) -> conj(currentEdges, label -> vToInt(v1))))
        case Nil =>
          val next = vToInt(v)+1
          // The termination condition occurs if next == nextAvailable, because
          //   this means now new vertices have been encountered which have not
          //   yet been treated.
          if (next < nextAvailable) {
            val v2 = intToV(next)
            recur(v2, nextAvailable, edges(v2).toList, intToV, vToInt, m)
          }
          else
            (intToV, m)
      }
    }

    val (vertexMap,edgeMap) = recur(v0, 1, edges(v0).toList, Map(0 -> v0), Map(v0 -> 0), Map())
    (Vector.tabulate(vertexMap.size)(vertexMap),
      Vector.tabulate(edgeMap.size)(edgeMap))
  }

This uses function conj defined here.

jimka · May 13, 2021, 9:28am

I’m not sure what the memory requirements are for the construction phase. I suppose the entire process (for which this function is only a part) might be extremely memory intensive. However, the final result represents a finite-state-machine (DFA). The transitions enumerate transition conditions and destination states. So the fact that destination states are integers, which also happen to be indexes into the array of that actual state, means the DFA can be simulated efficiently.

jimka · May 13, 2021, 9:30am

That is a good point, it is probably the case that the final arrays can be created directly without needing the intermediate Maps. It could be a good idea in the future to refactor to create those vectors directly. Before doing this I’d want a good test suite to verify that the translation into terser code hasn’t introduced a bug.

chaotic3quilibrium · May 13, 2021, 2:37pm

Agreed. My preference on problems like this is to first ensure proper function (sacrificing efficiency). And then optimize for efficiency (against which the tests validating the proper function will increase confidence that optimizations don’t break test assumptions). And optimizing for efficiency might end up including a complete re-implementation of the interface.

Hence, my asking about the Map[(V, V), L]. The recursive function to produce that interim Map is substantially simpler such that it likely can even be accomplished with the standard library (already deeply validated) iterators. Same for the functions to produce the two desired Arrays.

It sounds like you have already invested quite a bit in the original solution pattern. So, it might not be useful to re-think and refactor to this simpler case. However, if you end up on enough “oops, I gotta touch this, and fix that, etc.” tangents (which seem inevitable when I attempt to combine functionality with efficiency in a single pass), then you might consider a quick (less than 30m) run at this simpler two stage approach to see if the complexity reduction helps you get past your block.

Best of luck with your Scala directed graph adventure!

jimka · May 17, 2021, 6:42am

BTW @ chaotic3quilibrium, I finally refactored the code to eliminate two of the Map intermediate variables. The reverse mapping from object to integer I didn’t manage to eliminate.

Funny thing. I started composing a post to explain why such a refactoring was difficult. In trying to explain why it was difficult, I realised that it wasn’t so difficult after all. The new version of the code is indeed simpler than before.

chaotic3quilibrium · May 17, 2021, 10:50pm

After reviewing the code again, what was the reason(s) you chose to define the function parameter with edges: V => Seq[(L, V)] as opposed to edges: V => Set[(L, V)]? Technically, a duplicate could be inserted into the Seq that is prevented at the type level by specifying a Set? Am I missing some fundamental requirement which would prioritize ordering (Seq) over uniqueness (Set)?

jimka · May 18, 2021, 7:03am

Good point. I think you are right. It is probably defined this way simply because that’s the form the data was in when I needed the function in the first place. Thanks for pointing that out.

chaotic3quilibrium · May 18, 2021, 4:41pm

I’m confused between some of your requirements specification and then your implementation code. To simplify the discussion, I decided to take a stab at the problem. And then refactor until it converged on your requirements.

I also had the agenda to move it in the direction of making it more amenable to multi-threading such that you can gain a performance advantage. So, it is now moved in that direction (by focusing on a specific shape to the recursion).

However, I didn’t want to invest much more time without validating what I have already assumed and completed.

Here’s my code:

def traceGraph[V, L](
    vStart: V
  , fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Map[(V, V), L]) = {
  @tailrec
  def recursive(
      remainingVs: List[V] = List(vStart)
    , visitedVs: Set[V] = Set()
    , accumulator: Map[(V, V), L] = Map()
  ): (Vector[V], Map[(V, V), L]) =
    remainingVs.headOption match {
      case Some(currentV) =>
        val newAccumulatorEntries =
          fVtoVAndLs(currentV).map {
            case (l, v) => ((currentV, v), l)
          }.toMap
        val newRemainingVs: Set[V] =
          newAccumulatorEntries.map {
            case ((_, v), _) => v
          }.toSet - visitedVs
        recursive(
            remainingVs.tail ++ newRemainingVs.toList
          , visitedVs + currentV
          , accumulator + newAccumulatorEntries
        )
      case None =>
        (
            visitedVs
              .toVector
              //.sorted  //If ordering is available on V, this will ensure deterministic repeatability
          , accumulator
        )
    }

  recursive()
}

chaotic3quilibrium · May 18, 2021, 4:45pm

I have one additional question:

Is L ensured to be unique across all edges?
IOW, can Map[(V, V), L] be losslessly inversed as Map[L, (V, V)]?

This seems essential given your final target of Vector[Seq[(L, Int)]] of which I am still not quite sure how to losslessly reproduce.

jimka · May 18, 2021, 5:45pm

many edges might have the same L.

The graph basically represents a finite automaton. and the L values represent transition criteria between states.

However, for any given vertex of the graph, there is a maximum of one edge labeled by a particular value of type L.

chaotic3quilibrium · May 18, 2021, 7:34pm

Okay. Below is an updated version which is now exactly matched to your return tuple pair value.

My goal was to only use Scala collections library iterators and entirely avoid writing any iterators of my own. There is likely a trivial way to eliminate the recursive function with an iterator. I just didn’t didn’t dwell too hard on it as I suspect this is also the place where you could engage a multi-threaded refactoring to enable performing the fVtoVAndLs function in parallel since it is driven off the “Streamable” List[V].

This was very enjoyable. Tysvm for the exploratory interaction.

def traceGraph[V, L](
    vStart: V
  , fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Vector[Set[(L, Int)]]) = {
  @tailrec
  def recursive(
      remainingVs: List[V] = List(vStart)
    , visitedVs: Set[V] = Set()
    , accumulator: Map[V, Set[(L, V)]] = Map()
  ): Map[V, Set[(L, V)]] =
    remainingVs.headOption match {
      case Some(currentV) =>
        val newAccumulatorEntries =
          fVtoVAndLs(currentV)
        val newRemainingVs: Set[V] =
          newAccumulatorEntries.map(_._2) - visitedVs
        recursive(
            remainingVs.tail ++ newRemainingVs.toList
          , visitedVs + currentV
          , accumulator + (currentV, newAccumulatorEntries)
        )
      case None =>
        accumulator
    }

  val (vs, lAndVss) =
    recursive()
    .toList
    .unzip
  val vsIndexed =
    vs.zipWithIndex
  val indexByV =
    vsIndexed.toMap
  (
      vsIndexed.map(_._1).toVector
    , lAndVss.map (
        _.map {
          case (l, v) => (l, indexByV(v))
        }
      ).toVector
  )
}

jimka · May 19, 2021, 7:50am

what’s the motivation for multithreading? In my application the huge amount of time and memory is that being consumed by the edges function, which you have called fVtoVAndLs. In my case this function is performing an extremely expensive computation. The function is more or less the traversal function for the graph, ie. given a vertex, it computes how to traverse to neighboring vertices–given a vertex, the function will tell which are the neighbor vertices and how to get to them. I need to convert this virtual graph into an explicit graph so I can reason about it. E.g., to characterize the paths from initial state to final state, or merge different paths with complicated logic into fewer paths with simpler logic.

chaotic3quilibrium · May 19, 2021, 1:28pm

Given your graph is large enough, and given your computation is intense enough, and given you have access to serious multi-processor hardware, wouldn’t you be interested in figuring out ways to maximize the parallelism so as to reduce the amount of time spent in traceGraph?

If time isn’t a constraint now, could it become so in the future?

In my own related work (which was around AI/ML for the game Go/Baduk/Weiqi years ago), when I crawled graphs similar to this, I was constantly looking for ways to move parallel computation/multi-threading, especially since I was using ANNs (Artificial Neural Nets) and EAs (Evolutionary Algorithms).

chaotic3quilibrium · May 19, 2021, 1:40pm

BTW, with just a bit more thinking, I suspect you will want to change the return value of your function to Vector[Map[L, Int]].

Originally the change was from Vector[Seq[(L, Int)]] to Vector[Set[(L, Int)]]. However, this is still not quite tuned to the domain requirements you specified where an V will have a unique set of Ls. IOW, using Set[(L, Int)], it is possible for the Set to contain an L more than once (where the tuple uniqueness is in the variation of the Int).

By representing it as Vector[Map[L, Int]], it is now matched exactly to the domain requirements leaving the person following you to need to make far fewer implementation assumptions. And this is important given they may need to refactor it for something like…multi-threading, LOL!

Here’s the code with the change to Vector[Map[L, Int]]:

def traceGraph[V, L](
    vStart: V
  , fVtoVAndLs: V => Set[(L, V)] //does not ensure direction symmetry; i.e. could provide edge VA -> VB and then not provide edge VB -> VA
): (Vector[V], Vector[Map[L, Int]]) = {
  @tailrec
  def recursive(
      remainingVs: List[V] = List(vStart)
    , visitedVs: Set[V] = Set()
    , accumulator: Map[V, Map[L, V]] = Map()
  ): Map[V, Map[L, V]] =
    remainingVs.headOption match {
      case Some(currentV) =>
        val newAccumulatorEntries =
          fVtoVAndLs(currentV)
        val newRemainingVs: Set[V] =
          newAccumulatorEntries.map(_._2) - visitedVs
        recursive(
            remainingVs.tail ++ newRemainingVs.toList
          , visitedVs + currentV
          , accumulator + (currentV, newAccumulatorEntries)
        )
      case None =>
        accumulator
    }

  val (vs, lAndVss) =
    recursive()
    .toList
    .unzip
  val vsIndexed =
    vs.zipWithIndex
  val indexByV =
    vsIndexed.toMap
  (
      vsIndexed.map(_._1).toVector
    , lAndVss.map (
        _.map {
          case (l, v) => (l, indexByV(v))
        }
      ).toVector
  )
}