Best way to generate pairs

Is there a better way to generate a list of pairs (x, y) from a given list that preserves the relative order of x and y? For example, computePairs(List(1,2,3)) returns a list containing (1,2), (1,3), and (2,3) but not (2,1), (3,1), or (3,2). It seems my solution is more complicated than necessary.

It seems fairly common that I encounter functions which I only know how to write using tail recursion.

  import scala.annotation.tailrec

  def computePairs[T](data: List[T]): List[(T, T)] = {
    @tailrec
    def recur(data: List[T], data2: List[T], acc: List[(T, T)]): List[(T, T)] =
      (data, data2) match {
        case (Nil, _)              => acc
        case (_ :: Nil, Nil)       => acc
        case (h1 :: _, h2 :: tail) => recur(data, tail, (h1 -> h2) :: acc)
        case (_ :: rest, Nil)      => recur(rest, rest.tail, acc)
      }

    if (data.isEmpty) Nil
    else recur(data, data.tail, Nil)
  }

Something to this effect:

def computePairs[T](data: List[T]): List[(T, T)] =
  data
    .tails
    .filter(_.nonEmpty)
    .flatMap(xs => xs.tail.map((xs.head, _)))
    .toList
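In case it helps to see why the filter(_.nonEmpty) step is there, here is a minimal sketch (my own, not from the thread) of what tails produces and how the pairs fall out of it:

```scala
// tails enumerates the suffixes of a list, longest first, ending with the empty list
val suffixes = List(1, 2, 3).tails.toList
// List(List(1, 2, 3), List(2, 3), List(3), List())

// pairing each suffix's head with the rest of that suffix yields the ordered pairs;
// the empty suffix has no head, hence the nonEmpty filter
val pairs = List(1, 2, 3).tails
  .filter(_.nonEmpty)
  .flatMap(xs => xs.tail.map((xs.head, _)))
  .toList
// List((1,2), (1,3), (2,3))
```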

Or if you wanna be really lame:

def computePairs[T](data: List[T]): List[(T, T)] =
  data
    .combinations(2)
    .map{ case List(a, b) => (a, b) }
    .toList

But that just takes all the fun out of it, doesn’t it? :joy:

It’s also not exactly the same thing in case your data can contain duplicates.
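A minimal sketch (my own example) of the difference: combinations deduplicates, so repeated elements yield fewer pairs than the suffix-based approach, which keeps every positional pair.

```scala
val data = List(1, 1, 2)

// suffix-based pairing keeps every positional pair, duplicates included
val viaTails = data.tails
  .filter(_.nonEmpty)
  .flatMap(xs => xs.tail.map((xs.head, _)))
  .toList
// List((1,1), (1,2), (1,2))

// combinations(2) treats the two 1s as interchangeable and deduplicates,
// producing only two distinct combinations
val viaCombinations = data.combinations(2)
  .map { case List(a, b) => (a, b) }
  .toList
```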

That implementation is REALLY SLOW. On a test case of length 7311, my implementation finishes in about a second (wall-clock timing); I let Jasper-M’s implementation run for 5 minutes and it never finished.

That’s strange. All three seem to run out of memory for me on a default setup.

I changed my heap size to 2500 MB.

Without really measuring, your solution and my tails solution seem similar in memory and time usage, though they take considerably longer than a second for me. I wonder what kind of machine you’re using.
That the combinations one is so slow might be related to the duplicate checking.

Here’s what I’m using

Here are my three functions


  def computePairs3[T](data: List[T]): List[(T, T)] =
    data
      .tails
      .filter(_.nonEmpty)
      .flatMap(xs => xs.tail.map((xs.head, _)))
      .toList

  def computePairs2[T](data: List[T]): List[(T, T)] =
    data
      .combinations(2)
      .map{ case List(a, b) => (a, b) }
      .toList

  import scala.annotation.tailrec

  def computePairs[T](data: List[T]): List[(T, T)] = {
    @tailrec
    def recur(data: List[T], data2: List[T], acc: List[(T, T)]): List[(T, T)] =
      (data, data2) match {
        case (Nil, _)              => acc
        case (_ :: Nil, Nil)       => acc
        case (h1 :: _, h2 :: tail) => recur(data, tail, (h1 -> h2) :: acc)
        case (_ :: rest, Nil)      => recur(rest, rest.tail, acc)
      }

    if (data.isEmpty) Nil
    else recur(data, data.tail, Nil)
  }

I’ve set up a small function to test the execution using System.nanoTime


  def main(argv: Array[String]): Unit = {
    for {
      i <- 800 to 5000 by 100
      data = (1 to i).toList
    } {
      println(s"length = $i")
      val t0 = System.nanoTime()
      computePairs(data)
      val t1 = System.nanoTime()
      println(s"computePairs ${(t1 - t0) / 1e9}")
      computePairs2(data)
      val t2 = System.nanoTime()
      println(s"computePairs2 ${(t2 - t1) / 1e9}")
      computePairs3(data)
      val t3 = System.nanoTime()
      println(s"computePairs3 ${(t3 - t2) / 1e9}")
    }
  }

It seems the tail recursive version is by far the fastest.

length = 2000
computePairs 0.011500182
computePairs2 19.731632092
computePairs3 0.05089523
length = 2100
computePairs 0.0128521
computePairs2 21.493444822
computePairs3 0.061381673
length = 2200
computePairs 0.013651585
computePairs2 25.100234025
computePairs3 0.066952978
length = 2300
computePairs 0.014679647
computePairs2 29.491569607
computePairs3 0.074526357
length = 2400
computePairs 0.018036686
computePairs2 31.925753898
computePairs3 0.07187178
length = 2500
computePairs 0.017542373
computePairs2 36.101928877
computePairs3 0.086673232

How about this:

val list = List(1, 2, 3, 4)

val result = list.tails.foldLeft[List[(Int, Int)]](Nil)((acc, subList) => {
  subList match {
    case head :: tail => tail.map((head, _)).reverse ::: acc
    case _ => acc
  }
}).reverse

println(result) // prints: List((1,2), (1,3), (1,4), (2,3), (2,4), (3,4))

???

Or this:

val list = List(1, 2, 3, 4)

val result = List.unfold(list) {
  case head :: tail => Some(tail.map((head, _)) -> tail)
  case Nil => None
}.flatten

println(result) // prints: List((1,2), (1,3), (1,4), (2,3), (2,4), (3,4))

.reverse is not necessary. I don’t care about the order of the computed list.

I don’t seem to have List.unfold

I added the function

  def computePairs4[T](data:List[T]):List[(T,T)] =
    data.tails.foldLeft[List[(T, T)]](Nil)((acc, subList) => {
      subList match {
        case head :: tail => tail.map((head, _)) ::: acc
        case _ => acc
      }
    })

So I ran the timing test again.

length = 1000
computePairs 0.033336759
computePairs2 2.66880161
computePairs3 0.030787732
computePairs4 0.01413223
length = 2000
computePairs 0.125602738
computePairs2 18.915179939
computePairs3 0.036282752
computePairs4 0.026724497
length = 3000
computePairs 0.026496565
computePairs2 62.851550954
computePairs3 0.09019425
computePairs4 0.069243984
length = 4000
computePairs 4.581079107
computePairs2 272.994857273
computePairs3 1.659703356
computePairs4 13.428980719

List.unfold was added in Scala 2.13. There is also Iterator.unfold and many more unfolds.

So now computePairs3 is faster than computePairs? Looks like your benchmark results are rather unstable. I would consider using sbt-jmh if performance numbers are really important.

I have a suspicion that your recursive version is the fastest because it generates the least amount of non-optimizable garbage. Even here: tail.map((head, _)) ::: acc - first the prefix is generated (tail.map((head, _))), and then prepending it to acc requires rebuilding the prefix to point to the new tail.
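A small check (my own sketch) of the sharing behaviour described above: ::: can reuse the right-hand list as-is, but it has to allocate fresh cons cells for the whole prefix, because every prefix cell needs a different tail than it had before.

```scala
val prefix = List(1, 2, 3)
val suffix = List(4, 5)
val joined = prefix ::: suffix

// the suffix is shared structurally: walking past the prefix
// lands on the very same suffix object, no copying on the right
val sharesSuffix = joined.drop(prefix.length) eq suffix

// the prefix cells are fresh allocations: joined does not
// begin with prefix's own cons cells
val sharesPrefix = joined eq prefix
```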

For reducing garbage probably something like that would be good:

  def computePairs[T](data: List[T]): List[(T, T)] = {
    (for (head :: tail <- data.tails) yield {
      tail.iterator.map((head, _))
    }).flatten.toList
  }

or if you manage to get Scala 2.13 working then:

  def computePairs[T](data: List[T]): List[(T, T)] = {
    Iterator.unfold(data) {
      case head :: tail => Some(tail.iterator.map((head, _)) -> tail)
      case Nil => None
    }.flatten.toList
  }

It seems to me that building a list just in order to flatten it is inherently wasteful. Using flatMap avoids the need to create an intermediate object just to flatten it later.
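As a sketch of that idea (my own variant, not anyone’s posted code): flatMapping directly over the tails iterator produces the pairs without materialising a top-level list of lists first.

```scala
// pairs each suffix head with the rest of its suffix, flattening as it goes;
// nothing is built only to be flattened afterwards
def computePairsFlat[T](data: List[T]): List[(T, T)] =
  data.tails.flatMap {
    case head :: tail => tail.map((head, _))
    case Nil          => Nil
  }.toList
```

Each suffix still allocates its inner list of pairs, but the outer structure is consumed lazily by the iterator.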

Yes, I want to convert the project to a newer Scala version eventually. Perhaps it is time to do so.

The test case is done simply with lists of integers. But in my actual program the objects are heftier. I suppose this is why I was getting an out-of-memory error.

In some examples I’ve built lists of lists, but in others I’ve built iterators of iterators. Creating or flattening iterators in Scala 2.13 should be a constant-time operation, as there are many iterator implementations and specialized methods on iterators.

list.tails returns an Iterator; so do list.iterator and Iterator.unfold (obviously).

Another Scheme-like approach is to write a function that maps over the cons cells, and a function that does collection using a poor man’s continuation-passing style. Here is an example implementation:

  import scala.annotation.tailrec

  // call the given client function on each cons cell in the given list,
  // discarding the return values
  def mapConsCells[T](data:List[T])(client:List[T]=>Unit):Unit = {
    @tailrec
    def recur(data: List[T]):Unit = {
      if (data.nonEmpty) {
        client(data)
        recur(data.tail)
      }
    }
    recur(data)
  }

  def withCollector[T](continuation: (T => Unit) => Unit): List[T] = {
    var objects: List[T] = Nil

    def collect(obj: T): Unit = {
      objects = obj :: objects
    }

    continuation(collect)
    objects
  }

  def computePairs5[T](data:List[T]):List[(T,T)] = {
    withCollector(collect => {
      mapConsCells(data) { c: List[T] =>
        val a = c.head
        c.tail.foreach { b: T => collect((a, b)) } }
    })
  }