Help joining two lists

jimka · July 19, 2017, 8:35am

Can someone give me advise about how to eliminate the need for the 3rd argument, less, in the following function. Is it really necessary? The function does a “join” of two iterables of tuples, a tuple of (K,V1) and a tuple of (K,V2). The speed comes from the fact that I require type K to be Orderable, and I require the caller to give me the less-than function to use.

Isn’t there some way syntactically to demand that K is itself is an ordered structure, and just use the k1 < k2 syntax, letting the compiler figure out how to compare the elements?

In practice, I always call the function with K being (String,String), and my less function compares the first elements unless they are == in which case it compares the second elements. However, any well defined order would suffice for my function to work.

def join[K,V1,V2](seq1:Iterable[(K,V1)],
                  seq2:Iterable[(K,V2)],
                  less:(K,K)=>Boolean):Iterable[(K,(V1,V2))] = {

  val sorted1 = seq1.toList.sortWith{(a,b) => less(a._1,b._1)}
  val sorted2 = seq2.toList.sortWith{(a,b) => less(a._1,b._1)}

  def joinTail(
                s1:List[(K,V1)],
                s2:List[(K,V2)],
                joined:List[(K,(V1,V2))]):List[(K,(V1,V2))] = {
    (s1,s2) match {
      case (Nil,_) => joined
      case (_,Nil) => joined
      case ((k1,v1)::t1, (k2,v2)::t2) => {
        def common[V](e:(K,V)):Boolean = {
          e._1 == k1
        }
        if (k1 == k2) {
          val (prefix1, suffix1) = s1.span(common)
          val (prefix2, suffix2) = s2.span(common)
          val prefix3 = for {
            x <- prefix1;
            y <- prefix2
          } yield (k1, (x._2, y._2))
          joinTail(suffix1, suffix2, prefix3 ++ joined)
        }
        else if (less(k1, k2)) // k1 < k2
          joinTail(t1, s2, joined)
        else //if ( k2 < k1)
          joinTail(s1, t2, joined)
      }
    }
  }
  joinTail(sorted1,sorted2,List())
}

jimka · July 19, 2017, 8:47am

The second thing I don’t like about my implementation is that I convert the given iterables to Lists. I do this for two reasons, both of which seem a bit dubious to me.

I need to sort them. My suspicion is that it might be faster to sort before converting to a list, but I’m not guaranteed that the given objects, seq1 and seq1, are sort-able. Is there some way to enforce the sortability of seq1 and seq2, or is my assumption wrong that array sorting is faster than list sorting?
The local function joinTail is tail recursive. Given that the lists s1 and s2 are pre-sorted, I can join the lists but “popping” off the leading elements of the list whose first element has key “less-than” the other, until I find two heads with equal keys. In which case I call s1.span and s2.span to collect the corresponding elements which have equivalent keys at the head, and generate the cross product of those.

So I guess the question is, can (should) I avoid converting to List? Can I somehow do something span-like and something tail-recursive-like? Perhaps this can be done with some clever call to fold?

ivanopagano · July 19, 2017, 5:14pm

You would want to look at this doc for a starter

It defines what is usually called a type class to provide custom ordering on any type.

You need to implement the trait and then require a “type bound” to your K type in the signature, constraining it to have an implicit Ordering for it in scope.

Actually this is what the sortBy method does for sequences whose content type has a natural ordering availlable (look here for an example)

ivanopagano · July 19, 2017, 5:19pm

Looking at the Ordering object docs, you’ll also discover that you just need an Ordering[K] and an Ordering[V] to have a derived ordering for the tuple, Ordering[(K, V)], which could end up being useful for what you need.

ivanopagano · July 19, 2017, 5:41pm

a sample code could be

  /* these are bounds on the types
   * K: Ordering is converted to an additioanl implicit parameter
   * (implicit o: Ordering[K]) and so on
   *
   * having those in scope, lets you call the list.sorted methods
   */
  def join[K: Ordering, V1: Ordering, V2: Ordering](
    seq1: Iterable[(K, V1)],
    seq2: Iterable[(K, V2)]): Iterable[(K, (V1, V2))] = {

    //this import allows you to use the symbolic infix operators to compare
    import scala.math.Ordering.Implicits._

    val sorted1 = seq1.toList.sorted
    val sorted2 = seq2.toList.sorted

    def joinTail(
                  s1: List[(K, V1)],
                  s2: List[(K, V2)],
                  joined: List[(K, (V1, V2))]): List[(K, (V1, V2))] = {
      (s1, s2) match {
        case (Nil,_) => joined
        case (_,Nil) => joined
        case ((k1, v1) :: t1, (k2, v2) :: t2) => {
          def common[V](e: (K, V)): Boolean = e._1 == k1
          if (k1 == k2) {
            val (prefix1, suffix1) = s1.span(common)
            val (prefix2, suffix2) = s2.span(common)
            val prefix3 = for {
              x <- prefix1;
              y <- prefix2
            } yield (k1, (x._2, y._2))
            joinTail(suffix1, suffix2, prefix3 ++ joined)
          }
          else if (k1 < k2) // k1 < k2 is now available thanks to the import above
            joinTail(t1, s2, joined)
          else //if ( k2 < k1)
            joinTail(s1, t2, joined)
        }
      }
    }
    joinTail(sorted1, sorted2, List.empty)
  }

ivanopagano · July 19, 2017, 5:46pm

For the moment I’d say that you’re bound to the List type for two main reasons:

Iterable in general doesn’t have a sorted method and relatives
You’re pattern matching on the :: constructor in the tail-rec method, and you can’t do it on any other collection, that being specific to the List data structure

jimka · July 20, 2017, 9:10am

Thanks Ivanopanago, But I don’t understand why V1 and V2 need to be ordered. I’m only ever using < with objects of type K, never with objects of type V1 or V2. Would you mind explaining your reasoning. Thanks for the advise.

jimka · July 20, 2017, 9:21am

Hi ivanopagano, can I ask another question about your changes to my code. Syntax. Learning syntax for a new language is hard because the new user can’t just look it up, as syntax has context.
My question is about the way you’ve declared join.
In particular join[K: Ordering, V1: Ordering, V2: Ordering] , I’ve seen the syntax of [K <: Ordering] to declare a subclass relationship; that’s somewhat standard notation in functional programming [cf Pierce]. However, I don’t understand what [K:Ordering] means in a function declaration. Could you please explain the difference?

Jasper-M · July 20, 2017, 9:56am

The : in def foo[A: Ordering] = ??? is called a context bound. It is short for def foo[A](implicit evidence$1: Ordering[A]) = ???.

jimka · July 20, 2017, 3:25pm

Here is an updated version of the function. I’ve simplified it a bit, at least lines-of-code-wise, not sure if i’ve worsened or improved the complexity. What I’ve done is used groupBy to assure that each key exists only once in each given iterable seq1 and seq2. This eliminates the need to call span in the tail call sub-function. It also seems that only K needs to be orderable, not V1 nor V2.

def join[K : Ordering,V1,V2](
         seq1:Iterable[(K,V1)],
         seq2:Iterable[(K,V2)])  :   Iterable[(K,(V1,V2))] = {

  val grouped1: List[(K, Iterable[(K, V1)])] = seq1.groupBy(_._1).toList.sortBy{_._1}
  val grouped2: List[(K, Iterable[(K, V2)])] = seq2.groupBy(_._1).toList.sortBy{_._1}

  def joinTail(
                s1:List[(K,Iterable[(K, V1)])],
                s2:List[(K,Iterable[(K, V2)])],
                joined:List[(K,(V1,V2))]) : List[(K,(V1,V2))] = {

    (s1,s2) match {
      case (Nil,_) => joined
      case (_,Nil) => joined
      case ((k1,i1)::t1, (k2,i2)::t2) => {
        if (k1 == k2) {
          val prefix = for {
            x <- i1;
            y <- i2
          } yield (k1, (x._2, y._2))
          joinTail(t1, t2, prefix.toList ++ joined)
        }
        else if ( k1 < k2)
          joinTail(t1, s2, joined)
        else //if ( k2 < k1)
          joinTail(s1, t2, joined)
      }
    }
  }
  joinTail(grouped1,grouped2,List())
}

jimka · July 21, 2017, 9:51am

I can also convert this tail recursive function on lists, into a tail recursive function on array indices. But it is not clear to me whether working with very large lists is better than large arrays, especially since the arguments seq1 and seq1 are Iterables, rather than some concrete type.

The important difference is the call to seq1.groupBy(_._1).toArray.sortBy{_._1}, which coverts some potentially HUGE object seq of unknown type into another HUGE object of unknown type by a call to groupBy, then explicitly to a HUGE Array by a call to toArray, then to yet another HUGE array by a call to sortBy. There is of course some hope that groupBy produces an array which is smaller than the original because hopefully many keys are repeated.

def join[K : Ordering,V1,V2](seq1:Iterable[(K,V1)],seq2:Iterable[(K,V2)]):Iterable[(K,(V1,V2))] = {

  val grouped1: Array[(K, Iterable[(K, V1)])] = seq1.groupBy(_._1).toArray.sortBy{_._1}
  val grouped2: Array[(K, Iterable[(K, V2)])] = seq2.groupBy(_._1).toArray.sortBy{_._1}

  def joinTail(
                n1:Int,
                n2:Int,
                joined:List[(K,(V1,V2))]) : List[(K,(V1,V2))] = {
    if ( n1 == grouped1.length)
      joined
    else if ( n2 == grouped2.length)
      joined
    else (grouped1(n1),grouped2(n2)) match {
      case ((k1,iter1), (k2,iter2)) => {
        if (k1 == k2) {
          val prefix = for {
            x <- iter1;
            y <- iter2
          } yield (k1, (x._2, y._2))
          joinTail(n1+1, n2+1, prefix.toList ++ joined )
        }
        else if ( k1 < k2)
          joinTail(n1+1, n2, joined)
        else // if ( k2 < k1)
          joinTail(n1, n2+1, joined)
      }
    }
  }
  joinTail(0,0,List())
}