Merging Scala Maps Efficiently

borde · March 18, 2017, 6:50pm

I have two Scala Maps and I wanted to merge them. Signature of Maps look like this :

case Person(name: String, dob: Option[Int] = None, ssn: Option[Int] = None)

Map[Int, Person]. Person is Case class object.

1.Map((1->Person(name=“abc”,dob=None, ssn=1234)),(2->Person(name=“pqr”,dob=22900123, ssn=None)) )
2.Map((1->Person(name=“abc”,dob=11111, ssn=None)),(4->Person(name=“mno”,dob=22900123, ssn=None)))

I have these two maps and want to merge them. End map should have missing information from colliding keys.
Result will be :

Map((1->Person(name=“abc”,dob=11111, ssn=1234)), (2->Person(name=“pqr”,dob=22900123, ssn=None)), (4->Person(name=“mno”,dob=22900123, ssn=None)))

As both map has Key =1 and while merging Person information is merged if it is not available in any of the Map
and for any other case I should be adding it into final Map.

Blaisorblade · March 18, 2017, 8:00pm

To merge the actual dob and ssn fields on collision, you can just use orElse:
dob1.orElse(dob2) will do what you want (favoring dob1 if both are defined).

The harder part is merging the maps efficiently. You could just iterate over the second map and add its elements one-by-one, and I think this has the right asymptotic cost, but there are more efficient alternatives.

I had this problem once, and a good solution is not trivial. I forgot too many details to give a good answer quickly, but I remember that the best efficient answer I found used HashMap.merged:

def merged[B1 >: B](that: HashMap[A, B1])(mergef: MergeFunction[A, B1]): HashMap[A, B1]
Creates a new map which is the merge of this and the argument hash map.

Uses the specified collision resolution function if two keys are the same. The collision resolution function will always take the first argument from this hash map and the second from that.

The merged method is on average more performant than doing a traversal and reconstructing a new immutable hash map from scratch, or ++.

That does not work on arbitrary Map but you can convert your Map to HashMap (or change your code to give you HashMap, quite possibly more efficient).

That method takes a function that handles collisions… but which, sometimes, will be called with null arguments when it shouldn’t (which seems a bug but I never reported it), and then you need a workaround (see my code). In the end, my implementation was the one here—that’s implementing bags so it’s not quite what you need but I hope it helps.

(Hm, I’m restricted from linking… EDIT I’m a new user and suffering from Onebox prevents new user posting due to too many links - bug - Discourse Meta).

Blaisorblade · March 18, 2017, 8:02pm

Here’s the link to my code (sorry I can’t post a real link, please copy-n-paste it):

https://github.com/inc-lc/ilc-scala/blob/9a158ae7757eab62b91ef2fd2196adda810f6ad5/src/main/scala/ilc/feature/bags/Library.scala#L11-L37

And here’s the link to docs: HashMap.merged docs

borde · March 19, 2017, 6:10pm

@Blaisorblade : Hey Thank you for helping. As I’m pretty new to the Scala, Could you please provide actual solution or function (even it is not efficient) using the one you were saying by dob1.orElse(dob2). At least I can learn what you are thinking there.
By the way thank you again.

SethTisue · March 20, 2017, 1:48pm

The piece involving orElse is:

def mergePersons(p1: Person, p2: Person) =
  Person(p1.name, p1.dob.orElse(p2.dob), p1.ssn.orElse(p2.ssn))

(Paolo, your trust has been upgraded so you can post links.)

LannyRipple · March 22, 2017, 5:05pm

For anyone else following the docs for HashMap.merged

  type HashMap[A, +B]

  def merged[B1 >: B](that: HashMap[A, B1])(mergef: MergeFunction[A, B1]): HashMap[A, B1]

  // with
  type MergeFunction[A1, B1] = ((A1, B1), (A1, B1)) => (A1, B1)

curoli · March 22, 2017, 5:40pm

Hello,

I once did something like this (not sure how it performs ):

val mergeValues : (V, V) => V = …
val map1 : Map[K, V] = …
val map2 : Map[K, V] = …
val jointKeys = map1.keys.intersect(map2.keys)
val overlap = jointKeys.map(key => (key, mergeValues(map1(key),
map2(key))).toMap
val merged = overlap ++ (map1 – jointKeys) ++ (map2 – jointKeys)

 Best, Oliver