Which is fastest map or for

dahaynes · September 26, 2018, 7:22pm

Hello, I am newish to Scala and I have a question about the implementation of some code. In every programming, language performance is affected by how the user writes the code. In certain languages, some ways are faster when compared to others. I realize that this is likely a result of the library (Geotrellis) I’m using but I’d thought I’d ask about general guidelines or best of practice.

The code is written both in a MapValues and using a For Loop.
The For loop executes twice as fast.

MapValues
val multiGeom = multiPolygons.mapValues(x => x.geom)
val multiHistogram = multiGeom.mapValues(x => rasterTileLayerRDD.polygonalHistogram(x))
val multiPolyStats = multiHistogram.mapValues(x => x.statistics.toList)
for (k <- multiPolyStats.keys) {
println(k, multiPolyStats(k))

For
val theMultiPolygonsKeys = multiPolygons.keys.toList
for (i<-0 to theMultiPolygonsKeys.length-1){

      var geom = multiPolygons.get(theMultiPolygonsKeys(i).toString).get.geom
      var histogram = rasterTileLayerRDD.polygonalHistogram(geom)
      var theStats = histogram.statistics

      println(theMultiPolygonsKeys(i).toString, theStats)

    }

jducoeur · September 26, 2018, 9:12pm

The answer’s not nearly that simple.

First, it’s important to keep in mind that the mapValues function has essentially nothing to do with map, which is a much more general function that is implemented on many data types. mapValues is much more complex and much more specific.

And for actually is a complete fiction: it’s “syntax sugar”, a syntactic structure that the compiler turns into calls to some combination of map, flatMap, foreach and withFilter under the hood. So for almost can’t be slower than map – it’s typically made of calls to map. (But it depends on the details.)

So it all depends on the exact data structures and what you are doing. for usually compiles to call to map – but in your case it isn’t doing so (because you aren’t using the yield keyword), and mapValues is very specialized.

Also, I have a suspicion that you’re not testing with large enough numbers. Using .toList to create theMultiPolygonKeys, and then calling theMultiPolygonKeys(i) over and over like that, is just about the slowest possible way to do this. It’s super-fast on short Lists, but as your list gets longer and longer, that’s a seriously O(N**2) operation. Every call to theMultiPolygonKeys(i) is starting from the beginning of the list, and walking i steps – very quick for 5 items, incredibly slow for a million.

But also note, almost nobody writes Scala code like this. This way of using for, with indexes for a loop, is common in many languages but fairly rare in Scala, and pretty much never used with Lists. You’d instead say:

for (key <- multiPolygons.keys.toList) {
  val geom = multiPolygons.get(key.toString).get.geom
  val histogram = rasterTileLayerRDD.polygonalHistogram(geom)
  val theStats = histogram.statistics

  println(theMultiPolygonsKeys(i).toString, theStats)
}

The two primary differences are:

You rarely work with indexes – you just assign the desired value directly in your for clause.
You usually should favor val unless there is a specific reason to use var. (Which is actually a little unusual.)

Hope this helps. Getting a realistic sense for how fast different operations are is really important, but the honest answer is usually going to be, “It depends on the details”. That’s why there are so many different data structures – each is better at some things, worse at others.

batman · September 27, 2018, 2:16am

Another thing to notice is that your mapValues solution is calling map thrice on subsequent lists. You are iterating again to do some op and create a new collection.

dahaynes · October 2, 2018, 5:14pm

Thanks this has been very helpful. I imagined that there was more to this and your answer has been very helpful. I will have to look at the library more carefully, because the improvement differences remain quite large. The “for” method is 2x faster than “MapValues”.

You bring up the item of the List, but in this case, it refers to a number of features. Specifically, every state, county, or census tract in the United States. So the length of the list grows from 50 features to 72,000 which is not really big. However, the code I provided the above steps are lazy evaluated until the print statement. So, I’m confused to why a performance changes across either. This has been very helpful, but I’m guessing the performance is really the result of the library and not the language.

I especially like the code you provided its more clear, then what I had before.

jducoeur · October 2, 2018, 6:49pm

True – I’m mainly thinking in terms of algorithmic complexity, which is usually what folks care about when asking how fast something is. But for any given problem, the details matter a lot.

I think you’re incorrect – in particular, theMultiPolygonsKeys(i) ought to be eager, and that’s where most of the expense is coming in.

Oh, absolutely. The language is pretty well optimized in most ways – there are some subtle gotchas (for example, lazy val is surprisingly slow, because it has to do a huge amount under the hood), but the vast majority of performance questions will mainly wind up being about the libraries…