# Which is faster: map or for?

Hello, I am newish to Scala and I have a question about the implementation of some code. In every programming language, performance is affected by how the user writes the code. In certain languages, some approaches are faster than others. I realize this is likely a result of the library (Geotrellis) I'm using, but I thought I'd ask about general guidelines or best practices.

The code is written both with `mapValues` and with a for loop.
The for loop executes twice as fast.

MapValues

```scala
val multiGeom = multiPolygons.mapValues(x => x.geom)
val multiHistogram = multiGeom.mapValues(x => rasterTileLayerRDD.polygonalHistogram(x))
val multiPolyStats = multiHistogram.mapValues(x => x.statistics.toList)
for (k <- multiPolyStats.keys) {
  println(k, multiPolyStats(k))
}
```

For

```scala
val theMultiPolygonsKeys = multiPolygons.keys.toList
for (i <- 0 to theMultiPolygonsKeys.length - 1) {
  var geom = multiPolygons.get(theMultiPolygonsKeys(i).toString).get.geom
  var histogram = rasterTileLayerRDD.polygonalHistogram(geom)
  var theStats = histogram.statistics

  println(theMultiPolygonsKeys(i).toString, theStats)
}
```

The answer’s not nearly that simple.

First, it’s important to keep in mind that the `mapValues` function has essentially nothing to do with `map`, which is a much more general function that is implemented on many data types. `mapValues` is much more complex and much more specific.

And `for` actually is a complete fiction: it’s “syntax sugar”, a syntactic structure that the compiler turns into calls to some combination of `map`, `flatMap`, `foreach` and `withFilter` under the hood. So `for` almost can’t be slower than `map` – it’s typically made of calls to `map`. (But it depends on the details.)
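As a small self-contained sketch (plain Lists here, not the GeoTrellis types from the question), the desugaring looks like this:

```scala
val xs = List(1, 2, 3)

// A for-comprehension with `yield` desugars to a call to `map`:
val doubled = for (x <- xs) yield x * 2
val sameThing = xs.map(x => x * 2)

// Without `yield`, it desugars to `foreach` instead:
for (x <- xs) println(x) // equivalent to xs.foreach(println)
```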

So it all depends on the exact data structures and what you are doing. `for` usually compiles to a call to `map` – but in your case it isn't doing so (because you aren't using the `yield` keyword, it compiles to `foreach` instead), and `mapValues` is very specialized.

Also, I have a suspicion that you’re not testing with large enough numbers. Using `.toList` to create `theMultiPolygonsKeys`, and then calling `theMultiPolygonsKeys(i)` over and over like that, is just about the slowest possible way to do this. It’s super-fast on short Lists, but as your list gets longer and longer, that’s a seriously O(N**2) operation. Every call to `theMultiPolygonsKeys(i)` is starting from the beginning of the list, and walking `i` steps – very quick for 5 items, incredibly slow for a million.
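To illustrate with a plain List (hypothetical values, not the question's data): indexed access on a linked `List` walks from the head every time, while a direct traversal – or an indexed structure like `Vector` – does not.

```scala
val xs = List.tabulate(1000)(i => i * 2)

// xs(i) is O(i) on a linked List, so this loop is O(N^2) overall:
var sumIndexed = 0L
for (i <- 0 until xs.length) sumIndexed += xs(i)

// A direct traversal is a single O(N) pass:
var sumDirect = 0L
for (x <- xs) sumDirect += x

// Vector supports effectively constant-time indexing if you really need indexes:
val vec = xs.toVector
val sumVector = (0 until vec.length).map(i => vec(i).toLong).sum
```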

But also note, almost nobody writes Scala code like this. This way of using `for`, with indexes for a loop, is common in many languages but fairly rare in Scala, and pretty much never used with Lists. You’d instead say:

```scala
for (key <- multiPolygons.keys.toList) {
  val geom = multiPolygons.get(key.toString).get.geom
  val histogram = rasterTileLayerRDD.polygonalHistogram(geom)
  val theStats = histogram.statistics

  println(key.toString, theStats)
}
```

The two primary differences are:

• You rarely work with indexes – you just assign the desired value directly in your `for` clause.
• You usually should favor `val` unless there is a specific reason to use `var`. (Needing `var` is actually a little unusual.)
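A minimal illustration of the `val`/`var` point:

```scala
// A var can be reassigned; mutable state is occasionally needed:
var counter = 0
counter += 1

// A val is fixed once assigned, which makes code easier to reason about:
val total = (1 to 10).sum
// total = 0   // would not compile: reassignment to val
```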

Hope this helps. Getting a realistic sense for how fast different operations are is really important, but the honest answer is usually going to be, “It depends on the details”. That’s why there are so many different data structures – each is better at some things, worse at others.


Another thing to notice is that your `mapValues` solution calls `mapValues` three times in a row. Each call iterates over the collection again to do one more operation and build a new collection.
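As a sketch with a plain `Map` standing in for the GeoTrellis types, the three chained calls can be fused into one:

```scala
val m = Map("a" -> 1, "b" -> 2)

// Three separate mapValues steps (on Scala 2.13+, mapValues is a lazy
// view, so this stacks three transformations applied on every access):
val threeSteps = m.mapValues(_ + 1).mapValues(_ * 2).mapValues(_.toString)

// One step doing all the work in a single pass per value:
val oneStep = m.mapValues(v => ((v + 1) * 2).toString)
```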

Thanks, this has been very helpful. I imagined there was more to this. I will have to look at the library more carefully, because the performance difference remains quite large: the “for” version is 2x faster than the `mapValues` version.

You bring up the size of the List, but in this case it refers to a number of features – specifically, every state, county, or census tract in the United States. So the length of the list grows from 50 features to about 72,000, which is not really big. However, in the code I provided, the steps above are lazily evaluated until the print statement, so I’m confused about why the performance differs between the two. This has been very helpful, but I’m guessing the performance difference is really the result of the library and not the language.

I especially like the code you provided – it’s clearer than what I had before.

True – I’m mainly thinking in terms of algorithmic complexity, which is usually what folks care about when asking how fast something is. But for any given problem, the details matter a lot.

I think you’re incorrect there – in particular, `theMultiPolygonsKeys(i)` ought to be eager, and that’s where most of the expense is coming from.

Oh, absolutely. The language is pretty well optimized in most ways – there are some subtle gotchas (for example, `lazy val` is surprisingly slow, because it has to do a huge amount under the hood), but the vast majority of performance questions will mainly wind up being about the libraries…