Performance: Inheritance vs. pattern matching

Hi,

recently I stumbled upon some unexpected behaviour when accessing a value either via an inherited method or via pattern matching.

I have a number of case classes that implement a (zero-argument) method from a trait, like this:

trait MixedIn {
   def value: Int
}

//simplified
case class Variant1(override val value: Int) extends MixedIn
case class Variant2(override val value: Int) extends MixedIn
case class Variant3(override val value: Int) extends MixedIn
case class Variant4(override val value: Int) extends MixedIn
case class Variant5(override val value: Int) extends MixedIn

I then wrote a function that accesses the value via pattern matching:

def value(mixedIn: MixedIn): Int = {
   mixedIn match {
      case Variant1(value) => value
      case Variant2(value) => value
      case Variant3(value) => value
      case Variant4(value) => value
      case Variant5(value) => value
   }
}

My expectation was that calling v.value for some v: MixedIn should be faster than calling value(v). I wrote microbenchmarks over 1000 uniformly distributed values:

//List of 1000 random instances of MixedIn, distributed uniformly among the above case classes.
val testData: List[MixedIn]

val valueFunction: MixedIn => Int = _.value

@Benchmark
def directly(): List[Int] = testData.map(valueFunction)

@Benchmark
def indirectly(): List[Int] = testData.map(value)
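
For completeness, the surrounding JMH harness looks roughly like this (a simplified sketch: the class name, the fixed seed and the Random-based generation are placeholders for illustration, not my exact setup):

import java.util.concurrent.TimeUnit
import scala.util.Random

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
class MixedInBenchmark {

   // MixedIn, the VariantN classes and the value function from above are
   // assumed to be in scope.

   // 1000 instances, spread uniformly over the five variants.
   val testData: List[MixedIn] = {
      val rng = new Random(42)
      List.fill(1000) {
         rng.nextInt(5) match {
            case 0 => Variant1(rng.nextInt())
            case 1 => Variant2(rng.nextInt())
            case 2 => Variant3(rng.nextInt())
            case 3 => Variant4(rng.nextInt())
            case _ => Variant5(rng.nextInt())
         }
      }
   }

   val valueFunction: MixedIn => Int = _.value

   @Benchmark
   def directly(): List[Int] = testData.map(valueFunction)

   @Benchmark
   def indirectly(): List[Int] = testData.map(value)
}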

To my surprise, the directly benchmark takes about three times as long as indirectly.
While tuning the overall setup, I made the following additional observations.

  1. There is no difference between using def and val in the trait.
  2. The number of VariantX classes influences performance: with only one or two variants both benchmarks take essentially the same time; from three variants onwards the directly benchmark becomes slower.
  3. The measurements are the same when I replace the case classes with plain classes and hand-written apply/unapply methods (sketched below).
  4. I used Mode.AverageTime (time per operation) for the actual measurements. Mode.Throughput (operations per unit time) showed corresponding results: the directly benchmark completes roughly one third as many operations as the indirectly benchmark.
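
To illustrate point 3, the manual replacement is a plain class with a hand-written companion, sketched here for one variant (the others are analogous):

// Plain class plus hand-written companion, standing in for the case class.
class Variant1(val value: Int) extends MixedIn
object Variant1 {
   def apply(value: Int): Variant1 = new Variant1(value)
   def unapply(v: Variant1): Option[Int] = Some(v.value)
}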

Can anyone shed some light on why this might be the case? Is there something obvious that I am missing?

Best regards

Nikita

What happens when you replace Int with String? I ask because there might be some difference in the amount of boxing and unboxing that’s happening. I’m not sure why that would be though.

Also, are you compiling with the optimizer enabled or disabled? Which Scala version?

Thanks for the follow-up.

  1. Replacing Int with String slightly improves the time for directly (about 10%, which in this particular case is about 2µs). The indirectly benchmark is unaffected by this change.
  2. I use Scala 2.12 and the benchmarks are executed without optimisations. I will try with optimisations enabled.

You can enable as many optimisations as possible with -opt:l:inline -opt-inline-from:** (see https://www.lightbend.com/blog/scala-inliner-optimizer for further reference). I hear that 2.13's optimizer should be even better.
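
If you are using sbt, something like this in build.sbt should do it (assuming an sbt build; adjust to your setup):

// Enable the Scala 2.12 inliner and allow inlining from any source.
scalacOptions ++= Seq(
   "-opt:l:inline",
   "-opt-inline-from:**"
)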

My best guess right now is that for some reason the JVM is failing to perform some optimizations in one of the two benchmarks, possibly because of the extra megamorphic call site.
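
Roughly speaking, this is the difference the JIT sees (a sketch, not what the compiler literally emits):

// The direct access is one virtual call whose receiver type varies across
// all five variants, so the call site becomes megamorphic and HotSpot will
// typically stop inlining it. The pattern match compiles to a single method
// whose body is essentially a chain of type tests, which the JIT can inline
// and branch-predict.
def directAccess(m: MixedIn): Int = m.value // call site sees 5 receiver types

def indirectAccess(m: MixedIn): Int =       // one target, body is type tests
   if (m.isInstanceOf[Variant1]) m.asInstanceOf[Variant1].value
   else if (m.isInstanceOf[Variant2]) m.asInstanceOf[Variant2].value
   else if (m.isInstanceOf[Variant3]) m.asInstanceOf[Variant3].value
   else if (m.isInstanceOf[Variant4]) m.asInstanceOf[Variant4].value
   else if (m.isInstanceOf[Variant5]) m.asInstanceOf[Variant5].value
   else throw new MatchError(m)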


With the suggested optimisations the results are very much the same. Still, the megamorphic call site explanation appears to be right on the money: it is consistent with the observation that the number of possible variants matters (second additional observation in the OP).

Indeed, if I make the call site monomorphic, I get the picture I was expecting.

  1. I have switched all values in the testData list to instances of Variant5, so that the pattern match encounters a worst-case scenario: every value requires four failed type tests (see the sketch after this list).
  2. With optimisations the directly benchmark takes around 6.5µs, while the indirectly benchmark requires about 8.5µs.
  3. Without optimisations the directly benchmark takes around 6.5µs, while the indirectly benchmark requires about 10.5µs.
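
Concretely, for point 1 the test data now looks like this (a sketch; the seed and generation details are omitted):

import scala.util.Random

// Every element is the same concrete class, so the _.value call site is
// monomorphic, while the pattern match always fails four type tests first.
val testData: List[MixedIn] = List.fill(1000)(Variant5(Random.nextInt()))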

The indirectly benchmark now encounters a worst-case scenario. This is a nice example of a situation where the indirect access is worse, but I am still surprised by how small the difference is.