Supporting Existing Binary Data During Scala 2.13 Transition

After upgrading the service from Scala 2.12 to Scala 2.13, we encountered an issue where the service fails to read existing binary data from disk, resulting in deserialization exceptions. The binary data is of type Array[Byte], and the problem appears to stem from changes in Scala’s internal collection implementations. We are using the Twitter Chill library (which is built on top of Kryo) for binary serialization and deserialization. Has anyone else experienced similar issues, or is there a recommended approach to resolve this incompatibility?

Exception:

aggregated state deserialization failure java.lang.IndexOutOfBoundsException: Index 103 out of bounds for length 4

Example:

// Scala 2.13
import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}
val kryoPool: KryoPool = KryoPool.withByteArrayOutputStream(30, new ScalaKryoInstantiator())

val data = Map("string-key" ->  Map("inner-key" -> "this value will convert to array of bytes".getBytes))

val data: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Array[Byte]]] = Map(string-key -> Map(inner-key -> Array(116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)))


kryoPool.toBytesWithoutClass(data)

val res0: Array[Byte] = Array(1, 1, 39, 1, 3, 1, 115, 116, 114, 105, 110, 103, 45, 107, 101, -7, 26, 1, 1, 39, 1, 3, 1, 105, 110, 110, 101, 114, 45, 107, 101, -7, 96, 1, 42, 116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)

kryoPool.fromBytes(res0, classOf[Map[String, Map[String, Array[Byte]]]])

val res1: Map[String,Map[String,Array[Byte]]] = Map(string-key -> Map(inner-key -> Array(116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)))


// Scala 2.12

import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}

val kryoPool: KryoPool = KryoPool.withByteArrayOutputStream(30, new ScalaKryoInstantiator())

val data = Map("string-key" ->  Map("inner-key" -> "this value will convert to array of bytes".getBytes))

data: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Array[Byte]]] = Map(string-key -> Map(inner-key -> Array(116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)))

kryoPool.toBytesWithoutClass(data)

res0: Array[Byte] = Array(1, 1, 39, 1, 3, 1, 115, 116, 114, 105, 110, 103, 45, 107, 101, -7, 26, 1, 1, 39, 1, 3, 1, 105, 110, 110, 101, 114, 45, 107, 101, -7, 95, 1, 42, 116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)

kryoPool.fromBytes(res0, classOf[Map[String, Map[String, Array[Byte]]]])

res1: Map[String,Map[String,Array[Byte]]] = Map(string-key -> Map(inner-key -> Array(116, 104, 105, 115, 32, 118, 97, 108, 117, 101, 32, 119, 105, 108, 108, 32, 99, 111, 110, 118, 101, 114, 116, 32, 116, 111, 32, 97, 114, 114, 97, 121, 32, 111, 102, 32, 98, 121, 116, 101, 115)))

Difference: the two payloads are identical except for the byte at index 32, which is 96 under Scala 2.13 and 95 under Scala 2.12 (presumably a registered serializer id).

Is there a way to ensure compatibility with existing data after upgrading to Scala 2.13?

That use case is not supported; Scala does not maintain serialization compatibility between 2.12 and 2.13.

We do attempt to maintain serialization compatibility within 2.13, mostly for collections: they serialize via a proxy, and we have a SerializationStabilityTest. But we don’t have a rigorous way to guarantee serialization stability.
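For context on what "serialize via a proxy" means, here is a toy sketch of the serialization-proxy pattern using plain Java serialization. The names IntBag and IntBagProxy are invented for illustration; this is not the standard library's actual code, just the shape of the idea: the real object is never written to the stream, a small proxy with a deliberately stable wire format is written instead, and it rebuilds the real object on read.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, Serializable}

// Hypothetical collection-like class whose on-disk shape we want to keep stable.
class IntBag(val elems: List[Int]) extends Serializable {
  // Swap in the proxy just before Java serialization writes this object.
  private def writeReplace(): AnyRef = new IntBagProxy(elems.toArray)
}

// The proxy is the only thing that actually hits the stream, so internal
// changes to IntBag do not change the serialized form.
class IntBagProxy(private val data: Array[Int]) extends Serializable {
  // Rebuild the real object just after the proxy is deserialized.
  private def readResolve(): AnyRef = new IntBag(data.toList)
}

def roundTrip(bag: IntBag): IntBag = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(bag)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[IntBag]
}

val copy = roundTrip(new IntBag(List(1, 2, 3)))
```

Because the proxy's fields can be kept fixed even while the real class evolves, this is what lets a library attempt stability within one major version.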


I’m a bit surprised by this, given the use of Kryo, but something I recall being bitten by is expecting the Scala-specific serializers to be registered, while in fact Kryo was using the default FieldSerializer under the hood for various Scala collection types.

There are a bunch of specialized serializers that unpick the Scala collections, serializing one element at a time; on deserialization, they pack the individually deserialized elements into a freshly instantiated empty collection.
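A generic sketch of that element-at-a-time approach, using only java.io rather than Kryo (this is illustrative, not Chill's actual serializer code): write the size, then each entry in turn; on read, fold the entries back into a freshly built empty immutable Map.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Write the map "unpicked": size first, then each key/value pair.
def writeMap(m: Map[String, Array[Byte]], out: DataOutputStream): Unit = {
  out.writeInt(m.size)
  for ((k, v) <- m) {
    out.writeUTF(k)
    out.writeInt(v.length)
    out.write(v)
  }
}

// Read the entries back and pack them into a freshly built empty collection.
def readMap(in: DataInputStream): Map[String, Array[Byte]] = {
  val n = in.readInt()
  (0 until n).foldLeft(Map.empty[String, Array[Byte]]) { (acc, _) =>
    val k = in.readUTF()
    val v = new Array[Byte](in.readInt())
    in.readFully(v)
    acc + (k -> v)
  }
}

val buf = new ByteArrayOutputStream()
writeMap(Map("inner-key" -> "payload".getBytes("UTF-8")), new DataOutputStream(buf))
val restored = readMap(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
```

The point is that such a serializer never touches the collection's internal fields, which is why it can survive internal collection changes where FieldSerializer cannot.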

I wonder whether FieldSerializer was being picked up?

Anyway, if you are using Kryo with Scala, I would definitely recommend cutting over to altoo-ag/scala-kryo-serialization (kryo-based serializers for Scala) instead of Chill. Last time I looked, Chill was unmaintained, whereas the Altoo library is still actively worked on.

I used Chill to start with and cut over to the Altoo library, and I’m quite happy with it.

In case it’s of interest, some hints on how to cut over can be seen in this diff. I did more follow-up work afterwards, but you’ll get the idea.


Yeah. Kryo is great, but you always have to make sure that you’re properly controlling all of the serialization, especially for collections.

(I wound up needing to enhance the akka-kryo-serialization library some years back, in order to deal correctly with all of the hidden special cases that exist for optimization of small collections. It’s subtle stuff.)

A word of advice to others: that library has since been split into multiple pieces. There is now a Pekko variant of the original akka-kryo-serialization, and scala-kryo-serialization is a smaller dependency of the bigger Pekko piece.

I was caught out by this recently when upgrading!

However, this approach (using Altoo’s scala-kryo-serialization) will not facilitate deserializing the existing binary data into a Scala Map.

Ah, I see - I was being naive in thinking you simply wanted to interchange data between processes running with different Scala versions. You want to push out a production update and read your old data, then.

Up a certain creek, forgot the paddle. Oops.

You could trash your old data and rebuild on the fly, if that’s feasible …
… or (assuming you have some kind of master version of your data format), cut over to the new serializer while remaining on Scala 2.12, saving in the new format in the new way, then do a later cutover to Scala 2.13 once the dust has settled. You would need to do that irrespective of whether you cut over to Altoo; if I’m right, you need to get off FieldSerializer …
… or you could switch things around: register an Altoo serializer with Chill (that’s feasible via the Kryo API), do the read-and-write round trip, then bump your Scala version …
… or read the data while still on Scala 2.12, save it as JSON if you can (did someone say Circe?), then cut over to reading JSON and then to Scala 2.13 …
… or read it while still on Scala 2.12, save it in a database if you can, then cut over to reading the database and then to Scala 2.13 …
… or ask someone to send your data all over again from the big switch-on so you can rebuild everything from scratch.

Good luck.


Good to know, thanks. Querki is currently running on ancient tech, but I have ambitions of bringing it more up to date. (Assuming the split happened after my contributions, which is likely, they’re hopefully in the resulting child projects.)

It occurred to me that I’m not entirely sure that’s what’s going on here; there is (at least) another explanation…

Looking at Chill, there are different serializer registrations for Scala 2.12 and Scala 2.13, which will affect what serializer ids are stored in the serialized (or is it pickled, or even frozen?) data.
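To make the id point concrete, here is a toy model (not Kryo's actual registration mechanism, and the registration orders below are invented): if ids are assigned by registration order and written into the payload, then adding or reordering a single registration between two builds shifts every later id, producing byte differences like the 95-vs-96 seen above.

```scala
// Toy registry: ids are simply assigned by position in the registration order.
final case class Registry(classes: Vector[Class[_]]) {
  def idOf(c: Class[_]): Int = classes.indexOf(c)
}

// Hypothetical 2.12-era registration order...
val reg212 = Registry(Vector(classOf[String], classOf[Map[_, _]], classOf[Array[Byte]]))
// ...and a hypothetical 2.13 order with one extra collection class inserted,
// which shifts the id of every class registered after it.
val reg213 = Registry(Vector(classOf[String], classOf[List[_]], classOf[Map[_, _]], classOf[Array[Byte]]))

val idIn212 = reg212.idOf(classOf[Array[Byte]])
val idIn213 = reg213.idOf(classOf[Array[Byte]])
```

If an id like this is embedded in the stored bytes, data written by one build simply dispatches to the wrong serializer when read by the other.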

WrappedArraySerializer is also subtly different for the two language versions - given the use of toBytesWithoutClass, it’s probably a red herring, but you never know.

My advice is still the same as before, though - you need to either regenerate data or do a two-stage cutover to clean up what’s already there.

One thing I think you’ll be bitten by if you mix Altoo and Chill together is that, IIRC, Altoo is built against a major version upgrade of Kryo compared with Chill. That might have changed, but beware. Maybe the conversion to JSON or a database is the way to go…


Serialization that depends on the language version! That’s really not great.
