Importing ScalaPB in Databricks


I have a very simple helloworld project that I am trying to run in Databricks. The end goal is to compare Pandas UDF and Scala UDF performance (converting a protobuf message to a JSON string). I have a similar library written in Python, but its performance isn't great.

The project folder (helloworld) contains the following files (file names have been omitted from subdirectories other than src). However, I am having some dependency problems with scalapb in my very first trials. In the test trial I am using only a single message; the message is a b64-encoded string.

|-- project
|   |-- project
|   `-- target
|-- src
|   |-- main
|   |   |-- protobuf
|   |   |   `-- my_proto_definition.proto
|   |   `-- scala
|   |       `-- example
|   |           |-- b64_to_json.scala
|   |           |-- gzip.scala
|   |           `-- sample.scala
|   `-- test
|       `-- scala
|           `-- example
`-- target
    |-- global-logging
    |-- scala-2.12
    |-- streams
    `-- task-temp-directory

The ./build.sbt has the following contents:

import Dependencies._

ThisBuild / scalaVersion     := "2.12.15"
ThisBuild / version          := "0.1.0-SNAPSHOT"
ThisBuild / organization     := "com.example"
ThisBuild / organizationName := "example"

lazy val root = (project in file("."))
  .settings(
    name := "helloworld",
    libraryDependencies += scalaTest % Test
  )

// grpc=false skips generating the gRPC stubs
Compile / PB.targets := Seq(
  scalapb.gen(grpc = false) -> (Compile / sourceManaged).value / "scalapb"
)

libraryDependencies ++= Seq(
  "io.grpc" % "grpc-netty" % scalapb.compiler.Version.grpcJavaVersion,
  "com.thesamet.scalapb" %% "scalapb-runtime-grpc" % scalapb.compiler.Version.scalapbVersion
)

libraryDependencies += "com.thesamet.scalapb" %% "scalapb-json4s" % "0.11.1"
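For the Compile / PB.targets setting to work, the sbt-protoc plugin and the ScalaPB compiler plugin also have to be declared in project/plugins.sbt. A typical setup, following the ScalaPB installation docs (the version numbers here are assumptions, check the current releases), looks like:

```scala
// project/plugins.sbt
addSbtPlugin("com.thesamet" % "sbt-protoc" % "1.0.6")

libraryDependencies += "com.thesamet.scalapb" %% "compilerplugin" % "0.11.11"
```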

The files in ./src/main/scala/example contain:

  • sample.scala - a b64-encoded binary message in a string called sample_str
  • gzip.scala - an object Gzip with a method decompress. It decompresses a gzip-compressed byte array.
  • b64_to_json.scala - Check the code below
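For reference, gzip.scala is only described as "an object Gzip with a method decompress". A minimal sketch of such an object (my reconstruction, not the poster's actual code) using the JDK's GZIPInputStream could look like:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

// Hypothetical reconstruction: decompress a gzip-compressed byte array.
object Gzip {
  def decompress(compressed: Array[Byte]): Array[Byte] = {
    val in  = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n >= 0) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    in.close()
    out.toByteArray
  }
}
```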

b64_to_json.scala code below:

package example

import example.Sample.sample_str
import <hidden>.my_proto_definition.Msg
import scalapb.json4s.JsonFormat
import java.util.Base64
import example.Gzip.decompress

object b64_to_json {
  def to_json(a: String = sample_str): String = {
    // decode the base64 string, drop the first byte, gunzip the rest,
    // then parse the protobuf message and render it as JSON
    val decoded_temp: Array[Byte] = Base64.getDecoder().decode(a)
    val decoded: Array[Byte] = decoded_temp.slice(1, decoded_temp.length)
    val decompressed: Array[Byte] = decompress(decoded)
    val parsed: Msg = Msg.parseFrom(decompressed)
    JsonFormat.toJsonString(parsed)
  }
}
When I run the code with sbt console, I can import the scalapb.json4s.JsonFormat fine. I am able to convert the b64-encoded message into JSON.

But when I install the packaged JAR file (target/scala-2.12/helloworld_2.12-0.1.0-SNAPSHOT.jar) to a cluster running in Databricks, trying to import anything from scalapb gives an error message:

error: object json4s is not a member of package scalapb

What am I getting wrong about how Scala/sbt resolves dependencies? I tried reading the Scala docs, the sbt docs, and the O’Reilly book Programming Scala (2nd ed.), but I just can’t seem to figure out what needs to be added and where. Running sbt dependencyTree outputs a tree that includes com.thesamet.scalapb.

Have you tried adding both the helloworld and scalapb JARs to the Databricks cluster? Or creating a fat JAR, e.g. with sbt-assembly?
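To expand on the fat JAR route: `sbt package` builds a thin JAR with only the project's own classes, so the scalapb classes never reach the cluster. The sbt-assembly plugin bundles the dependencies in as well. A minimal setup (the plugin version is an assumption) is:

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt -- running `sbt assembly` then produces a single JAR under
// target/scala-2.12/ that includes scalapb-runtime, scalapb-json4s, etc.
assembly / assemblyJarName := "helloworld-assembly.jar"
```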



Fat JAR seemed to help to the key problem!

Now I can run this in Databricks:

import example.b64_to_json.to_json

That said, it raises an error that differs depending on whether I run to_json() for the first time or any consecutive time:

First run: NoSuchMethodError: scala.collection.compat.package$.canBuildFromIterableViewMapLike()Lscala/collection/generic/CanBuildFrom;

Consecutive runs: NoClassDefFoundError: Could not initialize class scalapb.json4s.JsonFormat$

I suppose I have to start reading this more carefully? Databricks Runtime 7.0 (Unsupported) | Databricks on AWS


It seems that the fix for this problem was present in the documentation already; I was just blind. The mention of “collection.compat” put me on the right path. Adding shade rules fixed both the NoClassDefFoundError and the NoSuchMethodError.

The ScalaPB docs’ SparkSQL page contained this bit of information:

Spark ships with an old version of Google’s Protocol Buffers runtime that is not compatible with the current version. Therefore, we need to shade our copy of the Protocol Buffer runtime. Spark 3 also ships with an incompatible version of scala-collection-compat. Add the following to your build.sbt:

Source: Using ScalaPB with Spark | ScalaPB
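The shade rules in question, as given on that ScalaPB page (double-check the current docs for the exact form), rename Spark's conflicting copies inside the fat JAR:

```scala
// build.sbt -- rename the bundled protobuf runtime and scala-collection-compat
// so they no longer clash with the versions Spark ships on its classpath
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll,
  ShadeRule.rename("scala.collection.compat.**" -> "scalacompat.@1").inAll
)
```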
