Hi,
I have a very simple helloworld project that I am trying to run in Databricks. The end goal is to compare the performance of a Pandas_UDF and a Scala UDF (converting a protobuf message to a JSON string). I have a similar library written in Python, but its performance isn’t that great.
I am running into dependency problems with scalapb in my very first trials. For the test I am using only a single message, which is a b64-encoded string. The project folder (helloworld) contains the following files; file names in subdirectories other than src have been omitted.
.
|-- project
| |-- project
| `-- target
|-- src
| |-- main
| | |-- protobuf
| | | `-- my_proto_definition.proto
| | `-- scala
| | `-- example
| | |-- b64_to_json.scala
| | |-- gzip.scala
| | `-- sample.scala
`-- test
`-- scala
`-- example
`-- target
|-- global-logging
|-- scala-2.12
|-- streams
`-- task-temp-directory
The ./build.sbt has the following contents:
import Dependencies._
ThisBuild / scalaVersion := "2.12.15"
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / organization := "com.example"
ThisBuild / organizationName := "example"
lazy val root = (project in file("."))
.settings(
name := "helloworld",
libraryDependencies += scalaTest % Test
)
// grpc=false skips generating the gRPC stubs
Compile / PB.targets := Seq(
scalapb.gen(grpc=false) -> (Compile / sourceManaged).value / "scalapb"
)
libraryDependencies ++= Seq(
"io.grpc" % "grpc-netty" % scalapb.compiler.Version.grpcJavaVersion,
"com.thesamet.scalapb" %% "scalapb-runtime-grpc" % scalapb.compiler.Version.scalapbVersion
)
libraryDependencies += "com.thesamet.scalapb" %% "scalapb-json4s" % "0.11.1"
The files in ./src/main/scala/example contain:
- sample.scala - a b64-encoded binary message in a string called sample_str
- gzip.scala - an object Gzip with a method decompress, which decompresses gzip-compressed bytes.
- b64_to_json.scala - Check the code below
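gzip.scala itself is not reproduced in this post. As a rough idea of what such a helper can look like, here is a minimal sketch that assumes it simply wraps java.util.zip.GZIPInputStream; the body is an illustration, not the actual file. In the project it would sit in package example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

// Hypothetical stand-in for gzip.scala; the real implementation is not shown in the post.
object Gzip {
  // Inflates a gzip-compressed byte array and returns the raw bytes.
  def decompress(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    try {
      val out = new ByteArrayOutputStream()
      val buffer = new Array[Byte](4096)
      var n = in.read(buffer)
      // read returns -1 once the end of the compressed stream is reached
      while (n != -1) {
        out.write(buffer, 0, n)
        n = in.read(buffer)
      }
      out.toByteArray
    } finally {
      in.close()
    }
  }
}
```

The manual read loop is used instead of InputStream.readAllBytes so that the sketch also works on Java 8.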
b64_to_json.scala code below:
package example
import example.Sample.sample_str
import <hidden>.my_proto_definition.Msg
import scalapb.json4s.JsonFormat
import java.util.Base64
import example.Gzip.decompress
object b64_to_json {
  // Decodes a b64-encoded, gzip-compressed protobuf message and renders it as JSON.
  def to_json(a: String = sample_str): String = {
    val decoded_temp: Array[Byte] = Base64.getDecoder().decode(a)
    // Drop the first byte before decompressing
    val decoded: Array[Byte] = decoded_temp.slice(1, decoded_temp.length)
    val decompressed: Array[Byte] = decompress(decoded)
    val parsed: Msg = Msg.parseFrom(decompressed)
    // The last expression is the return value
    JsonFormat.toJsonString(parsed)
  }
}
When I run the code with sbt console, I can import scalapb.json4s.JsonFormat fine, and I am able to convert the b64-encoded message into JSON.
But when I install the packaged JAR file (target/scala-2.12/helloworld_2.12-0.1.0-SNAPSHOT.jar) on a cluster running in Databricks, trying to import anything from scalapb gives an error message:
error: object json4s is not a member of package scalapb
What am I misunderstanding about how Scala/sbt resolves dependencies? I tried reading the Scala docs, the sbt docs and the O’Reilly book Programming Scala (2nd ed), but I just can’t seem to figure out what needs to be added and where. Running sbt dependencyTree outputs a tree that includes com.thesamet.scalapb.