Runtime reflection in Scala

What kind of run-time reflection can I do in Scala as a mortal programmer? Is it possible to ask an object what type it is (what class it has) and get types or classes back as values? Can I then make run-time decisions based on the class of the object? It seems to me that pattern matching must be doing some of this. Is that capability something the mortal programmer can use, or is it reserved for the Scala compiler?

The way this works in the Clojure language is that I can ask any value what its class is, and I get back an object whose type is Class and whose value somehow represents the Java class of the object.

clojure-rte.core> (class 123)
java.lang.Long
clojure-rte.core> (class (class 123))
java.lang.Class
clojure-rte.core> (class [1 2 3])
clojure.lang.PersistentVector
clojure-rte.core> (class (class [1 2 3]))
java.lang.Class
clojure-rte.core> 

Is this a special feature that the Clojure language has invented, or is it a feature of every language built on the JVM? In particular, can I do something similar in Scala?

I’m interested in using this, for example, to read unsafe data, such as the contents of CSV files, to figure out information about the types of the contents before launching the Scala functions which will make assumptions about the type of the data.

In Scala:

scala> 123.getClass
val res0: Class[Int] = int

scala> 123.getClass.getClass
val res1: Class[_ <: Class[Int]] = class java.lang.Class

scala> Vector(1, 2, 3).getClass
val res2: Class[_ <: scala.collection.immutable.Vector[Int]] = class scala.collection.immutable.Vector1

scala> Vector(1, 2, 3).getClass.getClass
                                ^
       warning: inferred existential type Class[T] forSome { type T <: Class[T]; type T <: scala.collection.immutable.Vector[Int] }, which cannot be expressed by wildcards, should be enabled
       by making the implicit value scala.language.existentials visible.
       This can be achieved by adding the import clause 'import scala.language.existentials'
       or by setting the compiler option -language:existentials.
       See the Scaladoc for value scala.language.existentials for a discussion
       why the feature should be explicitly enabled.
val res3: Class[T] forSome { type T <: Class[T]; type T <: scala.collection.immutable.Vector[Int] } = class java.lang.Class

Note the discrepancy between compiler-level inferred types and runtime representations for the parameterized types!
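And to answer the "can I make run-time decisions" part directly: yes, type patterns in a match expression are exactly that capability, and at runtime they compile down to JVM instanceof checks. A minimal sketch (the function name is mine):

```scala
// Run-time dispatch on an object's class via type patterns.
// Caveat: type arguments are erased at runtime, so we can check
// "is a Vector" but not "is a Vector[Int]" -- hence Vector[_].
def describe(x: Any): String = x match {
  case i: Int       => s"an Int: $i"
  case s: String    => s"a String of length ${s.length}"
  case v: Vector[_] => s"a Vector with ${v.length} elements"
  case other        => s"something else: ${other.getClass.getName}"
}
```

So `describe(123)` yields `"an Int: 123"` and `describe(Vector(1, 2, 3))` yields `"a Vector with 3 elements"`, without any explicit use of the reflection APIs.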

There are custom Scala capabilities on top of vanilla JVM reflection.

Reflection doesn’t sound like a promising approach to this problem. (It rarely ever does.) How would you want to use reflection to accomplish this?


Reflection, especially runtime reflection, is discouraged in Scala: it is slow, insecure, and unsafe. It is usually the worst way to solve a problem, and when it is the only way to solve your problem, you usually have an XY problem.

For your use case: first, you are mixing the words types & classes, which probably means you do not understand the differences between them. Also, unless you have a very special JVM CSV file which somehow has classes associated with its columns, I do not see how reflection would help.

I understand you want to split your CSV file into value strings, and then send every value string to the Scala compiler to get the value it represents.

For this to even work requires that the syntax of your CSV files is exactly that for Scala literals. I’d be surprised if it is, but I can’t tell for sure, since there are different flavors of CSV.

Assuming this works, it is still massive overhead, since parsing and evaluating literals is only a small fraction of what the Scala compiler is prepared to do. In fact, those reflection methods that simply take an expression and evaluate it will under the hood wrap that expression into a method wrapped into a Scala object so that it becomes a complete Scala file.

You probably want to write your own parser. You can use regexes to decide whether a field is an integer, floating point, boolean or string, and then parse each separately. It is pretty straightforward. Don’t forget to unescape the strings.
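A sketch of that approach, assuming the fields have already been split out of the row (the regexes and the `parseField` name are mine, and string unescaping is left out):

```scala
// Classify a raw CSV field with regexes, then parse whichever branch matched.
val IntPattern    = """[+-]?\d+""".r
val DoublePattern = """[+-]?\d*\.\d+([eE][+-]?\d+)?""".r
val BoolPattern   = """(?i)true|false""".r

def parseField(raw: String): Any = raw.trim match {
  case IntPattern()     => raw.trim.toInt
  case DoublePattern(_) => raw.trim.toDouble
  case BoolPattern()    => raw.trim.toBoolean
  case other            => other // fall back to the string itself (unescaping omitted)
}
```

Note that a `Regex` pattern in a `match` must consume the entire string, which is exactly what we want here for classifying whole fields.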


@jimka Sometimes I just go with ‘‘dynamic’’ tables typed as Vector[Vector[Any]] or Vector[Vector[String]] and match and check as needed at runtime, but if you want more info in the types, then it’s not that much work (although somewhat “boilerplaty”) to implement from scratch something like @curoli proposes.

Below a version for “mortals” (although with a simple type class to make it a bit more general if you want to have different types of tables):

trait RowParser[R] {
  def parseRow(csvRow: String): R
}

case class MyRow(col1: Option[Int], col2: Option[String], col3: Option[Double])

implicit object myRowParser extends RowParser[MyRow] {
  override def parseRow(csvRow: String): MyRow = {
    val elems: Array[String] = csvRow.split(',')
    val q = "\""
    def quotedString(s: String): Option[String] = 
      if (s.length >= 2 && s.startsWith(q) && s.endsWith(q)) 
        Some(s.drop(1).dropRight(1))
      else None
    MyRow(
      elems.lift(0).flatMap(s => util.Try(s.toInt).toOption), 
      elems.lift(1).flatMap(quotedString), 
      elems.lift(2).flatMap(s => util.Try(s.toDouble).toOption)
    )
  }
}

object FromCsv {
  def parse[R: RowParser](csv: String): Vector[R] = 
    csv.split('\n').toVector.map(implicitly[RowParser[R]].parseRow)
}

And then:

scala> val csv = s"""1,"hello",1.0\n2,"world",2.0\n"hi",3.0"""
val csv: String =
1,"hello",1.0
2,"world",2.0
"hi",3.0

scala> val table = FromCsv.parse(csv)
val table: Vector[MyRow] = Vector(MyRow(Some(1),Some(hello),Some(1.0)),
   MyRow(Some(2),Some(world),Some(2.0)), MyRow(None,None,None))

In Scala 3 there will be more support for getting around static typing to some extent:
https://dotty.epfl.ch/docs/reference/changed-features/structural-types.html

Check out https://nrinaudo.github.io/kantan.csv/ , typically we would do https://nrinaudo.github.io/kantan.csv/rows_as_case_classes.html

Perhaps my CSV comment was a red herring. My goal is not really to parse a particular CSV. Rather, my goal is to better understand how a strictly typed language handles generic data.

In a dynamically typed language, a function may accept unsafe input data and validate it before proceeding. This validation can sometimes be expressed in terms of the type system, in which case the validation can be fast and efficient, but sometimes the data needs to be validated by smarter predicates.

In some dynamically typed languages (e.g., Common Lisp and Clojure) these two cases generalize. I.e., the programmer can extend the type system to accommodate arbitrary predicates including recursively defined predicates on collections, even tests for regularity (regular in the sense of regular languages).

Of course once the user has extended the type system in this manner, the compiler can no longer reason about your code and must assume worst case scenarios.

More specifically, what I was wondering is whether it is possible to have a collection of Any in Scala, and perform run-time type checks and other predicate-based checks in a way similar to a dynamic language. I.e., look at the data, decide what format it is in, and then pass it to the well-typed code that handles that scenario. “Whether it is possible” means: is there enough reflective information in objects of type Any to perform such run-time checking?

An example of such a predicate might be: Is this Array[Any] such that it has an odd index for which the object at that index is a List which contains at least one negative Double? Or is that level of reflection simply impossible in Scala?

You could write this predicate just using #exists() and pattern matching (which at runtime resolves to JVM instanceof reflection), no?
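For instance, a sketch of exactly that predicate (the function name is mine):

```scala
// The predicate from the question: does this Array[Any] have an odd
// index holding a List that contains at least one negative Double?
// The type patterns below are ordinary JVM instanceof checks at runtime.
def hasNegativeDoubleAtOddIndex(xs: Array[Any]): Boolean =
  xs.indices.exists { i =>
    i % 2 == 1 && (xs(i) match {
      case list: List[_] =>
        list.exists {
          case d: Double => d < 0
          case _         => false
        }
      case _ => false
    })
  }
```

The one thing we cannot check is the erased type argument (a `List[_]` is a `List[_]`), but since we inspect each element anyway, erasure does not get in the way here.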

The question is how you ended up with Array[Any] in the first place. Any conversion to proper types should be done as early as possible, i.e. when raw data enters the system. And parts of this must have been done already - we may not know the types, but the array elements do have types, and some code must have created them accordingly. So why not get it right at the first take?

Yes indeed, when the data enters the system, it seems to me there’s a need at that point to treat the data as you would in a dynamically typed language.

I recall one nightmarish experience I had trying to parse JSON in Scala. It finally worked, but I don’t want to touch the code again. It was really painful to do. There were several libraries to pick from, they all had different abstractions the user (me) had to learn, and when it didn’t work (as expected the first time) it was horribly difficult to debug.

The main problem (as I recall) there was one piece of information in the JSON whose value indicated whether another piece of data was an Array of Array of Array of Double or simply an Array of Array of Double.

In a dynamically typed language, I’d have just treated the data as hostile, and written code to traverse and run-time type check, to extract the information I needed and build the data structures for my application.

I imagine (maybe I’m wrong) that in Scala (and other statically typed languages) you have hundreds of different incompatible abstractions for each different format: JSON, XML, CSV, Excel, s-expression, Foo, Bar, Baz… when it seems what you really need is just a way to examine collections of Any during the parsing/verification phase, until you reach the point where you can tell the compiler precisely the application-specific types of your data.

Without having much experience with static typing, I’m not sure if my impression is completely wrong.

@BalmungSan, perhaps my CSV example was a poor example. Sorry about that.

Take a look for example at kantan.csv. In the motivation section, the author explains that he eventually developed yet another abstraction on top of CSV because working with the raw data was so difficult.

CSV is an unreasonably popular data exchange format. It suffers from poor (or at the very least late) standardisation, and is often a nightmare to work with when it contains more complex data than just lists of numerical values.
I started writing kantan.csv when I realised I was spending more time dealing with the data container than the data itself. My goal is to abstract CSV away as much as possible and allow developers to describe their data and where it comes from, and then just work with it.

The tactic of nrinaudo (as brilliant as he is) seems to be: let’s limit the types of data we can work with, rather than making our language better able to handle hostile data.

I realize my point of view may be naïve. And I know it is nice to work with data which obeys rules that match your type system. But from my history of using dynamic languages on hostile data whose format may be poorly documented, or even change from version to version, it seems there is a need for tools to handle it without abandoning static typing completely. Dynamically typed languages work well on unstructured data, but once the program has structured it, it would be nice to have a stricter type system. Statically typed languages work well on well-structured data, but it would be nice to have a more dynamic type system at times as well.

That’s a valid criticism. We lack enough words without baggage. I’m guilty of sometimes using the word type in its generic sense. I.e., a type is a set of values. Such a set may be the set of values designated by a type name Seq[List[Map[Int,String]]], or it might be the set of odd integers which form Pythagorean triples, or it might be just the set of values "hello", 42, and List('x','y','z').

I don’t completely agree here. When data enters the system, usually all of that data is strings, or even byte arrays. Parsing a String or an Array[Byte] does not require runtime reflection.

How are dynamically and statically typed languages different in this respect?
In Scala you might have a json parsing library which exposes the following ADT to the user

sealed trait JsValue
case object JsNull extends JsValue
case class JsString(value: String) extends JsValue
case class JsNum(value: Double) extends JsValue
case class JsBool(value: Boolean) extends JsValue
case class JsObject(value: Map[String, JsValue]) extends JsValue
case class JsArray(value: ArraySeq[JsValue]) extends JsValue

As a user you would traverse and pattern match over the data structure the json library gives you to build your application specific data structures.
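For instance, extracting all the numbers from a parsed document is a short recursive match. A sketch (it repeats a cut-down copy of the ADT above, with List for JsArray, only so that the snippet compiles on its own):

```scala
sealed trait JsValue
case object JsNull extends JsValue
case class JsString(value: String) extends JsValue
case class JsNum(value: Double) extends JsValue
case class JsBool(value: Boolean) extends JsValue
case class JsObject(value: Map[String, JsValue]) extends JsValue
case class JsArray(value: List[JsValue]) extends JsValue

// Depth-first traversal: collect every number anywhere in the tree.
def allNumbers(v: JsValue): List[Double] = v match {
  case JsNum(d)    => List(d)
  case JsObject(m) => m.values.toList.flatMap(allNumbers)
  case JsArray(vs) => vs.flatMap(allNumbers)
  case _           => Nil
}
```

The sealed trait means the compiler will warn you if a match forgets a case, which is one advantage over traversing untyped nested maps.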


I don’t think I’d do anything conceptually different in a dynamically typed language.

  • There’s a stream of chars or bytes entering my system.
  • I require this stream to be in some (semi-)structured format: JSON, CSV, XML,…, so I attempt to parse according to that format. This will usually result in a format-level representation like a JsonValue tree with spray-json. If this fails, the data is garbage.
  • Now I expect this structure to further adhere to some specific domain schema. Usually this schema will be given by protocol, and I’ll just try to convert to my code representation of this schema. This can happen through explicit traversal of the format-level structure, by using a type system guided mapping mechanism (like JsonFormat with spray-json) or a mix of both. If this fails, the data is garbage.
  • There are cases when there is no fixed schema, although usually this should be avoided. Then I’ll just skip the schema parsing step and continue with the format-level representation or convert to a dedicated representation that’s still somewhat “amorphous”, but better suited for the purpose of my app than the format-level one.

I can’t really relate to this view. I’ve been using play-json, spray-json and circe, and the underlying concepts and mechanisms felt pretty similar between them. And of course one JSON library should usually be enough. :slight_smile:

Sure, that’s a way to encode (subtyping) polymorphism in JSON. How would you handle this any different in Lisp? At some point you’ll surely need to distinguish between the two flavors in order to process the data…?

Either this is exactly what I’ve described above for Scala JSON, CSV,… libs, or I’m missing the point. How/what would you “runtime type check” before you even have a format-level representation?

Yes, because they have different structures, different sets of supported “primitive” data types,… I don’t see any advantage of having List[Any] as a “lexer” result instead of a JsonValue tree - to the contrary.

This point should be as early as possible, so you don’t need an Any representation in between. And if you really want this, converting e.g. a JsonValue tree to a nested List[Any] should only be a few lines of code.


I think you’re being too reductive here – there’s an enormous amount of room between “everything is very precisely typed” and “give up and just do everything with no types at all”. (That is, just use Any.)

To the core of your points:

It doesn’t have to be that bad, but you do need to think about the structure of your data in a finer-grained way than that. I mean, JSON and CSV are structurally wildly different, so yes – you can’t easily use the same abstraction for both. But CSV is structurally like any other tabular format, and JSON is like any other property bag – you can build valuable constructs that cover those categories pretty broadly.

To illustrate that, since you brought up JSON, consider the weePickle library, which Rally (my employer) has been working on in recent months. This is a JSON library first and foremost – but it also handles YAML, SMILE, MsgPack, and potentially any other data format that is “property-bag-shaped” in the same way that JSON is, and interoperates with most of the other major JSON libraries. It provides you with high-efficiency transformation directly between any two supported types (including strong Scala types), or lets you turn things into an easily-introspectable unstructured AST if you want to think about the data in a less-structured way.

(Credit where credit is due: this is a shaded and heavily modified fork of uPickle, using Jackson under the hood and taking contributions from jsoniter to improve efficiency. It’s all about combining every best idea we can find. It’s still a work in progress, but by now, it’s getting pretty sweet, and we’re starting to use it heavily in production.)

Anyway: the point is, this isn’t a simple either/or. You have to think about your needs, and how those needs work in terms of types. If you get those types right, you can often be moderately general and code with confidence. That’s usually much better than throwing types away entirely…

Here is an example, a small excerpt of the json file I needed to parse.

{"type":"FeatureCollection",
 "features":[
             {"type":"Feature","id":"AFG",
              "properties":{"name":"Afghanistan"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[61.210817,35.650072],[62.230651,35.270664],[60.803193,34.404102],[61.210817,35.650072]]]}},
             {"type":"Feature","id":"AGO",
              "properties":{"name":"Angola"},
              "geometry":{"type":"MultiPolygon",
                          "coordinates":[[[[16.326528,-5.87747],[16.57318,-6.622645]]],
                                         [[[12.436688,-5.684304],[12.182337,-5.789931],[11.914963,-5.037987],[12.436688,-5.684304]]]]}},
             {"type":"Feature","id":"ALB",
              "properties":{"name":"Albania"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[20.590247,41.855404],[20.463175,41.515089],[20.605182,41.086226],[20.590247,41.855404]]]}},
             {"type":"Feature","id":"ARE",
              "properties":{"name":"United Arab Emirates"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[51.579519,24.245497],[51.757441,24.294073],[51.579519,24.245497]]]}}
            ]}

If I parse this in Clojure using (json/read-str (slurp "/tmp/small-example.json")), then I get not a JSON object containing JSON types which I need to study and understand; rather, I get a Map whose values are maps, or arrays, or numbers, or strings, or lists, etc., all the way down.

clojure-rte.core> (json/read-str (slurp "/tmp/small-example.json"))
{"type" "FeatureCollection",
 "features"
 [{"type" "Feature",
   "id" "AFG",
   "properties" {"name" "Afghanistan"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[61.210817 35.650072]
      [62.230651 35.270664]
      [60.803193 34.404102]
      [61.210817 35.650072]]]}}
  {"type" "Feature",
   "id" "AGO",
   "properties" {"name" "Angola"},
   "geometry"
   {"type" "MultiPolygon",
    "coordinates"
    [[[[16.326528 -5.87747] [16.57318 -6.622645]]]
     [[[12.436688 -5.684304]
       [12.182337 -5.789931]
       [11.914963 -5.037987]
       [12.436688 -5.684304]]]]}}
  {"type" "Feature",
   "id" "ALB",
   "properties" {"name" "Albania"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[20.590247 41.855404]
      [20.463175 41.515089]
      [20.605182 41.086226]
      [20.590247 41.855404]]]}}
  {"type" "Feature",
   "id" "ARE",
   "properties" {"name" "United Arab Emirates"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[51.579519 24.245497]
      [51.757441 24.294073]
      [51.579519 24.245497]]]}}]}

I don’t need to be an expert at JSON parsing libraries to find the coordinates and figure out whether each is an array of array of array of array, or simply an array of array of array.

What I wanted to do with this was build a map of country name to list of perimeters describing its border. Once that mapping was created, then I was very happy to have well typed functions to manipulate them. But extracting that data in Scala was really difficult, especially for a non-expert.

object Geo {

  type Perimeter = List[Location]
  def buildBorders(epsilon:Double):Map[String,List[Perimeter]] = {
    import java.io.InputStream

    import scala.io.Source

    val t:InputStream = getClass.getResourceAsStream("/countries.geo.json")
    val jsonString: String = Source.createBufferedSource(t).getLines.fold("")(_ ++ _)
    buildBorders(jsonString,epsilon)
  }

  def buildBorders(jsonString:String,epsilon:Double):Map[String,List[Perimeter]] = {
    import io.circe._
    import io.circe.parser._
    val json = parse(jsonString).getOrElse(Json.Null)
    def extractPolygons(polygons: List[List[List[Double]]]):List[Perimeter] = {
      polygons.map { linearRing =>
        val perimeter: Perimeter = linearRing.map { xy =>
          assert(xy.length == 2)
          // xy is (longitude, latitude), but Location takes (latitude, longitude)
          val long :: lat :: _ = xy
          Location(lat, long)
        }
        perimeter
      }
    }

    case class Feature(name:String, perimeters:List[Perimeter])

    implicit val memberDecoder: Decoder[Feature] =
      (hCursor: HCursor) => {
        for {
          name <- hCursor.downField("properties").downField("name").as[String]
          geometryType <- hCursor.downField("geometry").downField("type").as[String]
          coords = hCursor.downField("geometry").downField("coordinates")
          perimeters <- geometryType match {
            case "Polygon" => coords.as[List[List[List[Double]]]].map(extractPolygons)
            case "MultiPolygon" => coords.as[List[List[List[List[Double]]]]].map(_.flatMap(extractPolygons))
            case _ => sys.error("not handled type=" + geometryType)
          }
        } yield Feature(name,perimeters)
      }

    val features: Option[Json] = json.hcursor.downField("features").focus

    features match {
      case None => sys.error("cannot find members in the json")
      case Some(features) => {
        val maybeFeatureList = features.hcursor.as[List[Feature]]
        maybeFeatureList match {
          case Right(features) => features.map{f:Feature => (f.name, f.perimeters)}.toMap
          case Left(error) => sys.error(error.getMessage)
        }
      }
    }
  }
}

Truthfully, I was tempted to write a one-off Clojure program which would read the JSON and write out a syntactically correct .scala file which I could then compile.

I also wondered why I couldn’t write the Clojure function and export it as a jar file to call directly from Scala. That’s something I’d still love to learn to do, but it is currently beyond my expertise.

I’m really not seeing the difference as all that dramatic. I can’t speak to Circe, but if you really want to look at it in that sort of unstructured way, most of the JSON libraries (certainly uPickle, play-json, and weePickle, which I’ve worked with most) provide ASTs that are basically exactly this. I mean, yes, the names are different, but a JsObject is basically a Map, a JsNumber is a version of a number, a JsString is a String, etc. How is that meaningfully different from what you’re asking for?

It’s difficult for someone who already understands something to see why it is confusing for someone who doesn’t. As a professor I try very hard to put myself in the shoes of the student and explain things, including the things that are potentially confusing. It’s hard to imagine which things will be confusing for someone seeing them for the first time.

In the case of the Scala code above, it is really a lot of code involving advanced concepts (cursors, walking up and down, implicit decoders, lots of new type names) to do something simple. I recall spending several days on it, because it was failing to parse the file and giving me no usable feedback about what I was doing wrong.

The model presented by the dynamically typed approach is to give the user a parsed JSON structure expressed completely in terms of basic types: arrays, lists, maps, numbers, strings. Such an object is printable and traversable with the language features one learned during week one.

Please don’t misunderstand. I love the language. Scala allows me to do fun things in a cool way. But I believe there are cases where the language gets in the way and makes easy problems really difficult.