Runtime reflection in Scala

Yes indeed, when the data enters the system, it seems to me there’s a need at that point to treat the data as you would in a dynamically typed language.

I recall one nightmarish experience I had trying to parse JSON in Scala. It finally worked, but I don’t want to touch the code again. It was really painful to do. There were several libraries to pick from, they all had different abstractions the user (me) had to learn, and when it didn’t work the first time (as was to be expected), it was horribly difficult to debug.

The main problem (as I recall) was that one piece of information in the JSON indicated whether another piece of data was an Array of Array of Array of Double or simply an Array of Array of Double.

In a dynamically typed language, I’d have just treated the data as hostile, and written code to traverse it and run-time type check it, extracting the information I needed to build the data structures for my application.

I imagine (maybe I’m wrong) that in Scala (and other statically typed languages) you have hundreds of different, incompatible abstractions, one for each format: JSON, XML, CSV, Excel, s-expressions, Foo, Bar, Baz… when it seems what you really need is just a way to examine collections of Any during the parsing/verification phase, until you reach the point where you can tell the compiler precisely the application-specific types of your data.
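
What I have in mind is something along these lines - a minimal sketch, assuming the data has already been parsed into plain Scala lists and maps (maxOption needs Scala 2.13+):

def arrayDepth(data: Any): Int = data match {
  case xs: Seq[_] => 1 + xs.map(arrayDepth).maxOption.getOrElse(0)
  case _          => 0
}

arrayDepth(List(List(List(1.0, 2.0))))       // 3
arrayDepth(List(List(List(List(1.0, 2.0))))) // 4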

Without having much experience with static typing, I’m not sure if my impression is completely wrong.

@BalmungSan, perhaps my CSV example was a poor one. Sorry about that.

Take a look, for example, at kantan.csv. In the motivation section, the author explains that he eventually developed yet another abstraction on top of CSV because working with the raw data was so difficult.

CSV is an unreasonably popular data exchange format. It suffers from poor (or at the very least late) standardisation, and is often a nightmare to work with when it contains more complex data than just lists of numerical values.
I started writing kantan.csv when I realised I was spending more time dealing with the data container than the data itself. My goal is to abstract CSV away as much as possible and allow developers to describe their data and where it comes from, and then just work with it.

The tactic of nrinaudo (as brilliant as he is) seems to be: let’s limit the types of data we can work with, rather than make our language better able to handle hostile data.

I realize my point of view may be naïve. And I know it is nice to work with data which obeys rules that match your type system. But from my history using dynamic languages on hostile data whose format may be poorly documented, or even changing from version to version, it seems there is a need for tools to handle it without abandoning static typing completely. Dynamically typed languages work well on unstructured data, but once the program has structured it, it would be nice to have a stricter type system. Statically typed languages work well on well structured data, but it would be nice to have a more dynamic type system at times as well.

That’s a valid criticism. We lack enough words without baggage. I’m guilty of sometimes using the word type in its generic sense, i.e., a type is a set of values. Such a set may be the set of values designated by a type name Seq[List[Map[Int,String]]], or it might be the set of odd integers which form Pythagorean triples, or it might be just the set of values "hello", 42, and List('x','y','z').

I don’t completely agree here. When data enters the system, usually all of that data is strings, or even byte arrays. Parsing a String or an Array[Byte] does not require runtime reflection.

How are dynamically and statically typed languages different in this respect?
In Scala you might have a JSON parsing library which exposes the following ADT to the user:

import scala.collection.immutable.ArraySeq

// one constructor per JSON form; sealed, so matches can be checked for exhaustiveness
sealed trait JsValue
case object JsNull extends JsValue
case class JsString(value: String) extends JsValue
case class JsNum(value: Double) extends JsValue
case class JsBool(value: Boolean) extends JsValue
case class JsObject(value: Map[String, JsValue]) extends JsValue
case class JsArray(value: ArraySeq[JsValue]) extends JsValue

As a user you would traverse and pattern match over the data structure the json library gives you to build your application specific data structures.
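
For instance, a minimal sketch against the ADT above (stringsAt is just a made-up helper):

def stringsAt(v: JsValue, key: String): List[String] = v match {
  case JsObject(fields) =>
    fields.get(key).collect { case JsString(s) => s }.toList ++
      fields.values.toList.flatMap(stringsAt(_, key))
  case JsArray(items) => items.toList.flatMap(stringsAt(_, key))
  case _              => Nil
}

This is morally the same traversal one would write over nested maps and vectors in Clojure, just with named constructors.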

I don’t think I’d do anything conceptually different in a dynamically typed language.

  • There’s a stream of chars or bytes entering my system.
  • I require this stream to be in some (semi-)structured format: JSON, CSV, XML,…, so I attempt to parse according to that format. This will usually result in a format-level representation like a JsValue tree with spray-json. If this fails, the data is garbage.
  • Now I expect this structure to further adhere to some specific domain schema. Usually this schema will be given by protocol, and I’ll just try to convert to my code representation of it. This can happen through explicit traversal of the format-level structure, through a type-system-guided mapping mechanism (like JsonFormat with spray-json, see the sketch after this list), or a mix of both. If this fails, the data is garbage.
  • There are cases when there is no fixed schema, although usually this should be avoided. Then I’ll just skip the schema parsing step and continue with the format-level representation or convert to a dedicated representation that’s still somewhat “amorphous”, but better suited for the purpose of my app than the format-level one.
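
To make the format-level and schema-level steps concrete, here is a minimal spray-json sketch (User is a made-up schema type):

import spray.json._
import DefaultJsonProtocol._

case class User(name: String, age: Int)
implicit val userFormat: RootJsonFormat[User] = jsonFormat2(User)

// format-level parse (fails on malformed JSON), then schema-level
// conversion (fails if the structure doesn't match User)
val user: User = """{"name":"Ada","age":36}""".parseJson.convertTo[User]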

I can’t really relate to this view. I’ve been using play-json, spray-json and circe, and the underlying concepts and mechanisms felt pretty similar between them. And of course one JSON library should usually be enough. 🙂

Sure, that’s a way to encode (subtyping) polymorphism in JSON. How would you handle this any differently in Lisp? At some point you’ll surely need to distinguish between the two flavors in order to process the data…?

Either this is exactly what I’ve described above for Scala JSON, CSV,… libs, or I’m missing the point. How/what would you “runtime type check” before you even have a format-level representation?

Yes, because they have different structures, different sets of supported “primitive” data types,… I don’t see any advantage of having List[Any] as a “lexer” result instead of a JsValue tree - quite the contrary.

This point should be as early as possible, so you don’t need an Any representation in between. And if you really want this, converting e.g. a JsValue tree to a nested List[Any] should only be a few lines of code.
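
For instance, a minimal sketch for spray-json’s JsValue (returning null for JsNull, like the Clojure reader does):

import spray.json._

def toAny(v: JsValue): Any = v match {
  case JsNull          => null
  case JsBoolean(b)    => b
  case JsNumber(n)     => n // a BigDecimal
  case JsString(s)     => s
  case JsArray(items)  => items.map(toAny).toList
  case JsObject(field) => field.view.mapValues(toAny).toMap
}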

I think you’re being too reductive here – there’s an enormous amount of room between “everything is very precisely typed” and “give up and just do everything with no types at all”. (That is, just use Any.)

To the core of your points:

It doesn’t have to be that bad, but you do need to think about the structure of your data in a finer-grained way than that. I mean, JSON and CSV are structurally wildly different, so yes – you can’t easily use the same abstraction for both. But CSV is structurally like any other tabular format, and JSON is like any other property bag – you can build valuable constructs that cover those categories pretty broadly.
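
For instance, the heart of a “tabular” construct can be tiny; a sketch (made-up names, not any real library’s API):

trait RowDecoder[A] {
  def decode(row: List[String]): Either[String, A]
}

final case class Person(name: String, age: Int)

implicit val personDecoder: RowDecoder[Person] = {
  case name :: age :: _ => age.toIntOption.map(Person(name, _)).toRight(s"bad age: $age")
  case row              => Left(s"not enough fields: $row")
}

Any source of rows - CSV, TSV, a spreadsheet export - can then be decoded through the same small interface.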

To illustrate that, since you brought up JSON, consider the weePickle library, which Rally (my employer) has been working on in recent months. This is a JSON library first and foremost – but it also handles YAML, SMILE, MsgPack, and potentially any other data format that is “property-bag-shaped” in the same way that JSON is, and interoperates with most of the other major JSON libraries. It provides you with high-efficiency transformation directly between any two supported types (including strong Scala types), or lets you turn things into an easily-introspectable unstructured AST if you want to think about the data in a less-structured way.

(Credit where credit is due: this is a shaded and heavily modified fork of uPickle, using Jackson under the hood and taking contributions from jsoniter to improve efficiency. It’s all about combining every best idea we can find. It’s still a work in progress, but by now, it’s getting pretty sweet, and we’re starting to use it heavily in production.)

Anyway: the point is, this isn’t a simple either/or. You have to think about your needs, and how those needs work in terms of types. If you get those types right, you can often be moderately general and code with confidence. That’s usually much better than throwing types away entirely…

Here is an example, a small excerpt of the json file I needed to parse.

{"type":"FeatureCollection",
 "features":[
             {"type":"Feature","id":"AFG",
              "properties":{"name":"Afghanistan"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[61.210817,35.650072],[62.230651,35.270664],[60.803193,34.404102],[61.210817,35.650072]]]}},
             {"type":"Feature","id":"AGO",
              "properties":{"name":"Angola"},
              "geometry":{"type":"MultiPolygon",
                          "coordinates":[[[[16.326528,-5.87747],[16.57318,-6.622645]]],
                                         [[[12.436688,-5.684304],[12.182337,-5.789931],[11.914963,-5.037987],[12.436688,-5.684304]]]]}},
             {"type":"Feature","id":"ALB",
              "properties":{"name":"Albania"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[20.590247,41.855404],[20.463175,41.515089],[20.605182,41.086226],[20.590247,41.855404]]]}},
             {"type":"Feature","id":"ARE",
              "properties":{"name":"United Arab Emirates"},
              "geometry":{"type":"Polygon",
                          "coordinates":[[[51.579519,24.245497],[51.757441,24.294073],[51.579519,24.245497]]]}}
            ]}

If I parse this in clojure using (json/read-str (slurp "/tmp/small-example.json")), then I get, not a JSON object containing JSON types which I need to study and understand, but rather a Map whose values are maps, or arrays, or numbers, or strings, or lists, etc., all the way down.

clojure-rte.core> (json/read-str (slurp "/tmp/small-example.json"))
{"type" "FeatureCollection",
 "features"
 [{"type" "Feature",
   "id" "AFG",
   "properties" {"name" "Afghanistan"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[61.210817 35.650072]
      [62.230651 35.270664]
      [60.803193 34.404102]
      [61.210817 35.650072]]]}}
  {"type" "Feature",
   "id" "AGO",
   "properties" {"name" "Angola"},
   "geometry"
   {"type" "MultiPolygon",
    "coordinates"
    [[[[16.326528 -5.87747] [16.57318 -6.622645]]]
     [[[12.436688 -5.684304]
       [12.182337 -5.789931]
       [11.914963 -5.037987]
       [12.436688 -5.684304]]]]}}
  {"type" "Feature",
   "id" "ALB",
   "properties" {"name" "Albania"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[20.590247 41.855404]
      [20.463175 41.515089]
      [20.605182 41.086226]
      [20.590247 41.855404]]]}}
  {"type" "Feature",
   "id" "ARE",
   "properties" {"name" "United Arab Emirates"},
   "geometry"
   {"type" "Polygon",
    "coordinates"
    [[[51.579519 24.245497]
      [51.757441 24.294073]
      [51.579519 24.245497]]]}}]}

I don’t need to be an expert at JSON parsing libraries to find the coordinates and figure out if each is array of array of array of array, or simply array of array of array.

What I wanted to do with this was build a map of country name to list of perimeters describing its border. Once that mapping was created, then I was very happy to have well typed functions to manipulate them. But extracting that data in Scala was really difficult, especially for a non-expert.

object Geo {

  type Perimeter = List[Location]

  def buildBorders(epsilon: Double): Map[String, List[Perimeter]] = {
    import java.io.InputStream

    import scala.io.Source

    val t: InputStream = getClass.getResourceAsStream("/countries.geo.json")
    val jsonString: String = Source.createBufferedSource(t).getLines().mkString
    buildBorders(jsonString, epsilon)
  }

  def buildBorders(jsonString: String, epsilon: Double): Map[String, List[Perimeter]] = {
    import io.circe._
    import io.circe.parser._

    val json = parse(jsonString).getOrElse(Json.Null)

    // a "Polygon" is a list of linear rings, each ring a list of [lon, lat] pairs
    def extractPolygons(polygons: List[List[List[Double]]]): List[Perimeter] = {
      polygons.map { linearRing =>
        val perimeter: Perimeter = linearRing.map { xy =>
          assert(xy.length == 2)
          // xy is longitude, latitude, but Location takes (latitude, longitude)
          val long :: lat :: _ = xy
          Location(lat, long)
        }
        perimeter
      }
    }

    case class Feature(name: String, perimeters: List[Perimeter])

    implicit val memberDecoder: Decoder[Feature] =
      (hCursor: HCursor) => {
        for {
          name <- hCursor.downField("properties").downField("name").as[String]
          geometryType <- hCursor.downField("geometry").downField("type").as[String]
          coords = hCursor.downField("geometry").downField("coordinates")
          perimeters <- geometryType match {
            case "Polygon" => coords.as[List[List[List[Double]]]].map(extractPolygons)
            case "MultiPolygon" => coords.as[List[List[List[List[Double]]]]].map(_.flatMap(extractPolygons))
            case _ => sys.error("not handled type=" + geometryType)
          }
        } yield Feature(name, perimeters)
      }

    val features: Option[Json] = json.hcursor.downField("features").focus

    features match {
      case None => sys.error("cannot find members in the json")
      case Some(features) =>
        val maybeFeatureList = features.hcursor.as[List[Feature]]
        maybeFeatureList match {
          case Right(features) => features.map { f: Feature => (f.name, f.perimeters) }.toMap
          case Left(error) => sys.error(error.getMessage)
        }
    }
  }
}

Truthfully, I was tempted to write a one-off clojure program which would read the JSON and write out a syntactically correct .scala file which I could then compile.

I also wondered why I couldn’t write the clojure function and export it as a jar file to call directly from Scala. That’s something I’d still love to learn to do, but is currently beyond my expertise.

I’m really not seeing the difference as all that dramatic. I can’t speak to Circe, but if you really want to look at it in that sort of unstructured way, most of the JSON libraries (certainly uPickle, play-json, and weePickle, which I’ve worked with most) provide ASTs that are basically exactly this. I mean, yes, the names are different, but a JsObject is basically a Map, a JsNumber is a version of a number, a JsString is a String, etc. How is that meaningfully different from what you’re asking for?

It’s difficult for someone who understands to understand why it is confusing for someone who doesn’t understand. As a professor I try very hard to put myself in the shoes of the student, and explain things including the things that are potentially confusing. It’s hard to imagine which things will be confusing for someone seeing it for the first time.

In the case of the Scala code above, it is really a lot of code involving advanced concepts (cursors, walking up and down, implicit decoder, lots of new type names) to do something which is simple. I recall spending several days on it, because it was failing to parse the file and giving me no usable feedback about what I was doing wrong.

The model presented by the dynamically typed approach is: give the user a parsed JSON structure entirely in terms of basic types - arrays, lists, maps, numbers, strings. Such an object is printable, and traversable with the language features one learned during week one.

Please don’t misunderstand. I love the language. Scala allows me to do fun things in a cool way. But I believe there are cases where the language gets in the way and makes easy problems really difficult.

First I’d think that e.g. JsValue shouldn’t require a huge amount of studying. Second, having a restricted set of types is a feature. With a JsValue, there’s exactly 8 different forms it can take. A Map[String, Any] can literally contain anything. I’ll have to guess which types are used to encode JSON arrays/numbers/…, I can’t rely on the compiler to tell me when I’ve forgotten to cover one of the cases, and so on.

I think you are conflating two things here. The JsValue structure is completely equivalent to the Map[String, Any] structure, and you could do the same check for array nesting depth in both. The convention of adding polymorphic subtype information as JSON attributes just makes this distinction easier, and it covers more ambiguous cases - imagine a list that can contain both absolute and relative coordinate pairs, both consisting of two numbers.
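
For instance, the nesting-depth check looks essentially the same in both representations; a minimal sketch with spray-json’s JsValue (maxOption is 2.13+):

import spray.json._

def arrayDepth(v: JsValue): Int = v match {
  case JsArray(items) => 1 + items.map(arrayDepth).maxOption.getOrElse(0)
  case _              => 0
}

// 3 for a Polygon's coordinates, 4 for a MultiPolygon's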

That’s the part you have completely omitted on the Clojure side. The equivalent of json/read-str (slurp ...) with spray-json would simply be JsonParser(Source.fromFile("/tmp/small-example.json").mkString). That’s the format-level part, and the resulting structure is fully equivalent. Converting to the domain level (i.e. Map[String,List[Perimeter]]) is the interesting part. (And I think I recall that I had given some suggestions at the time how this could be approached a bit more conveniently.)

And again, if you don’t like JsObject and friends, a conversion to Map[String, Any] should be feasible in a single, recursive pattern-matching function, if I’m not missing anything.

I would not characterize the philosophy of dynamically typed languages as “data is unsafe/hostile and needs to be checked all the time”, but rather as “it’s up to the user to make sure the data has the correct type”.

In many dynamically typed languages, providing an argument of the wrong type will rarely result in a type error and more likely result in unexpected behavior. JavaScript is particularly notorious for that.

In a statically typed language, if you have input of unknown type, the first thing is usually to project it to some expected type and flag it as an error if it is not. Therefore, Scala libraries that deal with querying a database or parsing JSON allow you to define precisely what type you expect and will deliver you an object of that type or an error.
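
For instance, with play-json (a minimal sketch; Point is a made-up type):

import play.api.libs.json._

case class Point(x: Double, y: Double)
implicit val pointReads: Reads[Point] = Json.reads[Point]

Json.parse("""{"x":1.5,"y":2.5}""").validate[Point] // succeeds with Point(1.5, 2.5)
Json.parse("""{"x":1.5}""").validate[Point]         // fails with a JsError pointing at the missing 'y'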

(Context: I’ve been a full-time Scala teacher, and still spend a fair bit of my time tutoring new folks in the language, including completely new programmers as part of ScalaBridge. I get the “put yourself in the student’s shoes” thing.)

Right – my point is, that’s all optional. You need that in order to get the strongest possible types from Circe. You’re saying that you don’t want strong types, which is fine – don’t use them. Instead, use a JSON parser designed to support a less strongly-typed version, like play-json, uPickle or weePickle.

I mean, the total parse code for play-json is:

Json.parse(inputString)

That’s it – and that gives you an AST that is structurally identical to the one you say you’re looking for. Yes, you have to explain that JsObject is a specialized version of Map that specifically means that this is a JSON Map, but that’s not a hard concept, and it helps get across the Scala way of thinking, that Types Are Good.
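
And poking at that AST is about as unceremonious as it gets - a minimal sketch:

import play.api.libs.json._

val ast: JsValue = Json.parse("""{"properties": {"name": "Angola"}}""")
(ast \ "properties" \ "name").as[String] // "Angola"
ast \\ "name"                            // all values found under a "name" key, at any depth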

(The actual functions and types are different for uPickle and weePickle, but otherwise they’re identical to this.)

All the effort is in producing strong types; those are much easier to work with in nearly every business case, but they are entirely optional. If what you want is a simple AST, just use a simple AST.

I should note an important corollary: you’re thinking about this as “this is how Scala does it”. That’s usually wrong – it’s all about how a specific library does it. If you don’t like the way that library works, there tends to be another one that is more like what you want…

I’ve just revisited that thread and your code, and I still cannot really understand your complaints. I still like my spray-json/lenses approach a bit better, and I probably would do a few things differently than you with circe, but I don’t think that the HCursor API is that terrible. I’m really wondering how an Any-based implementation would fare significantly better.

In case you’re curious to try it out, here’s a naive conversion from circe Json to Map[String, Any]:

import io.circe.{Json, JsonObject}

def toUntyped(json: Json): Option[Map[String, Any]] = {
  // Json.fold takes one function per JSON form, in this order:
  // null, boolean, number, string, array, object
  def cnv(j: Json): Any =
    j.fold(
      null,               // JsNull
      identity,           // Boolean
      _.toDouble,         // JsonNumber
      identity,           // String
      _.toList.map(cnv),  // Vector[Json]
      cnvObj              // JsonObject
    )
  def cnvObj(jo: JsonObject): Map[String, Any] =
    jo.toMap.view.mapValues(cnv).toMap
  json.asObject.map(cnvObj)
}
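
Used on the parsed GeoJSON from earlier (assuming jsonStr holds it):

import io.circe.parser.parse

val untyped: Option[Map[String, Any]] = parse(jsonStr).toOption.flatMap(toUntyped)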

Another thing is that your use case is somewhat special, though not completely unusual. You are reading the JSON data while transforming/extracting at the same time. This is perfectly fine, but the more common case is that one really wants to map JSON data to an equivalent representation in the code. If we take this route, the picture is somewhat different.

import io.circe.{Decoder, DecodingFailure, HCursor}
import io.circe.generic.auto._
import io.circe.parser.parse

sealed trait Geometry
case class Polygon(coordinates: List[List[List[Double]]]) extends Geometry
case class MultiPolygon(coordinates: List[List[List[List[Double]]]]) extends Geometry
case class Properties(name: String)
case class Feature(properties: Properties, geometry: Geometry)
case class FeatureCollection(features: List[Feature])

// dispatch on the "type" discriminator to choose the concrete decoder
implicit val decodeGeometry: Decoder[Geometry] =
  (c: HCursor) =>
    c.downField("type").as[String].flatMap {
      case "Polygon"      => c.as[Polygon]
      case "MultiPolygon" => c.as[MultiPolygon]
      case other          => Left(DecodingFailure("unsupported geometry type: " + other, c.history))
    }

val features = parse(jsonStr).flatMap(_.as[FeatureCollection])

circe in particular is pretty opinionated in that regard, as expressed in the design document:

You generally shouldn’t need or want to work with JSON ASTs directly.

Once you have this direct representation of the JSON model, it should be much easier to transform it into the representation your application actually wants, i.e. Map[String, List[List[Location]]]. This is certainly less efficient than the direct transformation upon parsing - on the other hand now you have the full content of the JSON data available in a nicely typed fashion for whatever else you may want to repurpose it to.
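
A minimal sketch of that follow-up transformation, assuming the Location(lat, lon) class from your code:

def toBorders(fc: FeatureCollection): Map[String, List[List[Location]]] =
  fc.features.map { f =>
    val rings = f.geometry match {
      case Polygon(cs)      => cs
      case MultiPolygon(cs) => cs.flatten
    }
    // each ring is a list of [lon, lat] pairs
    f.properties.name -> rings.map(_.collect { case lon :: lat :: _ => Location(lat, lon) })
  }.toMap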

Hi Sangamon, is what you copied into that message correct? Maybe I’m blind, but I don’t see the difference between the first two expressions, which evaluate differently.
123.getClass.getClass vs 123.getClass.getClass

The first 123.getClass.getClass should have just been 123.getClass

Yeah, sorry, Murphy’s law - I think I missed the first line when copying the REPL session and then somehow managed to grab the wrong line when trying to fix it up. 🙄 I’ll edit it for posterity.

Not really intending to resurrect this thread, but I just came across a blog post that reminded me of this discussion: Parse, don’t validate.
