Immutable, huge and nested objects

simon-gunacker · October 10, 2018, 9:38am

Hi there,

let’s assume I would like to parse a huge YAML (JSON, XML, whatever) file being nested deeply. Let’s not discuss what ‘huge’ and ‘deeply’ means; my question is more general: should I design the data structure to be immutable? The parsed YAML file lives in memory throughout the time the program runs and values at any level might be changed while the program runs. Eventually, changes are stored to disk. To my understanding, immutability is highly preferred over mutable objects and I definitely understand why. But considering my YAML file, immutability would mean that every change in that data structure would cause the object to be copied completely (with that one single value changed). This seems to be much overhead to me. What would you suggest to do in this case? I am new to Scala and I am eager to learn to use the language as it is intended to be used. Therefore I would prefer to stick to the immutable variant but copying large data structures everytime single values change seems to be a high price for immutability. Should I be willing to pay that price or am I missing some concept that allows “cheaper” immutability here?

Thanks, Simon

jducoeur · October 10, 2018, 12:19pm

This is a common misunderstanding. If done correctly, making a change usually means N copies, where N is the depth of the change in your data structure, rather than copying the entire data structure. It’s typically not more than half a dozen modest allocations, usually less.

The thing is, you generally only do “shallow” copies at each level. So if one part of the structure has 20 fields under it, and you are changing 1 field, you reuse the other 19 in the new copy. (Since it is all immutable, this sort of reuse is safe, which it wouldn’t be in a mutable data structure.)

So while it’s not free, it usually isn’t terribly expensive. In my system, the core data structure is frequently a megabyte, made up of hundreds or thousands of sub-objects, and the immutable copy is still cheap enough to be a non-issue.

This is a key aspect of programming immutably. Immutable data structures (including Scala case classes) are generally designed so that the amount you have to actually copy when you make a change is minimized.

You may also want to look into a “lens” library such as Monocle, which can handle some of the boilerplate of deeply-nested fetches and copies.

simon-gunacker · October 10, 2018, 1:35pm

Thanks for your answer. I just try to demonstrate what I was talking using a simple example of nested objects (no YAML or JSON considered here):

abstract class Person(val name: String) {
  override def toString: String = name
}

case class Teacher(override val name: String) extends Person(name)

case class Student(override val name: String) extends Person(name)

class Course(val courseName: String, val teacherName: String) {
  private val name: String = courseName
  private val teacher: Teacher = Teacher(teacherName)
  private var students: List[Student] = List()

  def registerStudent(student: Student): Unit = {
    students ++= List(student)
  }

  def dump(): Unit = {
    println(f"$name (taught by $teacher)")
    students.foreach(student => println(f"  Student: $student"))
  }
}

class School() {
  private var courses: Map[String, Course] = Map()

  def createEmptyCourse(courseName: String, teacherName: String) = {
    courses += (courseName -> new Course(courseName, teacherName))
  }

  def registerStudentToCourse(courseName: String, studentName: String) = {
    courses(courseName) registerStudent Student(studentName)
  }

  def dump(): Unit = {
    courses.values.foreach(course => course.dump())
  }
}

object Test extends App {
  val school = new School()
  school.createEmptyCourse("Math", "Mark")
  school.registerStudentToCourse("Math", "Josh")
  school.registerStudentToCourse("Math", "Lisa")

  school.createEmptyCourse("Biology", "Mary")
  school.registerStudentToCourse("Biology", "Josh")
  school.registerStudentToCourse("Biology", "Henry")
  school.dump()
}

I guess my question is basically: how can the program be written when both Course.students and School.courses are val's? Besides: is my code scala-like or is there anything else I should consider?

thanks, Simon

jducoeur · October 10, 2018, 4:03pm

Yep, that’s a very typical use case.

There are three key things to keep in mind:

Scala’s immutable.Map is specifically designed for reasonably efficient additions, despite being immutable (which it does by sharing data with previous copies). So “copying” a Map with + is usually fine.
Typically, you would use case class for all of your major data structures.
case class has a magically-created .copy() function that you normally use for these things.

So your example would come out something like:

// I would typically define this as a trait
trait Person {
  // Making this an abstract def leaves you lots of flexibility in how you define it in children:
  def name: String
  override def toString: String = name
}

// No need for the `override val` in a case class, nor do we need to pass it to the parent;
// it just gets inherited:
case class Teacher(name: String) extends Person

case class Student(name: String) extends Person

// Make Course an immutable case class, and have students be just another immutable field.
case class Course(val courseName: String, val teacherName: String, val students: List[Student]) {
  private val name: String = courseName
  private val teacher: Teacher = Teacher(teacherName)

  // Here's the key mental shift.  Anywhere you would mutate a var, you instead return a *copy*
  // of this Course, updated.
  def registerStudent(student: Student): Course = {
    // copy() is auto-generated for all case classes, and returns a copy of this object, just
    // replacing any fields you specify:
    // Put the new student at the front, because prepending to a List
    // is much, much cheaper than appending:
    copy(students = student +: students)
  }

  def dump(): Unit = {
    println(s"$name (taught by $teacher)")
    students.foreach(student => println(s"  Student: $student"))
  }
}

// Again, use a case class, with an immutable value instead of the var:
case class School(courses: Map[String, Course]) {
  // Again, return a new copy of the School whenever you make a change:
  def createEmptyCourse(courseName: String, teacherName: String): School = {
    copy(courses = courses + (courseName -> new Course(courseName, teacherName, List.empty)))
  }

  // And again, return an updated version of the School:
  def registerStudentToCourse(courseName: String, studentName: String): School = {
    // I don't actually recommend the point-free style here -- I think it's best used sparingly:
    val updatedCourse = courses(courseName).registerStudent(Student(studentName))
    copy(courses = courses + (courseName -> updatedCourse))
  }

  def dump(): Unit = {
    println(s"The whole school:")
    courses.values.foreach(course => course.dump())
  }
}

object Test {
  // And now, instead of doing lots of mutations to the School,
  // we just keep returning new copies, and calling methods on
  // those copies. Obviously, you could create intermediate
  // val's for each copy, but it's often easier to just chain
  // them together like this:
  val filledSchool = 
    new School(Map.empty)
      .createEmptyCourse("Math", "Mark")
      .registerStudentToCourse("Math", "Josh")
      .registerStudentToCourse("Math", "Lisa")
      .createEmptyCourse("Biology", "Mary")
      .registerStudentToCourse("Biology", "Josh")
      .registerStudentToCourse("Biology", "Henry")
      
  def run() = {
    filledSchool.dump()
  }
}

Test.run()

Here that is in Scalafiddle. Note that I removed the App-ness of it, to make it run there, but that’s not really relevant to the point.

Hope this gives you the idea. They key is that, wherever you would change a var, you instead make a copy and return that. As previously mentioned, this can result in a fair amount of boilerplate if you do it really deeply; a lens library can help reduce that. But this general technique is how you switch to immutable programming.

More questions? Happy to help – this stuff is usually the big mental hurdle for new Scala learners, especially when coming from somewhere like Java…

bmaso · October 10, 2018, 5:30pm

+1 to Justin’s answer. It’s a mental shift moving to immutable objects, but well worth the effort especially when you start work with lots of simultaneous requests/processes or distributed processing.

Brian Maso

simon-gunacker · October 11, 2018, 1:08pm

Thank you! This was what I was looking for. And you are right, copying objects instead of modifying references is a big mental shift!

However, I’d say the mental shift is not to understand the concept but to believe in it! Believing that copy is optimized in such a way that “untouched” data structures are copied in a shallow way and believing that copying objects pays off somehow.

But how? I mean, the whole reasoning is about parallelism, isn’t it? If so, I still don’t get the point. Yes, threads might complicate things and when modifying a mutable data structure in a threaded environment I need to ensure that none of the threads accesses an “outdated version” of my data structure. But whats the big advantage of copying the data structure? I still need to ensure that none of the threads accesses an outdated copy of my structure, right? So whats the point of doing all these copies?

jducoeur · October 11, 2018, 2:01pm

Actually, no. Parallelism is the easy argument for it – it’s seriously hard to write reliable mutable code that parallelizes well – but there’s a subtler base argument about ease of reasoning about code.

There’s a crucial concept of “aliasing” – being able to have multiple pointers to the same object – that shows up in nearly all programming languages. And the thing is, when you have aliasing, your ability to reason about your code is really hurt by mutation. With mutation, you can easily wind up in situations along these lines:

val schoolIn2017 = new School(...)
... lots of code later
val schoolIn2018 = schoolIn2017.makeABunchOfChanges()
... lots of code later
// This is totally broken, because schoolIn2017 is actually schoolIn2018 because it's all been mutated
writeAnnualReport("Status Report for 2017", schoolIn2017)

Obviously this is very contrived, but it illustrates the point: when you’re in a world that is mutating, you never really know what your data is. In a small program this doesn’t often arise in practice, but in real-world larger programs it happens all the time, that my object gets “changed out from under me”, six levels down the stack, and I have to spend lots of effort tracking that down. With immutable structures, that just plain doesn’t happen: once I create schoolIn2017, I know that other code isn’t going to change it, so it’s easier to understand my code.

That’s actually not right, but it leads into a different aspect of parallelism.

The thing is, honest parallelism recognizes that the world is never 100% in sync. Akka takes this to the point of practically a religion, since it is all about distributing data around the cluster, but it’s more generally true: even in a single program space, if you have multiple threads it becomes almost arbitrarily difficult to keep all of the data precisely synchronized.

So instead, you think in terms of “snapshots” – a chunk of data that was true at some point in the past – and acknowledge that race conditions are a fact of life: this thread might be slightly off in terms of its view of the world from that one. It always has locally correct knowledge, but that may not be precisely the same as the knowledge that some other thread has, and you use some sort of mechanism for sync’ing up as necessary. (I lean towards Actors for the sorts of problems I typically deal with, but that’s only one of several approaches.)

And yes, this is a hassle – no question there. Asynchronous programming is never easy. But if you use immutable data, you avoid a gigantic pile of problems that are really hard to solve, around threads stepping on each other at just the wrong moment…

curoli · October 11, 2018, 6:09pm

It’s that “threads might complicate things”.

The compiler will try to optimize your code, including by changing the order of steps and short-cutting expressions. The compiler is authorized to make such changes as long as the result is the same on the same thread. But, there is no guarantee that other threads see things happening like they appear in the code. This is also known as the Java Memory Model.

For example, if you set a flag, then change something, then unset the flag, with the expectation that other threads can see that the flag has been set and therefore be alerted that changes are happening, the compiler might conclude that the setting and unsetting of the flag has no effect within the same thread and instead leave the flag unset altogether.

There is one guarantee, though: if you create a new object and the constructor returns the reference to it, then any thread using that reference will see all the changes that happened in the constructor. That means if an object is not further changed after construction - i.e. it is immutable - all threads will see the same object.

The classical method to communicate between threads is to use locks. If one thread locks an object, changes it, and then unlocks it, then any other thread that locks that object will see all the changes.

The problem is that locks don’t scale.

If you have a large data structure and use a lock on it, there may be many threads that want to change it, and then only one thread can change it at a time and the other threads have to wait for the lock.

So will try to have many locks for different parts of the data. But then you will have threads that need more than one lock to to accomplish a task. Then, what happens when a thread obtains one lock but needs a second one that is not available? If the thread decides to hold on to the first lock while waiting for the second, this will block any other threads that also need the first lock. In particular, if a second thread has acquired the second lock and is waiting for the first lock, you have a deadlock.

You may try to impose a policy where no thread is allowed to hold a lock without making immediate progress. So, a thread may acquire one lock, then try to get a second lock, and if the second lock is not available, the thread has to release the first lock and try again later. This can prevent deadlock, but may cause the system to spend most of its time acquiring and releasing locks.

A more scalable approach is to have threads communicate by sending each other immutable objects. Actually, there may still be locks, but only in very limited cases, such as briefly locking a collection to add or remove an object from it.

charpov · October 15, 2018, 2:29pm

That is not quite true and has been a source of tricky bugs in Java applications. In Java, it is true of final fields and of everything reachable from them. There is no such guarantee for non-final fields.

If the reference that is returned by the constructor is placed in a final field before being accessed by another thread, you’re still okay. You’re also fine if there is some kind of memory barrier between construction and access by the other thread (e.g., the object is placed into a blocking queue and retrieved by the other thread). Otherwise, a thread can see a partially constructed object, including an object that changes later magically as the effects of some constructor statements becomes visible.

In Scala, I find it harder to think about this without knowing exactly what the corresponding Java code is (e.g., what is final, what is static). I assume vals are final, hence safe and that singleton objects are accessed through static fields, hence also safe. Anyone knows of a document that describes the Scala guarantees when it comes to publishing objects across threads?

curoli · October 15, 2018, 7:38pm

Ah, you are right.

“An object is considered to be completely initialized when its constructor finishes. A thread that can only see a reference to an object after that object has been completely initialized is guaranteed to see the correctly initialized values for that object’s final fields.”

simon-gunacker · October 18, 2018, 1:09pm

There is one difference in our School class though. My School “constructor” (or whatever this is called for case classes) has no parameter. My goal was to init courses internally whenever a new object is created. However, your solution expects a Map when initializing in object. Is there some way to init it internally. I tried, but then the param is not recognized by copy. By the way - a little side question: how can I implement copy on my own. Just in case I am not using a case class. My own implementation should support that shallow-copy-behaviour.

jducoeur · October 19, 2018, 12:07pm

I’m not sure I understand what you mean by “init it internally”. Can you elaborate? In my experience, depending specifically on zero-parameter classes tends to be a bad smell, that leads to problems down the line, but you can often get a similar effect a different way.

Broadly speaking, it means that you need to have a copy() method, whose parameters are identical to the ones in the constructor, each parameter of which has a default value which is the current value of this object, and does a new using those parameters. (Which is, I believe, basically what gets generated automatically when using a case class.)

simon-gunacker · October 19, 2018, 12:28pm

Well, simply assume that I want a school to start with an empty list of courses. This probably doesn’t make sense in that context, but there are situations where I do not want the one who generates objects of my class to initialize it with values …

curoli · October 19, 2018, 3:28pm

You could replace your School class by two classes, an id which is just a simple immutable key to identify a school, and a state, which is a snapshot of what’s going on in the school at a certain point in time. If you really like to keep your School class, you could do something like

object School {

** case class Id(string: String)**

** trait State { … }**

}

class School(val id: School.Id, var state: School.State)

jducoeur · October 19, 2018, 6:54pm

In which case, I’d probably put a default argument on School:

case class School(courses: Map[String, Course] = Map.empty) { ...

which would allow it to be called as simply:

new School
  .createEmptyCourse("Math", "Mark")

or, if you want to really make sure that it can only be created empty, add a method in a companion object for that initial creation:

object School {
  def init: School = School(Map.empty)
}

and call that instead.

There are ways to provide stronger guarantees, but they would take more effort and the details would depend on the API you’re trying to build – for example, make School a trait instead of a class, and expose an API that only returns an EmptySchool object (actually, at that point you can just have that be a public singleton object for starting out), and hide the courses member so that it can’t be directly accessed. Or something like that – there are lots of ways to slice this, depending on precisely what you want your API to look like…

jducoeur · October 19, 2018, 6:54pm

Remember, this question started out as how to do this immutably…