Repeatable random numbers

Russ · November 22, 2023, 12:35am

I am using random numbers for a small simulation, and I want them repeatable for debugging. I am using something like this:

val random1 = util.Random(333333)

It produces repeatable results on one computer, but it produces different results on another computer. I am using Scala 3.3.1 on both machines. Is that difference expected, and is there a way to make them run exactly the same?

SethTisue · November 22, 2023, 3:44pm

Is it possible you’re somehow mistaken that you’re seeing different results on different machines? scala.util.Random is just a wrapper for java.util.Random, and the Javadoc on that class says:

If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers

I get this same result on both MacOS and Linux:

scala> new java.util.Random(33333).nextInt(10000)
val res0: Int = 742
                                                                                                    
scala> new scala.util.Random(33333).nextInt(10000)
val res1: Int = 742

Russ · November 22, 2023, 8:44pm

OK, thanks for clarifying that. I must be doing something wrong, but I still don’t know what it could be. I’m using rsync to run the exact same code with the same inputs and compiler version on the other machine but getting very different results.

devlaam · November 24, 2023, 9:59am

Could this be due to different versions over the JVM?

sjrd · November 24, 2023, 2:51pm

All JVMs agree on the algorithm used by java.util.Random, because it is specified in the JavaDoc. In fact, even Scala.js produces exactly the same sequences of numbers in java.util.Random, given the same seed.

Russ · November 27, 2023, 2:05am

I finally figured out what was going on. This was a minor problem for me, but it bothered me that I was getting a different result using the exact same code (and same inputs) on a different machine. Here’s a simplified version of the relevant code:

  val rand1 = util.Random(333333)
  val routes = rand1.shuffle(namedRoutes.toVector)

namedRoutes is a Set, and the non-repeatable operation is converting a Set to a Vector rather than the random shuffling operation.

That raises an interesting question. The conversion of a Set to a Vector was apparently repeatable on a given machine, but it was not repeatable across different machines. Should it be? I can see arguments both ways, but I tend to lean toward thinking it should be repeatable. But what would be the ordering criterion? Hash code?

sageserpent-open · November 27, 2023, 9:40am

How do you represent an individual route - can it be ordered? Does it have an explicit override of hashCode? Is it a case-class?

My guess is that you have instantiated a HashSet, either explicitly or wrapped up via Set.apply, in which case, yes - ordered by hash-code.

You can always use a TreeSet if you have some ordering on your routes and the precise order matters to you.

If you are concerned about repeatability from one run to another, make sure you aren’t picking up the default Java object identity hash in your route representation. I’d expect that to be varying between runs even on the same machine, but maybe you just got lucky when you tested it?

If you have a nice deterministic hash code - for example, you are using case-classes all the way down, then I guess there must be some other factor that changes the ultimate hash code used within Set - so some CPU size difference on your integers, longs or whatever?

Russ · November 27, 2023, 4:43pm

The namedRoutes are case classes. For my purposes here, I just ordered the Vector arbitrarily by name (a String) to provide repeatability before randomly shuffling them.

However, other parts of my program are still not repeatable across the two different machines. That is not a huge problem in itself, but it does complicate debugging a bit. And I think it would be a bigger problem for developers working on a team.

I now know that converting a Set to a Vector is potentially nonrepeatable. I am wondering if anyone can provide a list of other potential causes of non-repeatability so I can look for them in my code.

SethTisue · November 27, 2023, 6:15pm

In my experience, nonrepeatable results are nearly always due to hash table ordering. (But YMMV, maybe? Not sure how universal my experience with this is.)

The TreeSet suggestion is good if there is a suitable ordering you can provide. If you just want to preserve insertion order, consider using LinkedHashSet/Map.

charpov · November 27, 2023, 7:39pm

Hashing and multi-threading are the two that have bitten me the most, but I’ve also had issues with I/O default buffers sized differently in different environments. Of course, none of these are expected to be repeatable and correct code should handle all the variations.

jducoeur · November 27, 2023, 11:28pm

Right, but the question at hand appears to be debugging and testing, and for that you want everything to be as deterministic as possible.

That said, the usual rule is to make everything that is potentially non-deterministic stubbable (typically using dependency injection), so that tests can control them precisely.

Besides the items mentioned above, the clock is another big source of non-determinism that can often bite you at test time: you usually want a wrapper around the real clock so that you can manage it precisely when testing.

Russ · November 29, 2023, 2:30am

I am trying to enhance the repeatability of my code by replacing all the Sets with something else that imposes an order. The actual order does not matter in most cases, but having a consistent order for every run on every machine is important for repeatability.

The simplest solution would be to just use Vector. The only problem with it is that it allows repeat elements, which I don’t want, although it would probably be easy enough in practice to just avoid them (or use the “distinct” method to eliminate them, but I would prefer to not clutter my code with that).

The other classes that have been suggested are SortedSet, TreeSet, and LinkedHashSet. The problem with SortedSet and TreeSet is that they require an ordering method, but in most cases I don’t want to require that, and I don’t care about any particular order so long as it is consistent and repeatable. The problem with LinkedHashSet is that it is available only in mutable form, which is of course completely unacceptable!

So what should I use? Should I just use Vector and be done with it, or is there a better option?

Jasper-M · November 29, 2023, 2:05pm

There is scala.collection.immutable.ListSet. But take into account the docs which say

Other operations, such as inserting or removing entries, are also O(n), which makes this collection suitable only for a small number of elements.

But I think this will probably be a concern for most insertion-ordered sets.

BalmungSan · November 29, 2023, 2:10pm

They could / should rather use VectorMap: Scala Standard Library 2.13.12 - scala.collection.immutable.VectorMap

c.c. @Russ

SethTisue · November 29, 2023, 2:56pm

Note that there ought to be a VectorSet, but for some reason there isn’t. I opened a ticket: Add `VectorSet` (to correspond with `VectorMap`) · Issue #168 · scala/scala-library-next · GitHub

The workaround is to use VectorMap[T, Unit] as a substitute for VectorSet[T].

Russ · November 29, 2023, 10:46pm

Rather than adding VectorSet, would it make sense to just make Set the same as what VectorSet would be? In other words, if we have VectorSet, would there be any good reason to use Set? Ditto for VectorMap.

Ichoran · November 30, 2023, 4:26am

VectorMap is considerably slower than Vector or Map. It’s a big enough difference that it doesn’t make sense to use as the default when you don’t even need the functionality.

There was a lot of experimentation with different ways to do immutable ordered maps before VectorMap was settled upon as the best compromise, so for what it does it’s “surprisingly” fast. But why pay (in speed and memory) for features you don’t need?

Russ · November 30, 2023, 9:51pm

In an attempt to realize repeatability across machines I have temporarily converted all of my Sets to Vectors. However, I am still getting different results on the two different machines that I am using (both running Linux with Scala 3.3.1 and the exact same inputs).

The lack of exact repeatability isn’t a critical problem in itself, but the more fundamental problem is that, while my program appears to run correctly on one machine, it gives me clearly incorrect results on the other. I have done everything I can think of to figure out this inconsistency for over a week, but I am baffled. I’ve had some weird bugs in the past but I don’t recall ever having one like this.

I don’t expect anyone to be able to help me, but any suggestions will be appreciated.

Ichoran · November 30, 2023, 10:52pm

Are you absolutely definitely positively sure that there is no way whatsoever that the two programs could have different inputs? And you know that it’s single-threaded?

Russ · December 1, 2023, 3:10am

I am not sure I can honestly claim all of those adjectives, but yes I am fairly sure. As far as I can tell, the inputs are two text files, one of which is two lines long. The other is an XML file that is 250 lines long. They are both in the current working directory that I run the program from. Just to be sure, I used scp to copy that entire working directory from one computer to the other.

Also, I am using no libraries other than the Scala standard library, the Scala XML library, and Scala parallel Vectors. The program does some parallel processing using “.par”, but otherwise I think it is single threaded.

I’m probably missing something glaringly obvious here, but I am baffled about the inconsistency between the runs on different machines and the bugs on the one machine.