Odd behaviour of Set, HashSet

Dear friends,
I’m reading Object Equality chapter of martin’s book. there we read:

Note that the collection in the example here is a HashSet . This means
elements of the collection are put in “hash buckets” determined by their hash
code. The contains test first determines a hash bucket to look in and then
compares the given elements with all elements in that bucket.

Why set1 and set2 do not have the same size 3. hashcode mehtod written to generate random
number to test above paragraph. i expect that because three object have different hashcode, all sets have 3 elements.

import scala.collection.immutable.HashSet

object SetTest extends App {

  class B(val a: Int, val b: Int) {
    override def hashCode(): Int = {
      (math.random() * 100).toInt
      //      (a, b).##
    }

    override def equals(other: Any): Boolean = other match {
      case that: B => that.a == this.a && that.b == this.b
      case _ => false
    }
  }

  val b1 = new B(1, 2)
  val b2 = new B(1, 2)
  val b3 = new B(1, 2)

  println(b1.hashCode())
  println(b2.hashCode())
  println(b3.hashCode())

  var set1 = Set(b1, b2, b3)
  println(set1.size) // 1

  var set2 = HashSet(b1, b2, b3)
  println(set2.size) // 3

  import scala.collection.mutable

  val set3 = mutable.Set(b1, b2, b3)
  println(set3.size) // 3
}

Generating random hashCodes this way means that every time you call hashCode on an object you get a different code. That will break the sets because they can’t find the proper bucket to look things up. You also break the condition that if obj1==obj2 then obj1.hashCode==obj2.hashCode.

I’d have to look into the code to see what data structure the default immutable set is using, but the fact that b1 == b2 == b3 seems to be the likely reason that set1.size == 1.

Because Set is optimized for small elements, Set(b1, b2, b3) creates a Set1 which uses only equality not hashCode.

Also, note that you have unnecessary vars and since your hashCode is not deterministic you can get weird behaviors.

1 Like

that’s the point:
def contains(elem: A): Boolean =
elem == elem1 || elem == elem2 || elem == elem3

thanks man

It is actually a bit more incremental, it is like:

((Set.empty + b1) + b2) + b3

So it goes from the empty set to Set1 which then won’t add the other two elements because they are equal.

1 Like

yes i know the following rule:

If two objects are equal according to the equals method, then
calling the hashCode method on each of the two objects must
produce the same integer result.

I just want to learn the behaviour of Set deeply.
thanks.

That depends on the implementation. Not all sets are hash sets. The Set(b1, b2, b3) is intended to be abstract so you don’t know exactly what type of implementation is being used. As @BalmungSan mentioned, for very small sets it is using a special implementation that simply stored the values. If you look at the code for the immutable Set (https://github.com/scala/scala/blob/v2.13.4/src/library/scala/collection/immutable/Set.scala) you will see that there are special implementations for sets of size 1-4. Beyond that, it uses a SetBuilder that includes a HashedSetBuilder for the specific implementation.

1 Like

your right.
Default implementation is HashSet for Set.

I have to admit that surprised me looking in the code for the immutable set. I would normally expect the immutable set to be something more tree-based. I guess the rule here is that if you really want to know, you go look at the source code.

Immutable HashSet is a trie which is a type of tree, so I’m not sure
what you mean?

1 Like

That makes sense. I saw the citation but didn’t go to look it up. Personally, it seems a bit odd to call something that doesn’t use a hashtable a HashSet even if it is using the hashCode method on the contents.

I see what you mean, but on the other hand a trie on the hashcode is about as close to indexing as you can get with immutable data structures with efficient add/remove.

3 Likes

But as a end user of Scala APIs, i want a set with 1 to n element. the implementation should be transparent from me and their behaviour should be consistent. If i do not know this tricky tip, i get into trouble as it happenned.
I know if the hashcode() had been implemented correctly, the issue would not exist. but still consistent behaviour is expected.

Well while you do have a point, in this case it doesn’t apply.

The behaviour is consistent if you follow the rules, and the rules are equality and hashCode should be consistent.
If you break the rules you cause bugs, the same will happen in Java or any other language.

3 Likes