Odd behaviour of Set, HashSet

mrkargar · December 30, 2020, 9:35pm

Dear friends,
I’m reading Object Equality chapter of martin’s book. there we read:

Note that the collection in the example here is a HashSet . This means
elements of the collection are put in “hash buckets” determined by their hash
code. The contains test first determines a hash bucket to look in and then
compares the given elements with all elements in that bucket.

Why set1 and set2 do not have the same size 3. hashcode mehtod written to generate random
number to test above paragraph. i expect that because three object have different hashcode, all sets have 3 elements.

import scala.collection.immutable.HashSet

object SetTest extends App {

  class B(val a: Int, val b: Int) {
    override def hashCode(): Int = {
      (math.random() * 100).toInt
      //      (a, b).##
    }

    override def equals(other: Any): Boolean = other match {
      case that: B => that.a == this.a && that.b == this.b
      case _ => false
    }
  }

  val b1 = new B(1, 2)
  val b2 = new B(1, 2)
  val b3 = new B(1, 2)

  println(b1.hashCode())
  println(b2.hashCode())
  println(b3.hashCode())

  var set1 = Set(b1, b2, b3)
  println(set1.size) // 1

  var set2 = HashSet(b1, b2, b3)
  println(set2.size) // 3

  import scala.collection.mutable

  val set3 = mutable.Set(b1, b2, b3)
  println(set3.size) // 3
}

MarkCLewis · December 30, 2020, 9:55pm

Generating random hashCodes this way means that every time you call hashCode on an object you get a different code. That will break the sets because they can’t find the proper bucket to look things up. You also break the condition that if obj1==obj2 then obj1.hashCode==obj2.hashCode.

I’d have to look into the code to see what data structure the default immutable set is using, but the fact that b1 == b2 == b3 seems to be the likely reason that set1.size == 1.

BalmungSan · December 30, 2020, 9:56pm

Because Set is optimized for small elements, Set(b1, b2, b3) creates a Set1 which uses only equality not hashCode.

Also, note that you have unnecessary vars and since your hashCode is not deterministic you can get weird behaviors.

mrkargar · December 30, 2020, 9:59pm

that’s the point:
def contains(elem: A): Boolean =
elem == elem1 || elem == elem2 || elem == elem3

thanks man

BalmungSan · December 30, 2020, 10:02pm

It is actually a bit more incremental, it is like:

((Set.empty + b1) + b2) + b3

So it goes from the empty set to Set1 which then won’t add the other two elements because they are equal.

mrkargar · December 30, 2020, 10:06pm

yes i know the following rule:

If two objects are equal according to the equals method, then
calling the hashCode method on each of the two objects must
produce the same integer result.

I just want to learn the behaviour of Set deeply.
thanks.

MarkCLewis · December 30, 2020, 10:38pm

That depends on the implementation. Not all sets are hash sets. The Set(b1, b2, b3) is intended to be abstract so you don’t know exactly what type of implementation is being used. As @BalmungSan mentioned, for very small sets it is using a special implementation that simply stored the values. If you look at the code for the immutable Set (https://github.com/scala/scala/blob/v2.13.4/src/library/scala/collection/immutable/Set.scala) you will see that there are special implementations for sets of size 1-4. Beyond that, it uses a SetBuilder that includes a HashedSetBuilder for the specific implementation.

mrkargar · December 30, 2020, 10:58pm

your right.
Default implementation is HashSet for Set.

MarkCLewis · December 30, 2020, 11:21pm

I have to admit that surprised me looking in the code for the immutable set. I would normally expect the immutable set to be something more tree-based. I guess the rule here is that if you really want to know, you go look at the source code.

Ichoran · December 30, 2020, 11:45pm

Immutable HashSet is a trie which is a type of tree, so I’m not sure
what you mean?

MarkCLewis · December 31, 2020, 12:27am

That makes sense. I saw the citation but didn’t go to look it up. Personally, it seems a bit odd to call something that doesn’t use a hashtable a HashSet even if it is using the hashCode method on the contents.

Ichoran · December 31, 2020, 3:48am

I see what you mean, but on the other hand a trie on the hashcode is about as close to indexing as you can get with immutable data structures with efficient add/remove.

mrkargar · December 31, 2020, 11:21am

But as a end user of Scala APIs, i want a set with 1 to n element. the implementation should be transparent from me and their behaviour should be consistent. If i do not know this tricky tip, i get into trouble as it happenned.
I know if the hashcode() had been implemented correctly, the issue would not exist. but still consistent behaviour is expected.

BalmungSan · December 31, 2020, 1:55pm

Well while you do have a point, in this case it doesn’t apply.

The behaviour is consistent if you follow the rules, and the rules are equality and hashCode should be consistent.
If you break the rules you cause bugs, the same will happen in Java or any other language.