How to find a key of a Map

jimka · March 1, 2020, 2:20pm

@eyalroth, Yes, I realize that the people reading this forum are not necessarily aware of particular research of people chatting here. However, I don’t understand why such would be necessary. That’s why I tried to reduce it to a pretty self contained use case.

Nevertheless, if anyone cares to look at the actual code, here is an excerpt.

The problem is that the use of the WeakHashMap (line 68) requires me to have a key which pretty much contains the same information as the value. I.e., the case class object has a label, positive, and negative and the hash table key (lines 79 and 84) is a triple of (label, positive,negative). That pretty much doubles the memory usage right there. Additionally, it seems from some profiling that I’ve done, the hash implement has yet another copy of the objects internally. Or at least lots of overhead per entry.

If I had a way to just store the values once, without having to have a redundant key, that would decrease the memory. So I thought about using a weak-key-hash which just mapped the objects themselves to some ignore-me value, like true. But then I have the problem of trying to retrieve a key, which seems to be omitted from the scala API.

jimka · March 1, 2020, 2:30pm

In short the JVM memory explodes because the computations really are exponential in size. They trade exponential time for exponential space. The problem is not that it is exponential, that is understood in the theory, rather the problem is that the data structures I’m using require additional redundant objects to be stored. I’m looking for a leaner implementation which gives me an equivalent API.

jimka · March 1, 2020, 2:48pm

Actually in explaining this yet again, I had an idea. I currently have a WeakValueHashMap of type Map[(Int,Bdd,Bdd),BddNode] which has the property that the BddNode begin stored as value is one which the constructor BddNode(Int,Bdd,Bdd) created. Can I reduce memory if I instead make Map[Int,BddNode] where Int is the hash code System.identityHashCode of the BddNode? An Int takes much less storage than the tuple (Int,Bdd,Bdd), and should take roughly 25% as much. Right?

jimka · March 1, 2020, 3:29pm

That doesn’t work (to hash on the hash key) because if two objects hash to the same key, I won’t be able to distinguish them.

curoli · March 1, 2020, 3:59pm

Well, yes. While in principle, it is desirable to have only one copy of each object around, I would expect the complexity of enforcing this to outweigh the benefit unless your objects are fairly large.

jimka · March 1, 2020, 4:36pm

Sounds like what I need is an API enhancement to the Map class such that given an object == to a key, return me the key.

curoli · March 1, 2020, 5:06pm

What if a Map doesn’t actually explicitly store the keys, like a Trie.

jimka · March 2, 2020, 10:33am

Not sure I understand the question? Are you suggesting that if the Map is implemented internally than asking for the key is impossible? Or are you suggesting that I might try to use a Trie rather than a Map in my application?

ivanopagano · March 2, 2020, 1:18pm

I guess @curoli is suggesting that trying to get back the key from a Trie-based Map is actually increasing memory usage, since the key object is not stored as such in the heap, but it’s represented differently in the implementation…

eyalroth · March 2, 2020, 2:06pm

That’s because sometimes it could be useful to know the overall context in order to potentially provide a solution.

I believe that what you want to hold in your cache are the labels as keys and the nodes as weak-referenced values. Check out this example. The cache will not prevent nodes from being GC-ed (given that they are not referenced outside the cache).

If you like the cache to remove index entries of GC-ed nodes as well, you’d to have to extract the label from the nodes, and keep it only in the cache (as a new weak-reference map from a bdd to its label).

jimka · March 2, 2020, 3:13pm

Not sure if your understanding is correct. A BDD is uniquely determined by its label plus its two children. Many BDDs may have the same label, but never do (should in a correct implementation) two distinct BDDs have the same label/positive-child/negative-child. Here is a picture of such a BDD. You see several BddNode objects have label Z2, several have label Z3, and several have label Z4

BTW can you give me some explanations about the semantics of scala.ref.WeakReference. The documentation in the file really assume the user already fully understands the java class behind it.

eyalroth · March 2, 2020, 3:24pm

Oh, I thought that the label is some internal unique identifier. The numAllocations got me confused. In that case, I think you want to have a weak-reference map with nodes both as keys and as (week reference) values. Check this new example.

Perhaps this SO question will help.

jimka · March 2, 2020, 4:26pm

Let me see if I understand. You are using a WeakHashMap which is a weak-keys-map. This allows the GC to delete entries unless the key is reverenced elsewhere. And you are store as the corresponding value a WeakReference of the SAME-eq object. This means, being a value of the hash-map is invisible to the GC. Right?

If this is the correct understanding, then I think this is a big move forward.
Is that right?

So is there a way for me to test the implementation? For example, can I have some log message when an object is getting removed from he hash?

The old implementation which I was using, had the annoying feature that after values were removed from the cache, they were replaced with null, rather than having the key/value removed. Should this fix that problem?

curoli · March 2, 2020, 6:15pm

Right, I meant it is not guaranteed that the key is actually stored somewhere in the Map, and therefore asking for a key is not guaranteed to give a key that is identical to the one you used to update, or identical to the one you got last time.

eyalroth · March 3, 2020, 12:39am

Pretty much a yes on both questions

You can attempt at detecting when your objects are being GC-ed by overriding the finalize method and logging there. This doesn’t necessarily mean that the object’s entry was evacuated from the cache; unfortunately it doesn’t seem to provide a “hook” for entries’ evacuation. You could try checking the cache size though.

Note that this cache implementation is a naive one. Guava (Google’s library for Java) has a more elaborate Interner utility which provides a simpler API and potentially better performance. You’d obviously want to use a weak interner.

jimka · March 4, 2020, 3:14pm

I have implemented your suggestion, and I probably have a bug. What I see is that it is EXTREMELY slow. Can you tell me what the following line 38 of your code example does depending on whether I redefine equals and hashCode vs as you’ve defined BddNode in your example? I fear that the way you’ve defined BddNode there is a n equals call somewhere in the WeakHashMap.get method which is traversing the two BddNodes to discover equality. The BddNodes are HHUUGGEE if you try to traverse them. Maybe I’m wrong with my supposition about why I’m seeing the slowness.

case class BddNode(label:Int, positive:Bdd, negative:Bdd) extends Bdd {
  // equals and hashCode methods copied from here:
  //   https://users.scala-lang.org/t/what-does-equivalence-of-instances-mean/4368/11?u=jimka
  override def equals(that: Any): Boolean = {
    that match {
      case b: Bdd => this eq b
      case _ => false
    }
  }

  override def hashCode(): Int =
    System.identityHashCode(this)
}

eyalroth · March 4, 2020, 4:42pm

In order for the weak hash map or the guava interner solutions to work, the equals and hashcode methods must rely on canonical / logical identity rather than reference identity (as in your code).

Your problem is that you want to prevent duplication of objects that are logically equal; hence, when you are about to create a new object, you want to first check whether a logically equal object already exists in your cache. If you had a reference to that object in the first place, you wouldn’t need the cache, but since you don’t have it, you need to construct a new object – meaning, no object in the cache will have a reference to it, and you will always fail to find it in the cache (rendering it useless).

It’s fair to assume that in case your nodes are deeply nested, then both hashcode and equals will consume a lot of CPU, as they are both recursive (in this case). You could assert that by profiling you application (say, with VisualVM).

In order to optimize the runtime of hashcode, I would attempt to only compute it once and then memorize it; that will trade memory for CPU:

import scala.runtime.ScalaRunTime

case class BddNode(label:Int, positive:Bdd, negative:Bdd) extends Bdd {
  // `_hashCode` is what case classes use to generate hash code
  private lazy val _hashCode = ScalaRunTime._hashCode(this)
  override def hashCode(): Int = _hashCode
}

As for equals, I think you would need to come up with a way of generating a unique hash function for you nodes and also memorize it; hashcode is not unique, as it might very well produce collisions between non-equal objects:

case class BddNode(label:Int, positive:Bdd, negative:Bdd) extends Bdd {
  override def equals(that: Any): Boolean = {
    that match {
      case node: BddNode => uniqueHash == node.uniqueHash
      case _ => false
    }
  }

  private lazy val uniqueHash: String = {
    def childHash(child: Bdd): String = child match {
      case True => "T"
      case False => "F"
      case node: BddNode => node.uniqueHash
    }

    s"$label[${childHash(positive)},${childHash(negative)}]" 
  }
}

This uniqueHash is very simplistic and consumes a lot of memory. I’ve used it in order to demonstrate what it stands for – the “path” of the node to the root bdds (True and False), which determines the identity of the node.

If the logical meaning of the label is to denote the level of nesting of the node, then it can be safely omitted from this hash method, as the level of nesting is implied by the path.

eyalroth · March 4, 2020, 5:15pm

One more thing. If, when about to create a new node, it is guaranteed that the children are interned (both positive and negative don’t have other objects that are similar to them), then you can use this equals method:

override def equals(that: Any): Boolean = {
  that match {
    case node: BddNode =>
      label == node.label && positive eq node.positive && negative eq node.negative
    case _ => false
  }
}

jimka · March 4, 2020, 6:56pm

I think this is the sense of my other question.

I’m trying the following. Early results are positive. It’s not dead slow as it was before.

I wonder if it is a bad idea to use pattern matching in the equals method.
There was a previous discussion about how to write a similar equal method.

case class BddNode(label:Int, positive:Bdd, negative:Bdd) extends Bdd {

  override def equals(that: Any): Boolean = {
    that match {
      case BddNode(l2:Int,p2:Bdd,n2:Bdd) => ( p2 eq positive) && (n2 eq negative) && (l2 == label)
      case _ => false
    }
  }

  override def hashCode(): Int =
    label.## ^ System.identityHashCode(positive) ^ ~System.identityHashCode(negative)

...
}

jimka · March 5, 2020, 8:23am

So one piece of good news, whereas my function was getting out-of-memory at numNodes=40, with the previous implementation, after incorporating Eyal’s suggestion, including my hashCode described above I’m not able to compute numNodes=40, and 41.