Right, I meant it is not guaranteed that the key is actually stored somewhere in the Map, and therefore asking for a key is not guaranteed to give a key that is identical to the one you used to update, or identical to the one you got last time.
Pretty much a yes on both questions
You can attempt to detect when your objects are being GC-ed by overriding the finalize method and logging there. This doesn't necessarily mean that the object's entry was evicted from the cache; unfortunately WeakHashMap doesn't seem to provide a "hook" for entry eviction. You could try checking the cache size, though.
Note that this cache implementation is a naive one. Guava (Google’s library for Java) has a more elaborate Interner utility which provides a simpler API and potentially better performance. You’d obviously want to use a weak interner.
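To make the idea concrete, here is a minimal sketch of a weak interning cache over `java.util.WeakHashMap` (the `WeakInterner` name and layout are my own illustration, not Guava's API; Guava's `Interners.newWeakInterner()` is more robust):

```scala
import java.util.WeakHashMap
import java.lang.ref.WeakReference

// A naive weak interner: keys are held weakly, and the value is a weak
// reference back to the canonical key, so interned objects can still be
// garbage-collected once nothing else refers to them.
class WeakInterner[A <: AnyRef] {
  private val cache = new WeakHashMap[A, WeakReference[A]]()

  def intern(a: A): A = cache.synchronized {
    val ref    = cache.get(a)                 // relies on equals/hashCode of A
    val cached = if (ref == null) null else ref.get()
    if (cached != null) cached
    else {
      cache.put(a, new WeakReference(a))
      a
    }
  }
}

// usage: two structurally equal tuples intern to the same instance
val interner = new WeakInterner[(Int, String)]
val a = interner.intern((1, "x"))
val b = interner.intern((1, "x"))
assert(a eq b)
```

Note that `WeakHashMap.get` looks keys up via their `equals`/`hashCode`, which is exactly why those methods must reflect logical rather than reference identity.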
I have implemented your suggestion, and I probably have a bug. What I see is that it is EXTREMELY slow. Can you tell me what line 38 of your code example does, depending on whether I redefine `equals` and `hashCode` vs. using `BddNode` as you've defined it in your example? I fear that with the way you've defined `BddNode`, there is an `equals` call somewhere in the `WeakHashMap.get` method which traverses the two `BddNode`s to discover equality. The `BddNode`s are HUGE if you try to traverse them. Maybe I'm wrong in my supposition about why I'm seeing the slowness.
case class BddNode(label: Int, positive: Bdd, negative: Bdd) extends Bdd {
  // equals and hashCode methods copied from here:
  // https://users.scala-lang.org/t/what-does-equivalence-of-instances-mean/4368/11?u=jimka
  override def equals(that: Any): Boolean =
    that match {
      case b: Bdd => this eq b
      case _      => false
    }
  override def hashCode(): Int =
    System.identityHashCode(this)
}
In order for the weak hash map or the Guava interner solutions to work, the `equals` and `hashCode` methods must rely on canonical / logical identity rather than reference identity (as in your code).
Your problem is that you want to prevent duplication of objects that are logically equal; hence, when you are about to create a new object, you want to first check whether a logically equal object already exists in your cache. If you had a reference to that object in the first place, you wouldn’t need the cache, but since you don’t have it, you need to construct a new object – meaning, no object in the cache will have a reference to it, and you will always fail to find it in the cache (rendering it useless).
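A hypothetical sketch of that failure mode (the `Node` class and cache here are my own illustration): with reference-based `equals`/`hashCode`, a freshly constructed, structurally identical key never matches the cached one.

```scala
import java.util.WeakHashMap

// Node deliberately uses reference identity, like the BddNode above.
class Node(val label: Int) {
  override def equals(that: Any): Boolean = that match {
    case n: Node => this eq n
    case _       => false
  }
  override def hashCode(): Int = System.identityHashCode(this)
}

val cache = new WeakHashMap[Node, Node]()
val first = new Node(42)
cache.put(first, first)

// A logically identical node is a *different* reference, so the lookup
// always misses and the cache never deduplicates anything.
val lookup = cache.get(new Node(42))
assert(lookup == null)
```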
It’s fair to assume that in case your nodes are deeply nested, then both hashcode
and equals
will consume a lot of CPU, as they are both recursive (in this case). You could assert that by profiling you application (say, with VisualVM).
In order to optimize the runtime of hashcode
, I would attempt to only compute it once and then memorize it; that will trade memory for CPU:
import scala.runtime.ScalaRunTime

case class BddNode(label: Int, positive: Bdd, negative: Bdd) extends Bdd {
  // `ScalaRunTime._hashCode` is what case classes use to generate their hash code
  private lazy val _hashCode = ScalaRunTime._hashCode(this)
  override def hashCode(): Int = _hashCode
}
As for `equals`, I think you would need to come up with a way of generating a unique hash for your nodes and memoize it as well; `hashCode` is not unique, as it might very well produce collisions between non-equal objects:
case class BddNode(label: Int, positive: Bdd, negative: Bdd) extends Bdd {
  override def equals(that: Any): Boolean =
    that match {
      case node: BddNode => uniqueHash == node.uniqueHash
      case _             => false
    }

  private lazy val uniqueHash: String = {
    def childHash(child: Bdd): String = child match {
      case True          => "T"
      case False         => "F"
      case node: BddNode => node.uniqueHash
    }

    s"$label[${childHash(positive)},${childHash(negative)}]"
  }
}
This `uniqueHash` is very simplistic and consumes a lot of memory. I've used it in order to demonstrate what it stands for: the "path" from the node to the root BDDs (True and False), which determines the identity of the node.
If the logical meaning of the label is to denote the node's level of nesting, then it can be safely omitted from this hash method, as the level of nesting is implied by the path.
One more thing. If, when about to create a new node, it is guaranteed that the children are interned (i.e., no other live objects are logically equal to `positive` or `negative`), then you can use this equals method:
override def equals(that: Any): Boolean =
  that match {
    case node: BddNode =>
      // parentheses are required here: alphabetic operators like `eq`
      // bind more loosely than `==` and `&&` in Scala
      label == node.label && (positive eq node.positive) && (negative eq node.negative)
    case _ => false
  }
I think this is the sense of my other question.
I'm trying the following. Early results are positive; it's not dead slow as it was before.
I wonder if it is a bad idea to use pattern matching in the `equals` method. There was a previous discussion about how to write a similar `equals` method.
case class BddNode(label: Int, positive: Bdd, negative: Bdd) extends Bdd {
  override def equals(that: Any): Boolean =
    that match {
      case BddNode(l2: Int, p2: Bdd, n2: Bdd) =>
        (p2 eq positive) && (n2 eq negative) && (l2 == label)
      case _ => false
    }
  override def hashCode(): Int =
    label.## ^ System.identityHashCode(positive) ^ ~System.identityHashCode(negative)
  ...
}
So, one piece of good news: whereas my function was running out of memory at numNodes=40 with the previous implementation, after incorporating Eyal's suggestion (including my hashCode described above) I'm now able to compute numNodes=40 and 41.
@jimka:
Weren't you creating `BddNode`s in a way that ensures uniqueness of the `label` field? In that case you can simplify the code:
case class BddNode(label: Int, positive: Bdd, negative: Bdd) extends Bdd {
  override def equals(that: Any): Boolean =
    that match {
      case b: Bdd => this eq b
      case _      => false
    }
  override def hashCode(): Int =
    label * some_prime // multiply by a big prime to improve distribution
}
The issue with `System.identityHashCode` is that the Java specification guarantees it to be stable, but an object's address isn't: the JVM needs to keep a mapping of object address -> hash code, because the address of an object can change (the garbage collector can and often does move objects around). This mapping takes up memory, and the lookup can be costly compared to just accessing the `label` field. The mapping's contents are also created lazily, so if you never invoke `System.identityHashCode` you don't pay the price of mapping your object to an identity hash code.
@tarsa no, label is not a unique identifier.
Here is an image of what such a structure looks like. As you can see, many nodes have the same label: Z1, Z2, Z3, Z4. But a node is uniquely identified by its label + positive child + negative child.
What is the purpose of `System.identityHashCode` if it cannot be reliably used as a hash code? This seems strange to me, like I'm perhaps not understanding something fundamental. Do Maps work? Do they use `identityHashCode` internally?
It is reliable. What @tarsa said is that `identityHashCode` consumes more memory, but only after first usage (it is lazy), and therefore it would have been better if the Bdds had had a unique identifier in the first place (but they don't).
Is it possible in Scala to decrease the size of an object containing an integer by promising that the integer will always be between 0 (inclusive) and 256 (exclusive)? Exponentially many wasted bytes are expensive. Typically this type of optimization is only available in C-like languages.
You could use a `Byte`; i.e., `1.toByte`. I'm not sure how this helps, though.
Will Byte actually compile to a normal Integer?
Byte will be compiled to Byte. But like all Scala integer types, it is signed, so it goes from -128 to 127, not 0 to 255.
Ah OK, that sucks. Not sure if it is useful for my applications to have integers from 1 to 127.
BTW, was there ever a discussion of supporting unsigned integers in Scala 3?
`java.lang.Integer` in Java 8 implements some unsigned operations: `compareUnsigned`, `divideUnsigned`, `remainderUnsigned`, `parseUnsignedInt`, `toUnsignedString`. Addition, subtraction and multiplication don't need unsigned equivalents because in two's complement encoding they work the same.
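Those static helpers are callable from Scala directly; for example, reinterpreting the bit pattern of -1 as an unsigned 32-bit value:

```scala
// -1 has all 32 bits set, i.e. 4294967295 when read as unsigned
assert(Integer.toUnsignedString(-1) == "4294967295")

// Unsigned division: 4294967295 / 2 = 2147483647 (= Int.MaxValue)
assert(Integer.divideUnsigned(-1, 2) == Int.MaxValue)

// Unsigned comparison: 0xFFFFFFFF is larger than 1
assert(Integer.compareUnsigned(-1, 1) > 0)

// Parsing a value above Int.MaxValue yields the same bit pattern as -1
assert(Integer.parseUnsignedInt("4294967295") == -1)
```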
You could implement unsigned types in Scala 2, but if you go for AnyVals they may incur high overhead: https://failex.blogspot.com/2017/04/the-high-cost-of-anyval-subclasses.html OTOH Scala 3 has https://docs.scala-lang.org/sips/opaque-types.html, so it should be more feasible to do it there (by design, opaque types never box).
There are always some issues when doing operations on types of mixed signedness. The Rust language forces you to cast numbers to matching types before doing operations, but that's rather cheap syntactically (e.g. `something as i32`, `something_else as u32`). Also, unsigned types are a source of mistakes if you e.g. count down to zero and have an off-by-1 error in the counting somewhere. It's easier to avoid such mistakes with signed types.
As to numeric type ranges and bit sizes, they're pretty standard:
- Byte takes 1 byte and has range -(2^7) to (2^7)-1
- Short takes 2 bytes and has range -(2^15) to (2^15)-1
- Int takes 4 bytes and has range -(2^31) to (2^31)-1
- Long takes 8 bytes and has range -(2^63) to (2^63)-1
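The constants on the numeric companion objects confirm these ranges:

```scala
// Min/Max constants match the two's complement ranges listed above
assert(Byte.MinValue == -128 && Byte.MaxValue == 127)
assert(Short.MinValue == -32768 && Short.MaxValue == 32767)
assert(Int.MinValue == -2147483648 && Int.MaxValue == 2147483647)
assert(Long.MaxValue == 9223372036854775807L && Long.MinValue == -Long.MaxValue - 1)
```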
Nobody ever said “if only I had a signed Byte”, but pretending they are unsigned (and not using division on them) works well enough.
And especially since this works:
scala> 255.toByte
res1: Byte = -1
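And the usual trick for reading such a `Byte` back as an unsigned 0..255 value is to mask it up to an `Int`:

```scala
val b: Byte = 255.toByte     // stored as the signed value -1
val unsigned: Int = b & 0xFF // reinterpreted as 0..255

assert(b == -1)
assert(unsigned == 255)
```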