Just for discussion here,
I was thinking Akka is great for writing distributed systems, but if your Supervisor and Actors all live on one machine, your distributed system will not be highly available: if that machine goes down, the whole system goes down with it.
So how about putting the Supervisor on one machine and the Actors on separate machines? Then if one Actor dies, there are still others to handle the work. But if I bring up a replacement machine, how can the Supervisor know that there is a new machine that can house a new Actor?
Ultimately the supervisor tree leads up to a Root Supervisor. What if the machine that houses the Root Supervisor dies? Doesn't that make it the weakest link in the whole distributed system? How about having an additional Root Supervisor node to fail over to? Or several, with a load balancer in front of all the Root Supervisors to distribute the load?
This is just my 0.02, but I think you might be trying to solve this problem at the wrong level of abstraction. What stops you from just running multiple instances of your actor application in an active/passive configuration?
I actually do not have a specific problem in mind. I was just wondering how an Akka distributed system can be made highly available.
You need some sort of node registration: when a new node becomes available, it contacts the active cluster to signal that it wants to join.
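With Akka Cluster this registration is built in: a new node joins via configured seed nodes, and any actor can subscribe to membership events to hear about it. A rough sketch using the classic Cluster API (the actor name and the deployment step are my own, left as a comment):

```scala
import akka.actor._
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Runs on the supervising node: subscribes to cluster membership events,
// so it is told about every machine that joins and can react to it.
class MemberWatcher extends Actor {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberUp], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberUp(member) =>
      // a replacement machine has joined: deploy a new Actor there
      println(s"node up: ${member.address}")
    case UnreachableMember(member) =>
      println(s"node unreachable: ${member.address}")
  }
}
```

The new node itself only needs `akka.cluster.seed-nodes` in its configuration pointing at the existing cluster; it then joins automatically on startup.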
If you model the entire cluster as one actor system with a single root
actor, then indeed it would not recover from a failure of the root node, or
at least the recovery process would not easily fit within the actor model.
That may be an acceptable risk, since the probability that one particular node fails is much smaller than the probability that any one of many nodes fails. You can also try to make the root node more resilient than the other nodes.
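To make the boundary concrete: a supervisor can restart failed child actors, but only while its own node is alive. A minimal classic-Akka sketch (actor and message names are my own) of what supervision does cover:

```scala
import akka.actor._
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._

// Root supervisor: restarts a crashed child. This recovers from actor
// failures, not from the loss of the JVM/machine the supervisor itself
// runs on -- that case is outside the actor model, as noted above.
class RootSupervisor extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ActorInitializationException => Stop
      case _: Exception                    => Restart
    }

  val worker: ActorRef = context.actorOf(Props[Worker], "worker")

  def receive: Receive = {
    case msg => worker.forward(msg)
  }
}

class Worker extends Actor {
  def receive: Receive = {
    case job => sender() ! s"done: $job"
  }
}
```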
An example of a distributed actor system is Apache Spark (based on Akka):
Spark can easily recover from the failure of any worker node, but not from
a failure of the master node.
Now you may wonder, why don’t we build a system with no single point of
failure? Because it is very difficult, and there are fundamental
limitations, like the CAP theorem.
One system designed to recover from the failure of any node is Apache
Cassandra. Each piece of payload data is stored redundantly on multiple
nodes, and for any function performed by one node there is another node
ready to take over, combined with constant polling to detect failures.
This is, however, extremely difficult to implement reliably: there is no
way to know with certainty whether a node is dead or just slow to respond.
If you wait too long, a second node might die before you have finished
recovering from the first. Nodes also constantly need to reconcile
transactions, especially when they learn about a series of transactions in
an order different from the order in which they were issued.
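Stripped of everything else, that detection problem is just a timeout: a peer that misses heartbeats for long enough gets *suspected*, whether it is dead or merely slow. A self-contained sketch (node names and timestamps are made up):

```scala
// Timeout-based failure detection: record the last time a heartbeat was
// seen from each peer; suspect a peer once no heartbeat has arrived
// within the timeout. A slow-but-alive peer looks exactly like a dead one.
object FailureDetectorSketch extends App {
  val timeoutMillis = 3000L

  // hypothetical peers -> time (ms) their last heartbeat arrived
  val lastHeartbeat = Map("node-a" -> 1000L, "node-b" -> 5000L)
  val now = 6000L

  def suspected(peer: String): Boolean =
    now - lastHeartbeat(peer) > timeoutMillis

  lastHeartbeat.keys.toSeq.sorted.foreach { p =>
    println(s"$p suspected=${suspected(p)}")
  }
}
```

Real systems refine this (Cassandra and Akka both use a phi-accrual detector that adapts the threshold to observed heartbeat intervals), but the fundamental ambiguity between "dead" and "slow" remains.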