You need some sort of node registration. If a new node is available, it
would contact the active cluster to signal it wants to join.
If you model the entire cluster as one actor system with a single root
actor, then indeed it would not recover from a failure of the root node, or
at least the recovery process would not easily fit within the actor model.
That may be an acceptable risk, since the failure of one particular node is
much smaller than the failure of any of many nodes. You can also try to
have the root node be more resilient than other nodes.
An example of a distributed actor system is Apache Spark (based on Akka):
Spark can easily recover from the failure of any worker node, but not from
a failure of the master node.
Now you may wonder, why don’t we build a system with no single point of
failure? Because it is very difficult, and there are some fundamental
limitations, like CAP theorem.
One system designed to recover from the failure of any node is Apache
Cassandra. Each piece of payload data is stored redundantly on multiple
nodes. Nodes are constantly polled to detect failures. The problem is that
there is no way to know with certainty whether a node is dead or just slow
to respond. If you wait too long, a second node might die while you have
not yet completed recovery from the first dead node. Nodes constantly need
to reconcile transactions, especially when they learn about a series of
transactions in an order different from the order in which they were
There are also systems that can recover from the failure of any node,
such as Apache Cassandra. The basic idea is that for any function performed
by any node, there is another node waiting to take over, combined with
constant polling to detect failures. This is, however, extremely difficult
to implement reliably, because how do you distinguish between a dead node
and one that is slow to respond?