Reconciling Fatal exceptions and Scala Futures

sup · February 23, 2024, 9:42pm

Context
I recently ran into a few issues with Futures and fatal exceptions. Our platform started hitting timeouts where some Futures kept hanging no matter how long we awaited them.

I eventually realized that, unlike Java CompletableFutures, certain Fatal exceptions cause Scala Futures to never complete (not even exceptionally). And after reading some threads on this topic, it seems like this is by design; see scala/bug issue #12152 on GitHub, '"fatal" errors are not reported to the ExecutionContext'.
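For anyone who wants to see the difference locally, here is a minimal sketch of what I mean. It assumes Scala 2.12 or 2.13 with the global ExecutionContext, and it uses a hand-thrown LinkageError as a stand-in for a real fatal error. The object and method names are made up for illustration.

```scala
import java.util.concurrent.{CompletableFuture, TimeUnit}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object FatalFutureDemo {
  // LinkageError is one of the types scala.util.control.NonFatal treats as fatal.
  def scalaFuture(): Future[String] =
    Future[String] { throw new LinkageError("simulated fatal error") }

  // Same failure, but run through a Java CompletableFuture.
  def javaFuture(): CompletableFuture[String] =
    CompletableFuture.supplyAsync[String](() => throw new LinkageError("simulated fatal error"))

  def main(args: Array[String]): Unit = {
    val sf = scalaFuture()
    // The await should time out: the promise behind `sf` is never completed.
    println(Try(Await.result(sf, 2.seconds)))
    println(s"Scala Future completed: ${sf.isCompleted}")

    val jf = javaFuture()
    // The Java future should fail fast, with the LinkageError as the cause.
    println(Try(jf.get(2, TimeUnit.SECONDS)))
    println(s"Java future completed exceptionally: ${jf.isCompletedExceptionally}")
  }
}
```

The error itself still escapes to the pool thread (so it may show up on stderr or in an uncaught exception handler), but the caller holding the Scala Future only ever sees a timeout.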

While unhandled exception handlers can be written to improve monitoring when this happens, it doesn’t actually affect the runnable – the Future itself will never complete once in this state and there’s not much we can do about it at that point.

Ideally we can fail certain aspects of our service gracefully without affecting other parts of the system. But even if we wanted to do that today, we are unable to do so quickly as the fatals are indistinguishable from a timeout by the awaiting caller.

As far as I can tell, the current solution to this is to check every possible callsite, handle potential fatals, and document reasoning where appropriate. However, at our scale, we can perhaps catch most cases but not all. Even if we could do this and ensure all of our code was bulletproof as an invariant, we might receive Futures from external libraries that lie outside our control.

Question

I’m not a Scala expert. Is my understanding of Fatal exceptions and Futures here accurate?
Am I missing a possible workaround, or is there really no simple recourse for my use case here?
For anyone else running into this problem at scale, what was your compromise on this behavior?

sup · February 24, 2024, 2:15am

Coming back to this with our current workaround:

Thankfully (for our niche), many of the common scenarios where we may run into this issue are mitigable because we use Clump. Since the embedded DSL decouples execution from composition, we’re able to mutate the logic written by our engineers to program defensively against fatal exceptions with a custom runtime exception wrapper. This is done during the traversal of the execution graph so that we don’t need to manually inspect and keep up with every call-site.
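To give a rough idea of the wrapper itself (not the Clump traversal, and not Clump's API), here is a simplified sketch on a plain Scala Future. The names WrappedFatalException and GuardedFuture are hypothetical, and the set of errors caught is only an assumption about what one might consider safe to convert.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try
import scala.util.control.ControlThrowable

// Hypothetical wrapper type: a plain RuntimeException, so NonFatal accepts it.
final class WrappedFatalException(val underlying: Throwable)
  extends RuntimeException(s"Fatal error inside Future: $underlying", underlying)

object GuardedFuture {
  // Runs `body` in a Future and converts selected fatal errors into a non-fatal
  // exception, so the Future completes with a failure instead of hanging.
  def apply[A](body: => A)(implicit ec: ExecutionContext): Future[A] =
    Future {
      try body
      catch {
        // Deliberately narrow: VirtualMachineError (e.g. OutOfMemoryError) is not caught.
        case e: LinkageError     => throw new WrappedFatalException(e)
        case e: ControlThrowable => throw new WrappedFatalException(e)
      }
    }
}

object GuardedFutureDemo {
  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val f = GuardedFuture[String] { throw new LinkageError("simulated fatal error") }
    // The await should now return quickly with a Failure wrapping the LinkageError.
    println(Try(Await.result(f, 2.seconds)))
  }
}
```

This only guards the synchronous body handed to the wrapper. A Future that is created somewhere else and merely returned to us is not covered.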

Another alternative to the underlying primitive is Twitter Futures, which do appear to complete exceptionally on fatal exceptions, based on my local testing and a skim of the source code.

That being said, we currently do not have a workaround for Futures returned from external libraries or code that our org does not control (and it doesn’t seem likely we’ll be able to handle those scenarios gracefully for the foreseeable future AFAICT).

BalmungSan · February 24, 2024, 1:36pm

Fatal exceptions should just kill your program immediately, period.
Trying to recover from them is a bad idea, since common real fatal exceptions are things like out of memory, stack overflow, etc.

If you are using fatal errors to signal recoverable failures of your system that is the issue you must solve.

jducoeur · February 24, 2024, 4:56pm

Yeah, I’m kind of confused by the use case. Fatal means fatal – they’re specifically unrecoverable errors, and trying to recover from them tends to mean winding up with ever-more-confusing cascading errors. (In my experience, fatal exceptions are most often OOMs, and trying to do much of anything after that is usually a mistake.)

I’m curious about what you’re expecting to be able to do with these fatal exceptions – it’s just not something I’m used to people even trying…

sup · February 24, 2024, 7:21pm

Yeah happy to elaborate a bit! First, I’ll start off by saying that our situation is likely corner-case enough that the everyday developer may not need this functionality. I’m certainly also not advocating that everyone should treat all fatal exceptions as recoverable. And of course, it’s entirely possible this might just be us trying to fit a square peg in a round hole – perhaps Scala Futures aren’t the best underlying primitive for our particular use case and niche. And that’s okay!

That being said, the difference in flexibility for corner cases when it comes to exception handling between the standard Java CompletableFuture and standard Scala Future has caused some confusion on our end recently since we’ve considered them effectively equivalent until now.

Obviously for existential issues like an OOM, we don’t want to keep our program alive. Our infrastructure will actually kill the server before it even becomes a problem within the application layer. But for our specific use case, there exist certain fatal exceptions that may happen in non-critical flows for reasons that are not existential to the survival of the overall application. In these situations, it would be irresponsible to crash the entire web-server at our scale. For instance:

If one non-critical endpoint or path contained a human programming error like returning from a nested anonymous function and threw a NonLocalReturnControl within a Future, we would want to quickly fail that particular path without crashing other tenants of our system at runtime.
If one non-critical endpoint or path utilized an external library we don’t control that threw a LinkageError within a Future from an execution context we don’t have visibility into, we’d want to quickly fail that particular path without crashing other tenants of our system at runtime and deal with the root cause at a later time.

Our particular system is designed in a way where most “paths” can be essentially independently evaluated in parallel, joined + awaited on, and continue “best effort” mainline execution regardless of failures in each path. But because these are indistinguishable from timeouts, it’s not possible for our orchestration logic to determine whether we’re still waiting on I/O or if one path failed because someone wrote buggy code.
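As a rough sketch of that orchestration problem (heavily simplified, with made-up names like PathOutcome and BestEffort, and a single shared time budget as an assumption), this is roughly what the joining side can observe:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical outcome type for one independently evaluated path.
sealed trait PathOutcome[+A]
final case class Completed[A](value: A) extends PathOutcome[A]
final case class FailedFast(error: Throwable) extends PathOutcome[Nothing]
case object StillPending extends PathOutcome[Nothing]

object BestEffort {
  // Waits for each path until a shared deadline, then reports what is known.
  def settle[A](paths: Seq[Future[A]], budget: FiniteDuration): Seq[PathOutcome[A]] = {
    val deadline = budget.fromNow
    paths.map { path =>
      // Await.ready throws TimeoutException on timeout; swallow it and inspect the state.
      Try(Await.ready(path, deadline.timeLeft))
      path.value match {
        case Some(Success(v)) => Completed(v)
        case Some(Failure(e)) => FailedFast(e)
        case None             => StillPending // slow I/O or a fatal error: cannot tell which
      }
    }
  }
}

object BestEffortDemo {
  def main(args: Array[String]): Unit = {
    val paths = Seq(
      Future.successful("ok"),
      Future.failed[String](new IllegalStateException("ordinary failure")),
      Future[String] { throw new LinkageError("simulated fatal error") }
    )
    BestEffort.settle(paths, 2.seconds).foreach(println)
  }
}
```

The ordinary failure surfaces immediately as FailedFast, while the path that hit the fatal error looks exactly like a path that is still waiting on I/O until the whole budget is spent.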

Certainly I’d love if all of our code and external dependencies were bulletproof by fiat – but unfortunately unexpected issues do happen and it’s currently harder than expected to write our application in a way to mitigate those failures quickly.

jducoeur · February 24, 2024, 9:28pm

I see – okay, yeah, I’ve written middleware engines like that, so I can understand what you’re thinking here.

Not sure I see any great answers here, at least not offhand. I agree that the code in Promise pretty clearly says that this behavior is by-design, and is correct for most normal applications. This sort of heterogeneity (where a module can throw a fatal Error while the larger program is still considered healthy) inside a single JVM is somewhat unusual; I don’t think the stdlib Future is really designed for it…

som-snytt · February 25, 2024, 1:21am

There is a recent issue as to whether NonFatal is the worst sauce ever.

On 2.13, Future is more robust in the face of InterruptedException; a fix was recently applied to 2.12, to allow futures to complete.

However, after an Error, “there’s not much you can do at that point.” Either sandboxing or restarting is a platform problem.

I think no visibility or configurability of execution contexts is very unfortunate.

That said, I, too, have been in personal and professional relationships in which I waited too long for an appropriate response. They kept my future hanging. I recommend, “Let it crash and burn!” Don’t forget to hang up on your end.

But if we can reconcile fatal Errors with Scala Future, then there is hope for world peace, green energy, and equal rights.