Someone posted the idea that gradually rolling out a feature dilutes the signal of a catastrophic failure, making it harder to identify the cause.
An illustrative example was a Kafka outage that resulted in a service disruption.
I noticed this paragraph (because it mentioned Scala) and wondered at the blame placed specifically on the lack of an explicit new.
Scala Inconsistency in Our System
- Usage of the pekko-connectors-kafka Scala library – passing Kafka settings instead of a pre-created producer instance led the connector to generate a new producer per request rather than reusing an existing one as intended.
- Every time send() is called, a new producer is created. This ended up being a blind spot visually due to the lack of an explicit new operator as a hint that a new object is being allocated.
- This, in combination with reduced visibility into short-lived Kafka producer tracking, allowed the incident to occur, despite having a gradual rollout process in place.
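The mechanism the post-mortem describes is easy to reproduce in miniature. Below is a self-contained sketch — not the actual pekko-connectors-kafka API; the `Producer` class, the settings string, and the allocation counter are invented for illustration — of how a companion-object `apply` factory allocates a fresh instance on every call, with no `new` at the call site to hint at it:

```scala
object HiddenAllocationDemo {
  var allocations = 0

  final class Producer private (val id: Int) {
    def send(msg: String): Unit = () // stand-in for a real network send
  }

  object Producer {
    // Companion `apply`: `Producer(settings)` reads like a lookup,
    // but it allocates a fresh instance on every call -- no `new` in sight.
    def apply(settings: String): Producer = {
      allocations += 1
      new Producer(allocations)
    }
  }

  // Bug pattern: constructing from settings inside the per-request path.
  def sendPerRequest(settings: String, msg: String): Unit =
    Producer(settings).send(msg)

  // Intended pattern: build once, reuse everywhere.
  val shared: Producer = Producer("illustrative-settings")
  def sendShared(msg: String): Unit = shared.send(msg)

  def main(args: Array[String]): Unit = {
    val before = allocations // 1, from `shared`
    (1 to 3).foreach(i => sendPerRequest("illustrative-settings", s"m$i"))
    println(allocations - before) // 3: one producer per send
    (1 to 3).foreach(i => sendShared(s"m$i"))
    println(allocations - before) // still 3: the shared producer is reused
  }
}
```

Running `main` prints 3 twice: the per-request path allocated one producer per send, while the shared path allocated nothing further — consistent with the post-mortem's description, though whether syntax alone would have caught it is the question at hand.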
The dubious phrase, “This [thing we’re blaming], in combination with [various other factors], allowed the incident to occur, despite having a gradual rollout,” reads like cut-and-paste boilerplate.
The social media post highlighted the “contributing factor”:
The new feature was (intentionally) gradually enabled via configuration rather than a distinct deployment. While this is intended as safety, it also made it harder to immediately connect the outage to the change.
But the “follow-up items” include:
slower ramp-up windows
which (by this theory) would exacerbate the problem of correlating the failure with the deployment.
I assume that the post-mortem fails to identify engineering processes that could have made a difference, but blaming “the lack of an explicit new” calls into question everything one might take for granted about the effectiveness of syntax.
(When I have performed a graduated rollout, I suffered anxiety and hypervigilance for at least 24 hours, so I don’t necessarily subscribe to the dilution theory.)