What's new (in reliability)?

Someone posted the idea that a gradual rollout of a feature dilutes the signal of catastrophic failure so that it’s harder to identify a cause.

An illustrative example was a Kafka outage that resulted in a service disruption.

I noticed this paragraph (because it mentioned Scala) and wondered at the specific blame placed on the lack of an explicit new.

Scala Inconsistency in Our System

  • Usage of the pekko-connectors-kafka Scala library – passing Kafka settings instead of a pre-created producer instance led the connector to generate a new producer per request rather than reusing an existing one as intended.
  • Every time send() is called, a new producer is created. This ended up being a blind spot visually due to the lack of an explicit new operator as a hint that a new object is being allocated.
  • This, in combination with reduced visibility into short-lived Kafka producer tracking, allowed the incident to occur, despite having a gradual rollout process in place.
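The blind spot the post-mortem describes, an allocation with no new visible at the call site, is a general consequence of Scala's apply convention: a companion object's apply() can construct a fresh object on every call while reading like a lookup. Here is a minimal self-contained sketch (hypothetical names throughout; this is not the pekko-connectors-kafka API):

```scala
object HiddenAllocationDemo {
  // Hypothetical stand-in for a heavyweight Kafka producer; not the real API.
  final class ProducerLike private ()

  object ProducerLike {
    var created = 0 // count allocations for demonstration
    // apply() allocates on every call; no `new` appears at call sites.
    def apply(): ProducerLike = {
      created += 1
      new ProducerLike // the only `new` is here, hidden from callers
    }
  }

  def send(msg: String): Unit = {
    // Reads like fetching a shared handle; actually constructs a fresh producer.
    val producer = ProducerLike()
    // ... producer would publish msg here ...
  }

  def main(args: Array[String]): Unit = {
    send("a"); send("b"); send("c")
    println(ProducerLike.created) // 3: one producer per send
  }
}
```

Whether that call spelling should count as a "hint" about allocation is exactly what the post-mortem's framing takes for granted.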

The dubious phrase, “This [thing we’re blaming], in combination with [various other factors], allowed the incident to occur, despite having a gradual rollout,” reads like cut-and-paste boilerplate.

The social media post highlighted the “contributing factor”:

The new feature was (intentionally) gradually enabled via configuration rather than a distinct deployment. While this is intended as safety, it also made it harder to immediately connect the outage to the change.

But the “follow-up items” includes:

slower ramp-up windows

which (by this theory) would exacerbate the problem of correlating the failure with the deployment.

I assume that the post-mortem fails to identify engineering processes that could have made a difference, but blaming “the lack of an explicit new” calls into question everything one might take for granted about the effectiveness of syntax.

(When I have performed a graduated rollout, I suffered anxiety and hypervigilance for at least 24 hours, so I don’t necessarily subscribe to the dilution theory.)


Interesting - kudos to those folks for owning it in public. Give them credit for holding it down as a team.

Speculating wildly, with just a bit of hindsight and no background whatsoever, did someone start with this documentation snippet…

// Imports added for completeness; producerDefaults and topic1 are
// defined earlier in the documentation.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import org.apache.kafka.clients.producer.{ProducerRecord, RecordMetadata}
import org.apache.pekko.kafka.scaladsl.SendProducer

val producer = SendProducer(producerDefaults)
try {
  val send: Future[RecordMetadata] = producer
    .send(new ProducerRecord(topic1, "key", "value"))
  // Blocking here for illustration only, you need to handle the future result
  Await.result(send, 2.seconds)
} finally {
  Await.result(producer.close(), 1.minute)
}

… and then dropped the RAII part?
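In Scala, the RAII part is usually a try/finally like the one above, or scala.util.Using from the standard library. A minimal sketch of what gets lost when that part is dropped (ProducerLike is a hypothetical stand-in, not the pekko API):

```scala
import scala.util.Using

object LoanPatternDemo {
  // Hypothetical resource standing in for a Kafka producer; a real close()
  // would flush buffers and tear down broker connections.
  final class ProducerLike extends AutoCloseable {
    var closed = false
    def send(msg: String): Unit = require(!closed, "producer already closed")
    def close(): Unit = closed = true
  }

  def main(args: Array[String]): Unit = {
    val producer = new ProducerLike
    // Using.resource guarantees close() runs even if the body throws --
    // the guarantee that silently disappears when the snippet is copied
    // without its try/finally.
    Using.resource(producer) { p => p.send("hello") }
    println(producer.closed) // true
  }
}
```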

Unless you’re familiar with Kafka and with having to call close on something with plenty of mass under the waterline, I guess you could be exposed to a disaster of potentially Titanic proportions.

It doesn’t sound like they weren’t closing things; rather the opposite. It sounds like they expected to reuse a producer session when they were actually recreating one on every send, at much higher overhead.

So starting small, everything is fine because the expense can be managed; but once things scale, it falls over way earlier than expected.
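That scaling cliff can be sketched in plain Scala, with an allocation counter standing in for the real cost of constructing a Kafka producer (all names here are hypothetical, not the connector's API):

```scala
object ReuseVsRecreateDemo {
  // Each construction bumps the counter, modeling connection/allocation cost.
  final class ProducerLike { ProducerLike.created += 1 }
  object ProducerLike { var created = 0 }

  // Broken: a "new producer per request", as in the post-mortem. Cost
  // grows linearly with traffic.
  def sendRecreating(msgs: Seq[String]): Unit =
    msgs.foreach(_ => new ProducerLike) // fresh producer per message

  // Intended: create once up front, reuse for every send. Cost is constant.
  def sendReusing(msgs: Seq[String]): Unit = {
    val p = new ProducerLike // one allocation up front
    msgs.foreach(_ => ()) // every message would be published through p
  }

  def main(args: Array[String]): Unit = {
    sendRecreating(Seq.fill(100)("m"))
    println(ProducerLike.created) // 100
    ProducerLike.created = 0
    sendReusing(Seq.fill(100)("m"))
    println(ProducerLike.created) // 1
  }
}
```

At low traffic the two are indistinguishable in practice, which is why a gradual rollout would only surface the difference late.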