Problem Management Is/Isn’t Hard

Many organizations put real emphasis on incident management, or rather, let’s capitalize that, Incident Management, because it is an operations discipline in its own right. We have Incident Commanders, we have incident channels kept sterile of chatter to focus on actions, and we have lots of theory and collective experience.

But here’s the thing. Incident management is just a subset of problem management, because incidents are just Surprise Problems. Far too many service shops have an incident process that looks like this:

  1. Incident happens, things break
  2. All hands on deck, get it fixed
  3. Post-incident review, immediate fixes and long-term problems ticketed up
  4. Quick fixes happen and get pushed
  5. … ?? … tumbleweeds

Is that familiar? The quick fixes are selected for being small (and are probably specific to the most recent failure mode), so those tickets get done quickly. They satisfy two nice conditions: Rapidity and Doing Something. Happy management, amirite?

But the broader problem, the one that may well throw off a bunch of related incidents with related causes? That’s more systemic, harder, and more involved, and only a few in-demand engineers can take it on. It’s going to impact the roadmap; it’s all tangled up in tech debt. Nobody wants to drive the features timeline into that ditch.

But those problems are the real causes of failure. Complicated services have complicated failure modes, with incidents born of cascading failures. Those problems are the roots of those cascades, while the quick fixes tend to be mitigations at the end of the cascades, so each one is only likely to fix an individual failure path.

This is why it’s hard: it means root-and-branch re-evaluation of bits of the stack you thought were done, and time spent on something that should be finished already. Well, this is tech debt charging you interest. It was always going to happen.

If you see an incident as having revealed the problem, however, then the incident itself has coughed up something of value. You didn’t know about that problem before, or you didn’t understand its scope. Now you do. Don’t waste that knowledge, because you surely paid cash for it.

A problem should be treated as a new requirement for the system. That requirement is about reliability, supportability, data integrity, or whatever else applies to the incidents it would cause. When you see it as something with benefit to the service, you will be better placed to slot it into your roadmap.
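
One way to make that benefit visible is to tag each incident with the problem behind it and add up what each problem has cost so far. Here is a minimal sketch in Python. The `Incident` record, the `problem` label, the downtime figures and the sample data are all made up for illustration; a real shop would pull these from its own ticketing system and would probably weigh more than minutes of downtime.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    id: str
    problem: str       # hypothetical label assigned in the post-incident review
    minutes_down: int  # hypothetical cost measure; use whatever you actually track


def rank_problems(incidents):
    """Group incidents by underlying problem, highest total downtime first."""
    totals = defaultdict(lambda: {"incidents": 0, "minutes_down": 0})
    for inc in incidents:
        totals[inc.problem]["incidents"] += 1
        totals[inc.problem]["minutes_down"] += inc.minutes_down
    return sorted(
        totals.items(),
        key=lambda item: item[1]["minutes_down"],
        reverse=True,
    )


# Invented sample data, purely illustrative.
incidents = [
    Incident("INC-1", "retry storm on queue failover", 40),
    Incident("INC-2", "expired certificate", 15),
    Incident("INC-3", "retry storm on queue failover", 65),
]

ranked = rank_problems(incidents)
for problem, cost in ranked:
    print(problem, cost)
```

A problem that keeps turning up at the top of that list is a requirement with a price already attached, which is an easier thing to argue onto a roadmap than a vague feeling that something is fragile.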