In our daily R&D activities, we're no strangers to tackling challenges, from spotting problems to fixing them. Even with deep expertise and solid planning, issues still show up.
That's where the SRE mantra "Learn and improve from failures" comes into play, more commonly known as post-mortem analysis or incident review.
Let's dive into what I've learned from conducting these post-mortems, and share some nuggets of wisdom.
Try asking some key questions
In our previous post-mortem analyses, we've hit quite a few bumps. Initially, we tried tracing every issue back to one root cause, hoping for a one-stop fix.
But soon we realized that 'root cause' meant different things to different folks. This often led to blame games instead of helpful discussions, piling on the stress.
Here are a couple of examples to paint the picture:
Say a server hiccup messes with business apps. The business development team might point fingers at the server's instability, while the system maintenance team could argue that the real issue is the lack of backup plans in the apps. Two teams, two root cause theories.
What about when network issues drop business requests? Is it poor network quality or the apps' lack of solid backup plans?
Things get even hairier with third-party cloud services, where you're playing the blame game not just internally but with external providers too.
Our many retrospectives taught us a key lesson: Failures usually have more than one root cause. It's better to look at all factors, not just fight over one.
We now make a list of everything that could've contributed, from the main mess-up to the smaller slip-ups and bad decisions. Then we brainstorm how to fix each part.
We also break each incident's timeline into phases: how long it took to detect the problem, to understand it, to fix it, and to verify the fix. We focus on the phases that consumed the most time, think about how to do better, and agree on who does what by when.
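To make that concrete, here's a minimal Python sketch of how such a timeline breakdown might be recorded so the biggest time sinks stand out; the phase names, owners, and actions are made up for illustration:

```python
from dataclasses import dataclass

# A minimal sketch of recording an incident timeline by phase so the biggest
# time sinks stand out. The phase names, owners, and actions are illustrative.

@dataclass
class Phase:
    name: str        # e.g. "detect", "diagnose", "fix", "verify"
    minutes: int     # wall-clock time spent in this phase
    owner: str       # who drives the improvement for this phase
    action: str      # agreed follow-up, with a due date

def biggest_time_drains(phases, top_n=2):
    """Return the phases that consumed the most time, so the review
    focuses its improvement actions there."""
    return sorted(phases, key=lambda p: p.minutes, reverse=True)[:top_n]

timeline = [
    Phase("detect",   5,  "monitoring team", "alert on error-rate SLO by 06-30"),
    Phase("diagnose", 45, "app team",        "improve tracing coverage by 07-15"),
    Phase("fix",      20, "app team",        "write a rollback runbook by 07-01"),
    Phase("verify",   10, "QA",              "automate smoke checks by 07-10"),
]

for phase in biggest_time_drains(timeline):
    print(f"{phase.name}: {phase.minutes} min -> {phase.owner}: {phase.action}")
```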
In our meetings, we stick to three main questions:
What contributed to the incident?
How can we stop something similar from happening again?
What could we have done differently at the time for a quicker fix?
These questions keep us on track, away from the blame game. The meeting leader, be it a Communication Lead or tech support, makes sure we stick to these constructive points.
Principles of fault determination
In refining our incident analysis process, we've identified causes and pinpointed improvement areas. The next step is deciding who takes charge of these improvements. It's key to separate owning the problem from owning the solution.
Here are some guiding principles:
Robustness: Ensuring Self-Reliance
Each component should be somewhat self-reliant, with fail-safes like redundancy and retries.
For example, if Component B relies on Component A, and A recovers from an issue but B doesn't, causing further problems, the onus is on B, not A.
Take server and network faults. If a network hiccup is quickly fixed but an app doesn't bounce back, it's on the app to improve, not the network.
In cases of strong-weak dependencies, a core app should have a fallback for any non-core app it relies on. If a core app goes down because of a non-core app's glitch, it's the core app's job to make improvements.
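As a rough sketch of this self-reliance idea, here's what retry-plus-fallback might look like in Python; the dependency call, retry settings, and the cached default are all hypothetical:

```python
import time

# A rough sketch of the "self-reliance" idea: the caller (Component B) retries
# its dependency (Component A) with backoff and then degrades gracefully instead
# of failing outright. The dependency call and cached default are hypothetical.

class DependencyError(Exception):
    pass

def fetch_from_dependency() -> dict:
    # Stand-in for a call to Component A that is currently failing.
    raise DependencyError("component A is temporarily unavailable")

def get_profile(user_id: str, retries: int = 3, backoff_s: float = 0.2) -> dict:
    for attempt in range(retries):
        try:
            return fetch_from_dependency()
        except DependencyError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between retries
    # Degrade gracefully: serve a default/cached value rather than propagate the fault.
    return {"user_id": user_id, "profile": "default", "stale": True}

print(get_profile("u-42"))
```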
Third-Party Responsibility
When using third-party services like cloud platforms, the default is that these providers are not to blame.
If a third-party service issue arises, internal teams should work on internal improvements, and also push for improvements in third-party services. But remember, stability shouldn't fully depend on outsiders.
For instance, apps using Aliyun services should be able to switch providers in case of trouble. The same goes for cloud storage; have a backup ready in case one region has issues.
This principle is like a nation's defense: cooperate with allies but never fully depend on them. It also keeps internal teams from passing the buck to cloud providers after migrating to the cloud, maintaining a sense of internal responsibility.
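To illustrate the storage example above, here's a minimal Python sketch of failing over between backends; the backend names are made up, and a real version would wrap the actual provider SDKs (e.g. Aliyun OSS, AWS S3) behind a shared interface:

```python
# A minimal sketch of failing over between storage backends when one provider
# or region is having trouble. The backend names are hypothetical.

class StorageBackend:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def put(self, key: str, data: bytes) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"stored {key} ({len(data)} bytes) in {self.name}"

def put_with_fallback(backends, key: str, data: bytes) -> str:
    """Try the primary backend first, then fall back to the standby."""
    last_error = None
    for backend in backends:
        try:
            return backend.put(key, data)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("all storage backends failed") from last_error

backends = [
    StorageBackend("primary-region-oss", healthy=False),  # primary having trouble
    StorageBackend("standby-region-s3"),                  # backup provider/region
]
print(put_with_fallback(backends, "reports/2024-06.csv", b"..."))
```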
Segmented Analysis
In complex cases with secondary faults or different causes, we break the incident into parts.
For example, an issue starting with an inaccurate model might later be compounded by a DBA's mistake.
This segmented approach helps focus on specific issues for targeted improvements.
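For example, the segmented write-up might boil down to something like this; a hypothetical record, loosely based on the model-plus-DBA example above:

```python
# A hypothetical segmented record for the incident above: each segment gets
# its own cause, improvement action, and owner, rather than one root cause.

incident_segments = [
    {
        "segment": "initial fault",
        "cause": "inaccurate capacity model underestimated peak traffic",
        "improvement": "recalibrate the model against recent peak loads",
        "owner": "capacity planning",
    },
    {
        "segment": "secondary fault",
        "cause": "manual DBA change during mitigation worsened the outage",
        "improvement": "require peer review for emergency database changes",
        "owner": "DBA team",
    },
]

for seg in incident_segments:
    print(f"[{seg['segment']}] {seg['cause']} -> {seg['improvement']} ({seg['owner']})")
```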
These principles aim to move us from seeking a single root cause to a broader perspective, opening more paths for self-improvement and continuous learning from incidents.
Wrap Up
Our key takeaway from fault retrospection is: focus less on finding one root cause and more on actions that enhance fault management and reduce business downtime.
We've got our Three Key Questions for retrospection and three principles for determining faults, helping assign clear responsibilities for improvements. Hopefully, these will aid effective fault analysis.
In conclusion, faults are part of running systems; a completely fault-free state is the exception, not the rule. Whatever your role, view faults with balance. Reaching that balance takes a nuanced view of faults and a culture that values improvements over penalizing mistakes.