In our daily R&D activities, we're no strangers to tackling challenges, from spotting problems to fixing them. Even with deep expertise and solid planning, issues still show up.
That's where the SRE mantra "Learn and improve from failures" comes into play, more commonly known as post-mortem analysis or incident review.
Let's dive into what I've learned from conducting these post-mortems, and share some nuggets of wisdom.
Try asking some key questions
In our previous post-mortem analyses, we've hit quite a few bumps. Initially, we tried tracing every issue back to one root cause, hoping for a one-stop fix.
But soon we realized that 'root cause' meant different things to different folks. This often led to blame games instead of helpful discussions, piling on the stress.
Here are a couple of examples to paint the picture:
Say a server hiccup messes with business apps. The business development team might point fingers at the server's instability, while the system maintenance team could argue that the real issue is the lack of backup plans in the apps. Two teams, two root cause theories.
What about when network issues drop business requests? Is it poor network quality or the apps' lack of solid backup plans?
Things get even hairier with third-party cloud services, where you're playing the blame game not just internally but with external providers too.
Our many retrospectives taught us a key lesson: Failures usually have more than one root cause. It's better to look at all factors, not just fight over one.
We now make a list of everything that could've contributed, from the main mess-up to the smaller slip-ups and bad decisions. Then we brainstorm how to fix each part.
We also break each incident's timeline into phases: how long it took to detect the problem, to understand it, to fix it, and to verify the fix. We focus on the phases that consumed the most time, think about how to do better, and agree on who does what by when.
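To make that concrete, here's a minimal Python sketch of how such a timeline breakdown might be recorded so the biggest time sinks stand out; the phase names, owners, and actions are made up for illustration:

```python
from dataclasses import dataclass

# A minimal sketch of recording an incident timeline by phase so the biggest
# time sinks stand out. The phase names, owners, and actions are illustrative.

@dataclass
class Phase:
    name: str        # e.g. "detect", "diagnose", "fix", "verify"
    minutes: int     # wall-clock time spent in this phase
    owner: str       # who drives the improvement for this phase
    action: str      # agreed follow-up, with a due date

def biggest_time_drains(phases, top_n=2):
    """Return the phases that consumed the most time, so the review
    focuses its improvement actions there."""
    return sorted(phases, key=lambda p: p.minutes, reverse=True)[:top_n]

timeline = [
    Phase("detect",   5,  "monitoring team", "alert on error-rate SLO by 06-30"),
    Phase("diagnose", 45, "app team",        "improve tracing coverage by 07-15"),
    Phase("fix",      20, "app team",        "write a rollback runbook by 07-01"),
    Phase("verify",   10, "QA",              "automate smoke checks by 07-10"),
]

for phase in biggest_time_drains(timeline):
    print(f"{phase.name}: {phase.minutes} min -> {phase.owner}: {phase.action}")
```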
In our meetings, we stick to three main questions:
What contributed to the incident?
How can we stop something similar from happening again?
What could we have done differently at the time for a quicker fix?
These questions keep us on track, away from the blame game. The meeting leader, be it a Communication Lead or tech support, makes sure we stick to these constructive points.
Principles of fault determination
In refining our incident analysis process, we've identified causes and pinpointed improvement areas. The next step is deciding who takes charge of these improvements. It's key to separate owning the problem from owning the solution.
Here are some guiding principles:
Robustness: Ensuring Self-Reliance
Each component should be somewhat self-reliant, with fail-safes like redundancy and retries.
For example, if Component B relies on Component A, and A recovers from an issue but B doesn't, causing further problems, the onus is on B, not A.
Take server and network faults. If a network hiccup is quickly fixed but an app doesn't bounce back, it's on the app to improve, not the network.
In cases of strong-weak dependencies, a core app should have a fallback for any non-core app it relies on. If a core app goes down because of a non-core app's glitch, it's the core app's job to make improvements.
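As a rough sketch of this self-reliance idea, here's what retry-plus-fallback might look like in Python; the dependency call, retry settings, and the cached default are all hypothetical:

```python
import time

# A rough sketch of the "self-reliance" idea: the caller (Component B) retries
# its dependency (Component A) with backoff and then degrades gracefully instead
# of failing outright. The dependency call and cached default are hypothetical.

class DependencyError(Exception):
    pass

def fetch_from_dependency() -> dict:
    # Stand-in for a call to Component A that is currently failing.
    raise DependencyError("component A is temporarily unavailable")

def get_profile(user_id: str, retries: int = 3, backoff_s: float = 0.2) -> dict:
    for attempt in range(retries):
        try:
            return fetch_from_dependency()
        except DependencyError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between retries
    # Degrade gracefully: serve a default/cached value rather than propagate the fault.
    return {"user_id": user_id, "profile": "default", "stale": True}

print(get_profile("u-42"))
```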
Third-Party Responsibility
When using third-party services like cloud platforms, the default is that these providers are not to blame.
If a third-party service issue arises, internal teams should work on internal improvements, and also push for improvements in third-party services. But remember, stability shouldn't fully depend on outsiders.
For instance, apps using Aliyun services should be able to switch providers in case of trouble. The same goes for cloud storage; have a backup ready in case one region has issues.
This principle is like a nation's defense: cooperate with allies but never fully depend on them. It also keeps internal teams from passing the buck to cloud providers after migrating to the cloud, maintaining a sense of internal responsibility.
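To illustrate the storage example above, here's a minimal Python sketch of failing over between backends; the backend names are made up, and a real version would wrap the actual provider SDKs (e.g. Aliyun OSS, AWS S3) behind a shared interface:

```python
# A minimal sketch of failing over between storage backends when one provider
# or region is having trouble. The backend names are hypothetical.

class StorageBackend:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def put(self, key: str, data: bytes) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"stored {key} ({len(data)} bytes) in {self.name}"

def put_with_fallback(backends, key: str, data: bytes) -> str:
    """Try the primary backend first, then fall back to the standby."""
    last_error = None
    for backend in backends:
        try:
            return backend.put(key, data)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("all storage backends failed") from last_error

backends = [
    StorageBackend("primary-region-oss", healthy=False),  # primary having trouble
    StorageBackend("standby-region-s3"),                  # backup provider/region
]
print(put_with_fallback(backends, "reports/2024-06.csv", b"..."))
```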
Segmented Analysis
In complex cases with secondary faults or different causes, we break the incident into parts.
For example, an issue starting with an inaccurate model might later be compounded by a DBA's mistake.
This segmented approach helps focus on specific issues for targeted improvements.
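For example, the segmented write-up might boil down to something like this; a hypothetical record, loosely based on the model-plus-DBA example above:

```python
# A hypothetical segmented record for the incident above: each segment gets
# its own cause, improvement action, and owner, rather than one root cause.

incident_segments = [
    {
        "segment": "initial fault",
        "cause": "inaccurate capacity model underestimated peak traffic",
        "improvement": "recalibrate the model against recent peak loads",
        "owner": "capacity planning",
    },
    {
        "segment": "secondary fault",
        "cause": "manual DBA change during mitigation worsened the outage",
        "improvement": "require peer review for emergency database changes",
        "owner": "DBA team",
    },
]

for seg in incident_segments:
    print(f"[{seg['segment']}] {seg['cause']} -> {seg['improvement']} ({seg['owner']})")
```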
These principles aim to move us from seeking a single root cause to a broader perspective, opening more paths for self-improvement and continuous learning from incidents.
Wrap Up
Our key takeaway from fault retrospection is: focus less on finding one root cause and more on actions that enhance fault management and reduce business downtime.
We've got our Three Key Questions for retrospection and three principles for determining faults, helping assign clear responsibilities for improvements. Hopefully, these will aid effective fault analysis.
In conclusion, faults are part of running systems; a completely fault-free state is the exception, not the rule. Whatever your role, view faults with balance. Reaching that balance takes a nuanced view of faults and a culture that values improvements over penalizing mistakes.