
Commit b4787a6

New Incident Management articles for Learn (#1289)
* Add files via upload
* Create incident-management-challenges.md
* Update incident-management-challenges.md
* Create incident-assessment-severity.md
* Create postmortems.md
1 parent 02e0131 commit b4787a6

5 files changed (+361, -0 lines changed)
Two binary image files added (73.4 KB and 91.4 KB).

incident-assessment-severity.md (Lines changed: 124 additions & 0 deletions)
@@ -0,0 +1,124 @@
---
title: Incident Assessment & Severity | Checkly Guide
displayTitle: Incident Assessment & Severity Guide for Engineering Teams (+ Cheat Sheet)
navTitle: Incident Assessment & Severity Guide
description: Not every alert is an incident—and not every incident is equally urgent. Learn how to classify incidents and determine their severity.
date: 2025-05-23
author: Sara Miteva
githubUser: SaraMiteva
displayDescription: How to classify incidents and determine their severity
menu:
  learn_incidents:
    parent: Classification
weight: 20
---

## Incident Assessment & Severity

Not every alert is an incident—and not every incident is equally urgent.

That’s where **incident assessment** and **severity classification** come in. Without clear definitions, teams get stuck in limbo:

- Should we wake someone up?
- Should we inform customers?
- Should we prepare a support strategy?
- Is this critical or just annoying?

The goal of incident assessment is to **evaluate the scope and impact** of a problem, determine its urgency, and trigger the appropriate response. Done right, this step aligns engineering, support, and leadership around a shared understanding of what matters—and what to do next.

Let’s break down what effective assessment looks like and how to build your own severity classification system.

### What Is Incident Assessment?

Incident assessment is the process of determining whether an observed issue qualifies as an incident—and if so, how serious it is.

To assess an incident, you typically ask:

- What’s broken?
- Who is impacted?
- Is there a workaround?
- How fast do we need to act?

The outcome of this process is a **severity level** that maps to your internal response playbook: who gets paged, how quickly you communicate, and how visible the incident becomes across the company.

### Why Severity Levels Matter

Clear severity definitions help your team:

- Act faster under pressure
- Escalate the right issues
- Prevent over-alerting or under-reacting
- Set communication expectations internally and externally

They also create psychological safety. When engineers know exactly what qualifies as a SEV1, they don’t waste time debating—they act.

## Severity Levels: Example Framework

Here’s a simple, 3-tier severity model you can adopt or adapt:

| **Severity** | **Impact** | **Example Incident** | **Expected Action** |
| --- | --- | --- | --- |
| **SEV1** | Critical / Total Outage | Full production outage, major security breach, data loss | All hands on deck. Wake people up. 24/7 response. Execs informed. |
| **SEV2** | High / Partial Outage | 10% of users can’t log in, degraded performance, partial failure | Escalate to on-call immediately. Frequent updates. Prioritized fix. |
| **SEV3** | Moderate / Minor Bug | Broken styling, slow dashboard load, minor UX issue | Fix during business hours. Log the issue. May not require updates. |

### A Score-Based System for Classifying Severity

You can use a weighted scoring system that evaluates incidents across five dimensions. This adds structure and reduces subjective decisions:

| **Dimension** | **Low (1 pt)** | **Medium (2 pts)** | **High (3 pts)** |
| --- | --- | --- | --- |
| **User Impact** | <5% affected | 5–25% affected | >25% or all users affected |
| **Functionality** | Cosmetic / minor bug | Partial functionality loss | Core feature broken, no workaround |
| **Business Impact** | No SLA/revenue/legal risk | Mild SLA concern or revenue impact | Revenue loss, SLA breach, or legal exposure |
| **Urgency** | Can wait for a sprint | Fix in a day or two | Requires immediate attention |
| **Workaround** | Easy workaround exists | Workaround is possible but painful | No workaround available |

Then, you can map the final score as follows:

| **Total Score** | **Severity Level** |
| --- | --- |
| 5–7 | SEV3 (Low) |
| 8–11 | SEV2 (Medium) |
| 12–15 | SEV1 (High) |

### Example: Users on an unusual browser cannot check out

Let’s say our business is a review site with an ecommerce store. Users on Microsoft Edge can’t check out due to an incompatibility with our payment provider implementation.

- User Impact: Low (1) — Less than 5% of all our users are on Microsoft Edge
- Functionality: High (3) — Users are blocked at the final checkout step and are unlikely to switch browsers; most will simply abandon their cart
- Business Impact: High (3) — This will cost revenue
- Urgency: Medium (2) — By our estimate, this only requires updates to dependencies and can be fixed in a day or two
- Workaround: Medium (2) — We definitely don’t want to add a “please switch browsers” message to our site

**Score: 1 + 3 + 3 + 2 + 2 = 11 → SEV2**
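
To make scoring quick and consistent during an incident, you can codify the rubric above in a few lines of code. Here is a minimal TypeScript sketch; the dimension names and score bands simply mirror the tables in this guide, so adapt them to your own model:

```ts
// Minimal sketch of the scoring rubric above. The dimensions and score
// bands mirror the tables in this guide; adapt them to your own model.
type Score = 1 | 2 | 3; // 1 = Low, 2 = Medium, 3 = High
type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface IncidentScores {
  userImpact: Score;
  functionality: Score;
  businessImpact: Score;
  urgency: Score;
  workaround: Score;
}

function classifySeverity(scores: IncidentScores): { total: number; severity: Severity } {
  const total =
    scores.userImpact +
    scores.functionality +
    scores.businessImpact +
    scores.urgency +
    scores.workaround;

  // Map the total (5–15) onto the severity bands from the table above.
  const severity: Severity = total >= 12 ? 'SEV1' : total >= 8 ? 'SEV2' : 'SEV3';
  return { total, severity };
}

// The “unusual browser” checkout example: 1 + 3 + 3 + 2 + 2 = 11 → SEV2
console.log(classifySeverity({
  userImpact: 1,
  functionality: 3,
  businessImpact: 3,
  urgency: 2,
  workaround: 2,
}));
```

A plain function like this is easy to drop into a triage script, a chat bot, or an incident template.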

## Downloadable Incident Severity Cheat Sheet

If you want to adopt or adapt this process, you can make a copy of our own [incident severity cheat sheet](https://docs.google.com/spreadsheets/d/18L2r8u5h8ylRWNbfMZv0Ff-amJ-ySk5ZU1FLxW2G4uY/edit?usp=sharing).

## Creating Your Own Severity Rules

Every organization operates differently, and what counts as a critical incident for one team may be a routine alert for another. Here’s how to build a severity scoring system that reflects your team’s priorities, customer expectations, and business context.

### 1. Pick Dimensions That Matter to You

Start by identifying the dimensions of impact that are most relevant to your systems and stakeholders. Common ones include **user impact**, **feature impact**, **business risk**, **urgency**, and **workaround availability**—but you might also include **compliance violations**, **customer tier affected**, or **data integrity** if those are key concerns. The goal is to capture the real-world consequences of an incident in a way that reflects your product and risk model.

### 2. Agree on What “High”, “Medium”, and “Low” Mean

Without clear definitions, severity scoring quickly becomes subjective. What one engineer sees as a “minor issue” might be considered “urgent” by someone in customer success. To avoid this, document clear criteria for each level of impact. For example, define “High User Impact” as “more than 25% of users affected” or “SEV1 Business Impact” as “any outage that causes revenue loss or legal risk.” These definitions become your north star for consistent triage.

### 3. Add Automation Where Possible

Manual severity scoring can slow things down and introduce inconsistencies, especially during high-pressure incidents. Automate as much of the process as you can. A shared Google Sheet or Notion template can help teams select impact levels via dropdowns, with scores and severity levels calculated automatically. For more mature teams, connect this logic directly into your alerting pipeline or incident management tool so severity is auto-assigned on alert creation.
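
As an illustration of that last step, here is a purely hypothetical sketch: a tiny webhook receiver that scores incoming alerts and proposes a severity. The payload shape and the field-to-score mapping are assumptions, so wire it up to whatever your alerting pipeline actually sends.

```ts
// Hypothetical sketch: auto-assign a severity when an alert webhook fires.
// The payload shape and the field-to-score mapping below are assumptions.
import http from 'node:http';

type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface AlertPayload {
  checkName: string;
  usersAffectedPct: number; // estimated share of affected users, 0–100
  coreFeature: boolean;     // does the failing check cover a core flow?
}

function proposeSeverity(alert: AlertPayload): Severity {
  const userImpact = alert.usersAffectedPct > 25 ? 3 : alert.usersAffectedPct >= 5 ? 2 : 1;
  const functionality = alert.coreFeature ? 3 : 1;
  // Business impact, urgency, and workaround usually still need a human call;
  // start them at Medium (2) and let the responder adjust.
  const total = userImpact + functionality + 2 + 2 + 2;
  return total >= 12 ? 'SEV1' : total >= 8 ? 'SEV2' : 'SEV3';
}

// Minimal webhook receiver that tags incoming alerts with a proposed severity.
http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    const alert = JSON.parse(body) as AlertPayload;
    const severity = proposeSeverity(alert);
    console.log(`${alert.checkName}: proposed ${severity}`);
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ severity }));
  });
}).listen(3000);
```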

### 4. Train Your Incident Responders

Severity models only work if everyone is aligned on how to use them. Run training sessions or tabletop exercises with your incident response team. Use past incidents to “test” the model—how would they score it now, and does the outcome feel right in hindsight? Over time, this helps calibrate judgment, improves consistency across teams, and creates shared understanding between engineering, support, and leadership.

## Final Words

Clear severity rules turn gut-feel decisions into structured, confident responses. With a shared scoring model, your team can triage faster, communicate better, and stay focused when it matters most.

incident-management-challenges.md (Lines changed: 100 additions & 0 deletions)
@@ -0,0 +1,100 @@
---
title: Incident Management Challenges | Checkly Guide
displayTitle: Incident Management Challenges
navTitle: Incident Management Challenges
description: Find out what the most common incident management challenges are and how to address them.
date: 2025-05-23
author: Sara Miteva
githubUser: SaraMiteva
displayDescription: "Find out what the most common incident management challenges are and how to address them."
menu:
  learn_incidents:
    parent: Detection
weight: 20
---

**System downtime is just the tip of the iceberg.**

When something breaks in production—whether it's a full outage or a degraded feature—it kicks off far more than a technical recovery. Behind every incident lies a flurry of internal coordination, customer communication, and hard questions about what went wrong and why.

Internally, teams scramble to assess impact, debug under pressure, and coordinate across tools. Externally, trust is on the line—customers are confused, support tickets spike, and leadership wants answers.

That’s why incident management isn’t just a technical process. **It’s a test of your culture, communication, and structure**.

Let’s break down the real-world challenges teams face when managing incidents—and how a structured approach can help you respond faster, better, and with more confidence.

![Incident Management Challenges](learn/images/Incident Management Challenges (1).png)

## 1. Early Detection Is Still Hard

Despite proactive monitoring tools, teams often struggle to detect incidents early enough. The challenge?

Modern systems are deeply interconnected—built on microservices, APIs, cloud infrastructure, third-party components, and continuous deployments. When something starts to break, it rarely announces itself with a dramatic failure.

Instead, it’s a single failing endpoint. A slightly elevated latency. A login request that takes 5 seconds longer than usual. At first glance, these symptoms might not seem alarming. But left unchecked, they can cascade into a full-blown outage.

[Synthetic monitoring](https://www.checklyhq.com/product/synthetic-monitoring/) and continuous API checks can make a huge difference. However, teams need to agree on what “normal” looks like. Without a shared baseline or alerting logic, it’s too easy to ignore early signs—or drown in noisy alerts that don’t mean anything.

In the end, early detection isn’t just about tooling. It’s about tuning, ownership, and continuously improving your signal-to-noise ratio.
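
One way to make that shared baseline explicit is to encode it in the check itself. Here is a rough sketch of an API check defined with the Checkly CLI constructs that pins both the expected status code and the latency you consider normal; the URL, thresholds, and exact property names are illustrative, so verify them against your CLI version:

```ts
// Rough sketch of an API check with an explicit latency baseline,
// using the Checkly CLI constructs. URL and thresholds are placeholders.
import { ApiCheck, AssertionBuilder } from 'checkly/constructs'

new ApiCheck('login-api-baseline', {
  name: 'Login API: status and latency baseline',
  // Anything slower than the agreed baseline is flagged as degraded,
  // long before it turns into a full outage.
  degradedResponseTime: 2000, // ms
  maxResponseTime: 5000, // ms
  request: {
    method: 'GET',
    url: 'https://api.example.com/v1/login/health',
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})
```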

## 2. Defining What Is an Incident

*An **incident** is any unplanned disruption or degradation of a service that affects users or business operations and requires a response.*

Which events count as incidents? That’s for your team to decide.

Some teams treat any failed check as an incident. Others only classify it as such if customers are impacted or a system is fully down. Without alignment, this leads to chaos. One engineer might escalate a minor error, while another silently fixes a major outage without notifying anyone.

Every team defines incidents differently. But without clearly defined severity levels, it’s too easy to either over-alert or under-react.

Here’s an example of how you could classify incidents by severity:

- **SEV1**: Critical—core features down, customers impacted.
- **SEV2**: Partial degradation—users are affected, but workarounds exist.
- **SEV3**: Minor bug—non-blocking, but potentially noisy.

![Incident severity levels](learn/images/Incident serverity levels (1).png)

The challenge is not just defining these, but aligning everyone—engineering, support, and leadership—on what they mean in practice.

## 3. Coordination and Escalation Can Get Messy

When things go wrong, teams scramble. But without a clearly defined incident commander or roles like communication lead or scribe, progress stalls. People either duplicate efforts or wait for someone else to lead.

Escalation must be automatic. Everyone should know: when this happens, who gets paged, and who owns the response.
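
To make that concrete, here is a small, purely hypothetical sketch of an escalation policy written as code. The role names, paging targets, and channels are placeholders; the point is that escalation becomes a lookup rather than a debate.

```ts
// Hypothetical escalation policy as code. All names, targets, and channels
// below are placeholders; adapt them to your own on-call setup.
type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface EscalationPolicy {
  page: string[];   // who gets paged immediately
  roles: string[];  // roles to staff for the response
  notify: string[]; // where status updates go
}

const escalation: Record<Severity, EscalationPolicy> = {
  SEV1: {
    page: ['on-call-primary', 'on-call-secondary', 'engineering-manager'],
    roles: ['incident commander', 'communication lead', 'scribe'],
    notify: ['#incidents', 'status page', 'executive channel'],
  },
  SEV2: {
    page: ['on-call-primary'],
    roles: ['incident commander', 'communication lead'],
    notify: ['#incidents'],
  },
  SEV3: {
    page: [],
    roles: ['owning team'],
    notify: ['#engineering'],
  },
};

// Nobody has to decide who to call under pressure; they look it up.
console.log(escalation.SEV2.page); // ['on-call-primary']
```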

## 4. Postmortems Get Ignored or Misused

The post-incident review often turns into a blame game or a checkbox exercise. But a good postmortem is blameless, structured, and actionable.

Ask:

- What failed—process, tooling, or communication?
- What went *well*?
- What will we change in the runbook, monitoring, or alert logic?

Without real reflection, you’re doomed to repeat the same fire drills.

## 5. Fear Slows Down Response

One of the most dangerous challenges in incident management isn’t technical—it’s emotional.

When engineers fear being blamed or embarrassed in a postmortem, they become hesitant to speak up. They might delay declaring an incident, hoping it resolves quietly. Or they’ll avoid updating stakeholders out of fear that incomplete information will reflect poorly on them.

This slows everything down. Detection is delayed. Communication stalls. Recovery takes longer.

The antidote? **Psychological safety.** Teams need to know they won’t be punished for triggering an alert or surfacing a potential issue—even if it turns out to be a false alarm. In a blameless culture:

- Engineers feel safe escalating issues early
- People focus on improving systems, not assigning blame
- Postmortems become honest learning tools, not interrogations

If people are afraid of being called out, they’ll wait. And in incident response, waiting is the enemy.

Your culture determines whether your team moves fast—or freezes. Choose transparency over silence. Choose learning over blame.

## Chaos Is the Enemy, Structure Is the Fix

Incident management is a high-stakes, high-pressure process. But when you build in structure—detection, communication, escalation, resolution, reflection—you replace chaos with clarity.

Clear roles, proactive monitoring, and automated status updates can’t prevent every outage—but they ensure your team is never caught flat-footed.
