
Commit b4787a6

New Incident Management articles for Learn (#1289)
* Add files via upload
* Create incident-management-challenges.md
* Update incident-management-challenges.md
* Create incident-assessment-severity.md
* Create postmortems.md
1 parent 02e0131 commit b4787a6

5 files changed (+361, -0 lines changed)
Two binary image files added (73.4 KB and 91.4 KB).

incident-assessment-severity.md (Lines changed: 124 additions & 0 deletions)
@@ -0,0 +1,124 @@
---
title: Incident Assessment & Severity | Checkly Guide
displayTitle: Incident Assessment & Severity Guide for Engineering Teams (+ Cheat Sheet)
navTitle: Incident Assessment & Severity Guide
description: Not every alert is an incident—and not every incident is equally urgent. Learn how to classify incidents and determine their severity.
date: 2025-05-23
author: Sara Miteva
githubUser: SaraMiteva
displayDescription: How to classify incidents and determine their severity
menu:
  learn_incidents:
    parent: Classification
weight: 20
---

## Incident Assessment & Severity

Not every alert is an incident—and not every incident is equally urgent.

That’s where **incident assessment** and **severity classification** come in. Without clear definitions, teams get stuck in limbo:

- Should we wake someone up?
- Should we inform customers?
- Should we prepare a support strategy?
- Is this critical or just annoying?

The goal of incident assessment is to **evaluate the scope and impact** of a problem, determine its urgency, and trigger the appropriate response. Done right, this step aligns engineering, support, and leadership around a shared understanding of what matters—and what to do next.

Let’s break down what effective assessment looks like and how to build your own severity classification system.

### What Is Incident Assessment?

Incident assessment is the process of determining whether an observed issue qualifies as an incident—and if so, how serious it is.

To assess an incident, you typically ask:

- What’s broken?
- Who is impacted?
- Is there a workaround?
- How fast do we need to act?

The outcome of this process is a **severity level** that maps to your internal response playbook: who gets paged, how quickly you communicate, and how visible the incident becomes across the company.

### Why Severity Levels Matter

Clear severity definitions help your team:

- Act faster under pressure
- Escalate the right issues
- Prevent over-alerting or under-reacting
- Set communication expectations internally and externally

They also create psychological safety. When engineers know exactly what qualifies as a SEV1, they don’t waste time debating—they act.

## Severity Levels: Example Framework

Here’s a simple, 3-tier severity model you can adopt or adapt:

| **Severity** | **Impact** | **Example Incident** | **Expected Action** |
| --- | --- | --- | --- |
| **SEV1** | Critical / Total Outage | Full production outage, major security breach, data loss | All hands on deck. Wake people up. 24/7 response. Execs informed. |
| **SEV2** | High / Partial Outage | 10% of users can’t log in, degraded performance, partial failure | Escalate to on-call immediately. Frequent updates. Prioritized fix. |
| **SEV3** | Moderate / Minor Bug | Broken styling, slow dashboard load, minor UX issue | Fix during business hours. Log the issue. May not require updates. |

### A Score-Based System for Classifying Severity

You can use a weighted scoring system that evaluates incidents across five dimensions. This adds structure and reduces subjective decisions:

| **Dimension** | **Low (1 pt)** | **Medium (2 pts)** | **High (3 pts)** |
| --- | --- | --- | --- |
| **User Impact** | <5% affected | 5–25% affected | >25% or all users affected |
| **Functionality** | Cosmetic / minor bug | Partial functionality loss | Core feature broken, no workaround |
| **Business Impact** | No SLA/revenue/legal risk | Mild SLA concern or revenue impact | Revenue loss, SLA breach, or legal exposure |
| **Urgency** | Can wait for a sprint | Fix in a day or two | Requires immediate attention |
| **Workaround** | Easy workaround exists | Workaround is possible but painful | No workaround available |

Then, you can map the final score as follows:

| **Total Score** | **Severity Level** |
| --- | --- |
| 5–7 | SEV3 (Low) |
| 8–11 | SEV2 (Medium) |
| 12–15 | SEV1 (High) |

### Example: Users on an unusual browser cannot check out

Let’s say our business is a review site with an ecommerce store. Users on Microsoft Edge can’t check out due to an incompatibility with our payment provider implementation.

- User Impact: Low (1) — Less than 5% of all our users are on Microsoft Edge
- Functionality: High (3) — Users are blocked at the final checkout step and are unlikely to switch browsers; most will simply abandon their cart
- Business Impact: High (3) — This will cost revenue
- Urgency: Medium (2) — By our estimate, this only requires updates to dependencies and can be fixed in a day or two
- Workaround: Medium (2) — We definitely don’t want to add a “please switch browsers” message to our site

**Score: 1 + 3 + 3 + 2 + 2 = 11 → SEV2**
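
To make scoring quick and consistent during an incident, you can codify the rubric above in a few lines of code. Here is a minimal TypeScript sketch; the dimension names and score bands simply mirror the tables in this guide, so adapt them to your own model:

```ts
// Minimal sketch of the scoring rubric above. The dimensions and score
// bands mirror the tables in this guide; adapt them to your own model.
type Score = 1 | 2 | 3; // 1 = Low, 2 = Medium, 3 = High
type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface IncidentScores {
  userImpact: Score;
  functionality: Score;
  businessImpact: Score;
  urgency: Score;
  workaround: Score;
}

function classifySeverity(scores: IncidentScores): { total: number; severity: Severity } {
  const total =
    scores.userImpact +
    scores.functionality +
    scores.businessImpact +
    scores.urgency +
    scores.workaround;

  // Map the total (5–15) onto the severity bands from the table above.
  const severity: Severity = total >= 12 ? 'SEV1' : total >= 8 ? 'SEV2' : 'SEV3';
  return { total, severity };
}

// The “unusual browser” checkout example: 1 + 3 + 3 + 2 + 2 = 11 → SEV2
console.log(classifySeverity({
  userImpact: 1,
  functionality: 3,
  businessImpact: 3,
  urgency: 2,
  workaround: 2,
}));
```

A plain function like this is easy to drop into a triage script, a chat bot, or an incident template.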

## Downloadable Incident Severity Cheat Sheet

If you want to adopt or adapt this process, you can make a copy of our own [incident severity cheat sheet](https://docs.google.com/spreadsheets/d/18L2r8u5h8ylRWNbfMZv0Ff-amJ-ySk5ZU1FLxW2G4uY/edit?usp=sharing).

## Creating Your Own Severity Rules

Every organization operates differently, and what counts as a critical incident for one team may be a routine alert for another. Here’s how to build a severity scoring system that reflects your team’s priorities, customer expectations, and business context.

### 1. Pick Dimensions That Matter to You

Start by identifying the dimensions of impact that are most relevant to your systems and stakeholders. Common ones include **user impact**, **feature impact**, **business risk**, **urgency**, and **workaround availability**—but you might also include **compliance violations**, **customer tier affected**, or **data integrity** if those are key concerns. The goal is to capture the real-world consequences of an incident in a way that reflects your product and risk model.

### 2. Agree on What “High”, “Medium”, and “Low” Mean

Without clear definitions, severity scoring quickly becomes subjective. What one engineer sees as a “minor issue” might be considered “urgent” by someone in customer success. To avoid this, document clear criteria for each level of impact. For example, define “High User Impact” as “more than 25% of users affected” or “SEV1 Business Impact” as “any outage that causes revenue loss or legal risk.” These definitions become your north star for consistent triage.

### 3. Add Automation Where Possible

Manual severity scoring can slow things down and introduce inconsistencies, especially during high-pressure incidents. Automate as much of the process as you can. A shared Google Sheet or Notion template can help teams select impact levels via dropdowns, with scores and severity levels calculated automatically. For more mature teams, connect this logic directly into your alerting pipeline or incident management tool so severity is auto-assigned on alert creation.
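
As an illustration of that last step, here is a purely hypothetical sketch: a tiny webhook receiver that scores incoming alerts and proposes a severity. The payload shape and the field-to-score mapping are assumptions, so wire it up to whatever your alerting pipeline actually sends.

```ts
// Hypothetical sketch: auto-assign a severity when an alert webhook fires.
// The payload shape and the field-to-score mapping below are assumptions.
import http from 'node:http';

type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface AlertPayload {
  checkName: string;
  usersAffectedPct: number; // estimated share of affected users, 0–100
  coreFeature: boolean;     // does the failing check cover a core flow?
}

function proposeSeverity(alert: AlertPayload): Severity {
  const userImpact = alert.usersAffectedPct > 25 ? 3 : alert.usersAffectedPct >= 5 ? 2 : 1;
  const functionality = alert.coreFeature ? 3 : 1;
  // Business impact, urgency, and workaround usually still need a human call;
  // start them at Medium (2) and let the responder adjust.
  const total = userImpact + functionality + 2 + 2 + 2;
  return total >= 12 ? 'SEV1' : total >= 8 ? 'SEV2' : 'SEV3';
}

// Minimal webhook receiver that tags incoming alerts with a proposed severity.
http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    const alert = JSON.parse(body) as AlertPayload;
    const severity = proposeSeverity(alert);
    console.log(`${alert.checkName}: proposed ${severity}`);
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ severity }));
  });
}).listen(3000);
```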

### 4. Train Your Incident Responders

Severity models only work if everyone is aligned on how to use them. Run training sessions or tabletop exercises with your incident response team. Use past incidents to “test” the model—how would they score it now, and does the outcome feel right in hindsight? Over time, this helps calibrate judgment, improves consistency across teams, and creates shared understanding between engineering, support, and leadership.

## Final Words

Clear severity rules turn gut-feel decisions into structured, confident responses. With a shared scoring model, your team can triage faster, communicate better, and stay focused when it matters most.

incident-management-challenges.md (Lines changed: 100 additions & 0 deletions)
@@ -0,0 +1,100 @@
---
title: Incident Management Challenges | Checkly Guide
displayTitle: Incident Management Challenges
navTitle: Incident Management Challenges
description: Find out what the most common incident management challenges are and how to address them.
date: 2025-05-23
author: Sara Miteva
githubUser: SaraMiteva
displayDescription: "Find out what the most common incident management challenges are and how to address them."
menu:
  learn_incidents:
    parent: Detection
weight: 20
---

**System downtime is just the tip of the iceberg.**

When something breaks in production—whether it's a full outage or a degraded feature—it kicks off far more than a technical recovery. Behind every incident lies a flurry of internal coordination, customer communication, and hard questions about what went wrong and why.

Internally, teams scramble to assess impact, debug under pressure, and coordinate across tools. Externally, trust is on the line—customers are confused, support tickets spike, and leadership wants answers.

That’s why incident management isn’t just a technical process. **It’s a test of your culture, communication, and structure**.

Let’s break down the real-world challenges teams face when managing incidents—and how a structured approach can help you respond faster, better, and with more confidence.

![Incident Management Challenges](learn/images/Incident Management Challenges (1).png)

## 1. Early Detection Is Still Hard

Despite proactive monitoring tools, teams often struggle to detect incidents early enough. The challenge?

Modern systems are deeply interconnected—built on microservices, APIs, cloud infrastructure, third-party components, and continuous deployments. When something starts to break, it rarely announces itself with a dramatic failure.

Instead, it’s a single failing endpoint. A slightly elevated latency. A login request that takes 5 seconds longer than usual. At first glance, these symptoms might not seem alarming. But left unchecked, they can cascade into a full-blown outage.

[Synthetic monitoring](https://www.checklyhq.com/product/synthetic-monitoring/) and continuous API checks can make a huge difference. However, teams need to agree on what “normal” looks like. Without a shared baseline or alerting logic, it’s too easy to ignore early signs—or drown in noisy alerts that don’t mean anything.

In the end, early detection isn’t just about tooling. It’s about tuning, ownership, and continuously improving your signal-to-noise ratio.
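
One way to make that shared baseline explicit is to encode it in the check itself. Here is a rough sketch of an API check defined with the Checkly CLI constructs that pins both the expected status code and the latency you consider normal; the URL, thresholds, and exact property names are illustrative, so verify them against your CLI version:

```ts
// Rough sketch of an API check with an explicit latency baseline,
// using the Checkly CLI constructs. URL and thresholds are placeholders.
import { ApiCheck, AssertionBuilder } from 'checkly/constructs'

new ApiCheck('login-api-baseline', {
  name: 'Login API: status and latency baseline',
  // Anything slower than the agreed baseline is flagged as degraded,
  // long before it turns into a full outage.
  degradedResponseTime: 2000, // ms
  maxResponseTime: 5000, // ms
  request: {
    method: 'GET',
    url: 'https://api.example.com/v1/login/health',
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})
```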

## 2. Defining What Is an Incident

*An **incident** is any unplanned disruption or degradation of a service that affects users or business operations and requires a response.*

Which events count as incidents? That’s for your team to decide.

Some teams treat any failed check as an incident. Others only classify it as such if customers are impacted or a system is fully down. Without alignment, this leads to chaos. One engineer might escalate a minor error, while another silently fixes a major outage without notifying anyone.

Every team defines incidents differently. But without clearly defined severity levels, it’s too easy to either over-alert or under-react.

Here’s an example of how you could classify incidents by severity:

- **SEV1**: Critical—core features down, customers impacted.
- **SEV2**: Partial degradation—users are affected, but workarounds exist.
- **SEV3**: Minor bug—non-blocking, but potentially noisy.

![Incident severity levels](learn/images/Incident serverity levels (1).png)

The challenge is not just defining these, but aligning everyone—engineering, support, and leadership—on what they mean in practice.

## 3. Coordination and Escalation Can Get Messy

When things go wrong, teams scramble. But without a clearly defined incident commander or roles like communication lead or scribe, progress stalls. People either duplicate efforts or wait for someone else to lead.

Escalation must be automatic. Everyone should know: when this happens, who gets paged, and who owns the response.
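
To make that concrete, here is a small, purely hypothetical sketch of an escalation policy written as code. The role names, paging targets, and channels are placeholders; the point is that escalation becomes a lookup rather than a debate.

```ts
// Hypothetical escalation policy as code. All names, targets, and channels
// below are placeholders; adapt them to your own on-call setup.
type Severity = 'SEV1' | 'SEV2' | 'SEV3';

interface EscalationPolicy {
  page: string[];   // who gets paged immediately
  roles: string[];  // roles to staff for the response
  notify: string[]; // where status updates go
}

const escalation: Record<Severity, EscalationPolicy> = {
  SEV1: {
    page: ['on-call-primary', 'on-call-secondary', 'engineering-manager'],
    roles: ['incident commander', 'communication lead', 'scribe'],
    notify: ['#incidents', 'status page', 'executive channel'],
  },
  SEV2: {
    page: ['on-call-primary'],
    roles: ['incident commander', 'communication lead'],
    notify: ['#incidents'],
  },
  SEV3: {
    page: [],
    roles: ['owning team'],
    notify: ['#engineering'],
  },
};

// Nobody has to decide who to call under pressure; they look it up.
console.log(escalation.SEV2.page); // ['on-call-primary']
```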

## 4. Postmortems Get Ignored or Misused

The post-incident review often turns into a blame game or a checkbox exercise. But a good postmortem is blameless, structured, and actionable.

Ask:

- What failed—process, tooling, or communication?
- What went *well*?
- What will we change in the runbook, monitoring, or alert logic?

Without real reflection, you’re doomed to repeat the same fire drills.

## 5. Fear Slows Down Response

One of the most dangerous challenges in incident management isn’t technical—it’s emotional.

When engineers fear being blamed or embarrassed in a postmortem, they become hesitant to speak up. They might delay declaring an incident, hoping it resolves quietly. Or they’ll avoid updating stakeholders out of fear that incomplete information will reflect poorly on them.

This slows everything down. Detection is delayed. Communication stalls. Recovery takes longer.

The antidote? **Psychological safety.** Teams need to know they won’t be punished for triggering an alert or surfacing a potential issue—even if it turns out to be a false alarm. In a blameless culture:

- Engineers feel safe escalating issues early
- People focus on improving systems, not assigning blame
- Postmortems become honest learning tools, not interrogations

If people are afraid of being called out, they’ll wait. And in incident response, waiting is the enemy.

Your culture determines whether your team moves fast—or freezes. Choose transparency over silence. Choose learning over blame.

## Chaos Is the Enemy, Structure Is the Fix

Incident management is a high-stakes, high-pressure process. But when you build in structure—detection, communication, escalation, resolution, reflection—you replace chaos with clarity.

Clear roles, proactive monitoring, and automated status updates can’t prevent every outage—but they ensure your team is never caught flat-footed.
