Commit 25671e7

Merge pull request #1885 from Praveen-98cs/rca-pe
[PE]: Add Incident overview doc
2 parents 0005e1a + cd247b4 commit 25671e7

7 files changed

+181
-0
lines changed

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
1+
# Incident Overview
2+
3+
This section explains how Choreo automatically detects, analyzes, and helps you manage component incidents. The incident management feature provides automated root cause analysis to help you quickly understand and resolve issues affecting your components.
4+
5+
!!! tip
6+
Incidents are automatically created when critical system events are detected in your components. You don't need to configure anything. Choreo monitors your components and creates incidents when issues occur.
7+
8+
## What are Incidents?
9+
10+
Incidents are automatically generated alerts that indicate your component has experienced a critical issue affecting its stability or availability. When an incident occurs, Choreo automatically:
11+
12+
- Creates an incident record with detailed information
13+
- Collects relevant logs and metrics from before and during the incident
14+
- Analyzes source code and configuration changes in recent deployments
15+
16+
This helps you quickly identify what went wrong and how to fix it.
17+
18+
## Incident Types
19+
20+
Choreo automatically detects and tracks the following types of incidents:
21+
22+
- [OOMKilled incidents](#oomkilled-incidents)
23+
- [CrashLoopBackOff incidents](#crashloopbackoff-incidents)
24+
25+
### OOMKilled Incidents
26+
27+
OOMKilled incidents occur when your component runs out of memory and is terminated by the system.
28+
29+
**Common causes:**
30+
31+
- Memory leaks in your application
32+
- Memory allocation that is too low for your workload
33+
- Unexpected traffic spikes causing memory pressure
34+
- Large data processing without proper resource management
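
As an illustration of the last cause, the sketch below processes input in fixed-size batches so that only one batch is held in memory at a time. It is a minimal example, not something Choreo provides; the function names `batched` and `total_length` and the default batch size are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size records, one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so the previous batch can be freed
    if batch:
        yield batch  # final, possibly smaller, batch


def total_length(records: Iterable[str], batch_size: int = 1000) -> int:
    """Aggregate over the input batch by batch instead of loading it all."""
    total = 0
    for batch in batched(records, batch_size):
        total += sum(len(record) for record in batch)
    return total
```

Passing a lazy iterable (for example, a file object that yields lines) keeps peak memory roughly proportional to the batch size rather than to the full input.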
35+
36+
37+
### CrashLoopBackOff Incidents
38+
39+
CrashLoopBackOff incidents occur when your component repeatedly crashes and restarts.
40+
41+
**Common causes:**
42+
43+
- Application fails to start properly
44+
- Missing or incorrect configuration values
45+
- Unable to connect to required dependencies (databases, APIs, etc.)
46+
- Code errors that cause the application to crash immediately on startup
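
Missing configuration is easier to diagnose when the application reports it explicitly at startup. The sketch below is one possible approach, not a Choreo API: the variable names in `REQUIRED_VARS` are hypothetical, and the helper functions are introduced only for illustration. The process still exits when configuration is missing, but the application logs then name the exact variables to fix.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names; replace with the ones your component needs.
REQUIRED_VARS = ("DATABASE_URL", "API_KEY")


def missing_config(required: Iterable[str],
                   env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


def check_config_or_exit(required: Iterable[str] = REQUIRED_VARS,
                         env: Mapping[str, str] = os.environ) -> None:
    """Exit with a clear message if any required variable is missing."""
    missing = missing_config(required, env)
    if missing:
        # SystemExit with a string prints the message to stderr and exits non-zero.
        raise SystemExit("Missing required configuration: " + ", ".join(missing))
```

Calling `check_config_or_exit()` as the first step of startup makes the failure reason visible in the first few log lines.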
47+
48+
## View Incidents
49+
50+
### Accessing the Incidents Page
51+
52+
1. Navigate to the component you want to monitor.
53+
54+
2. In the left navigation menu, click **Observability** and then click **Incidents**.
55+
56+
!!! note
57+
You can view incidents at the **project level** or the **component level**.
58+
59+
![Incidents Navigation](../assets/img/monitoring-and-insights/incidents/incident-view.png){.cInlineImage-full}
60+
61+
3. The incidents page lists all detected incidents for the project or the component, depending on the level from which you opened it.
62+
63+
### Filtering Incidents
64+
65+
Use the filters at the top of the incidents page to find specific incidents:
66+
67+
- **Time Range**: Select a date range to view incidents from a specific period
68+
- **Environment**: View incidents from specific environments (e.g., Development, Production)
69+
70+
## Understanding Incident Details
71+
72+
Click on any incident to view comprehensive diagnostic information.
73+
74+
### Incident Summary
75+
76+
At the top of the incident details page, you'll see:
77+
78+
- **Incident Type**: What kind of issue occurred (OOMKilled or CrashLoopBackOff)
79+
- **Incident ID**: Unique identifier for the incident
80+
- **Time of Failure**: Exact date and time of the incident
81+
82+
### Incident Analysis Sections
83+
84+
An open incident contains four sections that help you understand and resolve the issue:
85+
86+
| **Section** | **Description** |
87+
|--------------------|---------------------------------------------------------------------------------|
88+
| **Compare Source Code** | Analyzes code changes between the incident version and the previous stable deployment. |
89+
| **Compare Configurations**| Highlights changes in environment variables or resource allocations. |
90+
| **Logs** | Displays filtered logs and events captured at the time of the incident. |
91+
| **Metrics** | Shows resource usage and performance metrics leading up to the incident. |
92+
93+
#### 1. Compare Source Code
94+
95+
Analyzes code changes between the incident version and the previous stable deployment.
96+
97+
**What you'll see:**
98+
- Side-by-side code diff showing what changed between versions
99+
- Specific files and lines that were modified
99+
- A link to view the commit diff in the respective Git provider
101+
102+
!!! note
103+
If the commit diff contains more than 5000 characters, only the commit diff link is shown.
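
The rule in the note above can be sketched as a simple length check. This is only an illustration of the documented behavior, not Choreo's implementation; the function name `diff_view` and the returned dictionary shape are assumptions.

```python
from typing import Dict, Optional

MAX_INLINE_DIFF_CHARS = 5000  # threshold stated in the note above


def diff_view(diff_text: str, diff_url: str) -> Dict[str, Optional[str]]:
    """Decide whether the diff is shown inline or only as a link."""
    if len(diff_text) > MAX_INLINE_DIFF_CHARS:
        # Too large to show inline: only the link to the Git provider remains.
        return {"inline_diff": None, "link": diff_url}
    return {"inline_diff": diff_text, "link": diff_url}
```

Because the condition is "more than 5000 characters", a diff of exactly 5000 characters is still treated as inline in this sketch.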
104+
105+
![Compare Source Code](../assets/img/monitoring-and-insights/incidents/compare-source-code.png){.cInlineImage-full}
106+
107+
!!! tip
108+
If the incident occurred shortly after a deployment, carefully review the code changes. They often reveal the root cause.
109+
110+
#### 2. Compare Configurations
111+
112+
Highlights changes in environment variables, secrets, and resource allocations (CPU/Memory).
113+
114+
**What you'll see:**
115+
- Configuration differences between the current and previous deployment
116+
- Changes in environment variables
117+
- Resource allocation modifications (CPU and memory limits)
118+
- Secret and ConfigMap changes
119+
120+
![Compare Configurations](../assets/img/monitoring-and-insights/incidents/compare-configurations.png){.cInlineImage-full}
121+
122+
!!! important
123+
For OOMKilled incidents, check if memory limits were reduced. For CrashLoopBackOff, verify that all required environment variables are correctly set.
124+
125+
#### 3. Logs
126+
127+
Displays filtered logs and events captured **5 minutes before** and **20 seconds after** the incident occurred.
128+
129+
**What you'll see:**
130+
- **Application Logs**: Your application's output and error logs
131+
- **System Logs**: Container lifecycle events (restarts, crashes, OOMKilled events)
132+
- **Gateway Logs**: API Gateway access and error logs (if applicable)
133+
- Up to 200 log entries per log type, helping you pinpoint exactly what happened
134+
135+
![Logs](../assets/img/monitoring-and-insights/incidents/logs.png){.cInlineImage-full}
136+
137+
!!! note
138+
Logs are automatically collected from 5 minutes before the incident to 20 seconds after, ensuring you have context before and during the failure.
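
To make the collection window concrete, the sketch below computes the start and end times for a given incident timestamp and applies the limit of 200 entries per log type. It is an illustration under stated assumptions rather than Choreo's implementation: the `LogEntry` tuple shape and the function names are hypothetical, and keeping the earliest 200 entries of each type is an assumption, since this page does not say which entries are kept when more exist.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

BEFORE = timedelta(minutes=5)    # window start, relative to the incident
AFTER = timedelta(seconds=20)    # window end, relative to the incident
MAX_PER_TYPE = 200               # documented cap per log type

# Hypothetical entry shape: (timestamp, log type, message)
LogEntry = Tuple[datetime, str, str]


def collection_window(incident_time: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the log collection window."""
    return incident_time - BEFORE, incident_time + AFTER


def select_logs(entries: Iterable[LogEntry],
                incident_time: datetime) -> Dict[str, List[LogEntry]]:
    """Group in-window entries by log type, capped at MAX_PER_TYPE each."""
    start, end = collection_window(incident_time)
    selected: Dict[str, List[LogEntry]] = {}
    for entry in sorted(entries, key=lambda e: e[0]):
        timestamp, log_type, _message = entry
        if start <= timestamp <= end:
            bucket = selected.setdefault(log_type, [])
            if len(bucket) < MAX_PER_TYPE:  # assumption: keep the earliest entries
                bucket.append(entry)
    return selected
```

For an incident at 12:00:00, the window in this sketch runs from 11:55:00 to 12:00:20, inclusive at both ends.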
139+
140+
#### 4. Metrics
141+
142+
Shows resource usage and performance metrics leading up to the incident.
143+
144+
**What you'll see:**
145+
- **Memory and CPU Usage**: Trends showing resource consumption over time
146+
- **Request Rates**: Number of requests per minute
147+
- **Response Times**: API latency and performance metrics
148+
- **Error Rates**: HTTP error status codes and failure rates
149+
150+
![Metrics](../assets/img/monitoring-and-insights/incidents/metrics.png){.cInlineImage-full}
151+
152+
**Use metrics to:**
153+
- Identify memory leaks (steadily increasing memory usage)
154+
- Spot CPU spikes that may have caused issues
155+
- Correlate traffic spikes with the incident
156+
- Understand performance degradation patterns
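
A steadily increasing memory series can also be checked numerically. The sketch below fits a least-squares line to evenly spaced memory samples and flags a sustained upward slope. It is a rough heuristic for offline analysis, not a Choreo feature; the function names and the default threshold of 1 MB per sample are assumptions to tune for your workload.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(memory_mb: Sequence[float],
                    min_slope_mb: float = 1.0) -> bool:
    """Flag a series whose memory grows by at least min_slope_mb per sample."""
    return slope(memory_mb) >= min_slope_mb
```

A series that fluctuates around a constant level produces a slope near zero, while a leak tends to show a positive slope that persists across longer windows.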
157+
158+
## Best Practices
159+
160+
### Prevent OOMKilled Incidents
161+
162+
- **Set appropriate memory limits** based on your component's actual usage patterns
163+
- **Monitor memory trends** regularly to catch gradual increases
164+
- **Implement proper memory management** in your code
165+
- **Add alerts** for high memory usage before it reaches the limit
166+
167+
### Prevent CrashLoopBackOff Incidents
168+
169+
- **Test deployments** in non-production environments first
170+
- **Validate configuration** before deploying
171+
- **Implement health checks** in your application
172+
- **Ensure dependencies are available** before deploying
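
As one way to implement a health check, the sketch below exposes a small HTTP endpoint using only the Python standard library. The `/healthz` path, the default port, and the response body are assumptions for illustration; this page does not specify what path or format Choreo expects, so adapt them to your component's health check configuration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with a small JSON status document."""

    def do_GET(self) -> None:
        if self.path == "/healthz":  # hypothetical path
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format: str, *args) -> None:
        pass  # keep health probes out of the application logs


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a server that serves the health endpoint."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` in a background thread lets the endpoint run alongside the main application. A more useful check would also verify critical dependencies before reporting `ok`.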
173+
174+
### General Best Practices
175+
176+
1. **Review incidents regularly**: Don't wait for critical issues. Proactively review and address incidents
177+
2. **Follow recommendations**: The root cause analysis provides actionable steps, so act on them
178+
3. **Track patterns**: If similar incidents occur repeatedly, investigate deeper systemic issues
179+
4. **Update resource limits**: Adjust CPU and memory limits based on incident insights
180+
5. **Keep deployments small**: Smaller, incremental deployments make it easier to identify what changed when incidents occur

en/developer-docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -217,6 +217,7 @@ nav:
217217
- Monitoring and Insights:
218218
- Observability Overview: monitoring-and-insights/observability-overview.md
219219
- Alert Overview: monitoring-and-insights/alerts-overview.md
220+
- Incident Overview: monitoring-and-insights/incident-overview.md
220221
- Operational Insights: monitoring-and-insights/operational-insights-dashboard.md
221222
- Delivery Insights:
222223
- Configure Delivery Insights: monitoring-and-insights/delivery-insights/configure-delivery-insights.md
