Commit 81cb0ec

Merge pull request #303064 from craigshoemaker/sre/im-draft
[SRE Agent] New: Incident management
2 parents 83f3ffb + 4d9d267 commit 81cb0ec

2 files changed

+70
-0
lines changed

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
---
2+
title: Incident management in Azure SRE Agent (preview)
3+
description: Learn how the Azure SRE Agent incident management capabilities help reduce manual intervention and accelerate resolution times for your Azure resources.
4+
author: craigshoemaker
5+
ms.topic: conceptual
6+
ms.date: 07/21/2025
7+
ms.author: cshoe
8+
ms.service: azure
9+
---
10+
11+
# Incident management in Azure SRE Agent (preview)
12+
13+
Azure SRE Agent streamlines incident management by automatically collecting, analyzing, and responding to alerts from various management platforms. This article explains how the agent processes incidents, evaluates their severity, and takes appropriate actions based on your configuration.
14+
15+
Azure SRE Agent receives alerts from incident management platforms such as:
16+
17+
* [Azure Monitor](/azure/azure-monitor/fundamentals/overview)
18+
* [PagerDuty](https://www.pagerduty.com/)
19+
20+
Alerts are triggered by predefined conditions that you configure in these systems, outside of SRE Agent.
21+
22+
When SRE Agent receives an alert from the management platform, the agent brings the incident into its context, analyzes the situation, and determines the next steps. This process mimics how a human SRE would acknowledge and investigate an incident.
23+
24+
The agent reviews logs, health probes, and other telemetry to assess the incident. During the assessment step, the agent summarizes findings, determines if the alert is a false positive, and decides whether action is needed.
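
The assessment step described above can be pictured with a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Telemetry` fields, the `Verdict` values, and the thresholds are assumptions, not the agent's actual logic or API.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of the assessment step."""
    FALSE_POSITIVE = "false_positive"
    NO_ACTION = "no_action"
    ACTION_NEEDED = "action_needed"


@dataclass
class Telemetry:
    """Hypothetical, simplified view of the signals the agent reviews."""
    error_log_count: int
    failed_health_probes: int


@dataclass
class Assessment:
    summary: str
    verdict: Verdict


def assess(alert_title: str, telemetry: Telemetry) -> Assessment:
    """Summarize findings and decide whether the alert needs action."""
    if telemetry.error_log_count == 0 and telemetry.failed_health_probes == 0:
        # Nothing in the telemetry supports the alert.
        verdict = Verdict.FALSE_POSITIVE
    elif telemetry.failed_health_probes == 0:
        # Errors are logged, but the resource still passes its probes.
        verdict = Verdict.NO_ACTION
    else:
        verdict = Verdict.ACTION_NEEDED
    summary = (
        f"{alert_title}: {telemetry.error_log_count} error log entries, "
        f"{telemetry.failed_health_probes} failed health probes"
    )
    return Assessment(summary=summary, verdict=verdict)
```

For example, `assess("High CPU", Telemetry(0, 0))` yields a `FALSE_POSITIVE` verdict in this sketch, while any failed health probe yields `ACTION_NEEDED`.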
25+
26+
## How agents respond
27+
28+
SRE Agent responds to incidents based on its configuration and operational mode.
29+
30+
* **Reader**: In reader mode, the agent provides recommendations and requires human intervention for resolution.
31+
32+
* **Autonomous**: In autonomous mode, the agent can automatically close incidents or take corrective actions, depending on your configuration settings. The agent can also update or close incidents in management platforms to maintain synchronization across platforms.
33+
34+
You define the rules for how incidents of different priorities are handled. By customizing the rules in the management platforms, you decide which incidents the agent should acknowledge, resolve, or escalate. These rules can be set via prompts or configuration options.
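
The interplay between operational mode and priority rules can be sketched in Python as follows. This is an illustrative model only: the `Mode` and `Action` names, the priority labels, and the fallback to escalation are assumptions, and the real rules are set in the management platform through prompts or configuration options.

```python
from enum import Enum


class Mode(Enum):
    READER = "reader"
    AUTONOMOUS = "autonomous"


class Action(Enum):
    ACKNOWLEDGE = "acknowledge"
    RESOLVE = "resolve"
    ESCALATE = "escalate"


# Hypothetical rule set: which action applies to each incident priority.
DEFAULT_RULES = {
    "P1": Action.ESCALATE,
    "P2": Action.ACKNOWLEDGE,
    "P3": Action.RESOLVE,
}


def respond(mode: Mode, priority: str, rules: dict = DEFAULT_RULES) -> str:
    """Return what the agent would do for an incident of the given priority."""
    # Unknown priorities fall back to escalation (an assumption of this sketch).
    action = rules.get(priority, Action.ESCALATE)
    if mode is Mode.READER:
        # Reader mode only recommends; a human carries out the action.
        return f"recommend: {action.value}"
    # Autonomous mode carries out the configured action itself.
    return f"execute: {action.value}"
```

In this sketch the same rule produces a recommendation in reader mode and an executed action in autonomous mode, which mirrors the distinction between the two modes.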
35+
36+
## Platform integration
37+
38+
Azure Monitor is the default integration and requires minimal setup. Non-Microsoft systems like PagerDuty require extra setup to define incident handling preferences.
39+
40+
To access the incident management settings, open your agent in the Azure portal. Select **Settings**, and then select **Incident platform**.
41+
42+
### Azure Monitor
43+
44+
By default, Azure Monitor is configured as the agent's incident management platform. As the agent encounters incidents, any Azure Monitor instances in the resource groups managed by SRE Agent process the incident data.
45+
46+
To use a different management platform, first disconnect Azure Monitor as the incident management platform for the SRE Agent.
47+
48+
### PagerDuty
49+
50+
To set up PagerDuty, open the agent in the Azure portal, select **Settings**, then select **Incident platform**, and enter the following settings:
51+
52+
| Setting | Value |
53+
|---|---|
54+
| Incident platform | Select **PagerDuty**. |
55+
| REST API access key | Enter your PagerDuty REST API access key. |
56+
| Quickstart handler | Keep the quickstart handler checkbox selected. |
57+
58+
Select **Save** to save your changes.
59+
60+
After the changes are saved, PagerDuty is responsible for managing incidents for the agent.
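
Before you enter the REST API access key, you might want to confirm that it's well formed and accepted by PagerDuty. The following Python sketch only builds an authenticated request and doesn't send it. It assumes the PagerDuty REST API v2 conventions (`Authorization: Token token=...` header and the versioned `Accept` header) and uses the `/abilities` endpoint as a lightweight check. The function name and the choice of endpoint are illustrative assumptions and aren't part of the SRE Agent setup.

```python
import urllib.request

# Assumption: base URL of the PagerDuty REST API.
PAGERDUTY_API_BASE = "https://api.pagerduty.com"


def build_key_check_request(api_key: str) -> urllib.request.Request:
    """Build (but don't send) a GET request authenticated with the given key."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return urllib.request.Request(
        f"{PAGERDUTY_API_BASE}/abilities",
        headers={
            # PagerDuty REST API v2 token authentication scheme.
            "Authorization": f"Token token={key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        method="GET",
    )
```

To run the check, pass the request to `urllib.request.urlopen`. An HTTP 401 response would suggest that the key is invalid.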
61+
62+
#### Tools
63+
64+
You can choose to enable a series of tools that provide granular control over how PagerDuty manages incidents. To further refine the incident management process, you can also add free-form text instructions (in the form of an LLM prompt) to customize how PagerDuty responds to incidents.
65+
66+
## Related content
67+
68+
* [Security contexts](./security-context.md)

articles/sre-agent/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ items:
1919
items:
2020
- name: Security context
2121
href: security-context.md
22+
- name: Incident management
23+
href: incident-management.md
2224
- name: Troubleshooting
2325
href: troubleshoot.md
2426
- name: Billing
