Commit 851fd52

Merge pull request #299784 from MicrosoftDocs/release-build-sre-agents
[Build 2025 Ship Room] SRE Agents (#421911)
2 parents a84e9e1 + f9ea13b commit 851fd52

5 files changed: +75 -0 lines changed

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+---
+title: SRE Agent overview (preview)
+description: Learn how AI-enabled agents help solve problems and support resilient and self-healing systems on your behalf.
+services: container-apps
+author: craigshoemaker
+ms.service: azure-container-apps
+ms.topic: conceptual
+ms.date: 05/08/2025
+ms.author: cshoe
+---
+
+# SRE Agent overview (preview)
+
+Site Reliability Engineering (SRE) focuses on creating reliable, scalable systems through automation and proactive management. An SRE Agent brings these principles to your cloud environment by providing AI-powered monitoring, troubleshooting, and remediation capabilities. An SRE Agent automates routine operational tasks and provides reasoned insights to help you maintain application reliability while reducing manual intervention. The agent is available as a chatbot, so you can ask questions and give natural language commands to maintain your applications and services. To ensure accuracy and control, any action the agent takes on your behalf requires your approval.
+
+Agents have access to every resource inside the resource groups associated with the agent. Therefore, agents:
+
+- Continuously evaluate resource activity and monitor active resources
+
+- Send proactive notifications about unhealthy or unstable apps
+
+- Provide a natural language interface to issue commands
+
+An SRE Agent also integrates with [PagerDuty](https://www.pagerduty.com/) to support advanced notification solutions.
+
+> [!NOTE]
+> The SRE Agent feature is in limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540).
+
+## Key features
+
+The SRE Agent offers several key features that enhance the reliability and performance of your Azure resources:
+
+- **Proactive monitoring**: Continuous resource monitoring with real-time alerts for potential issues and daily resource reports.
+
+- **Automated mitigation**: Automatic detection and mitigation of common issues, reducing downtime and improving resource health. While agents attempt to work on your behalf, all automation requires your approval.
+
+- **Resource visualization**: Comprehensive views of your resource dependencies and health status.
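The article states that all automation requires your approval but does not describe how that approval is enforced. The following is a minimal sketch of one way such a gate could be modelled. It is not the SRE Agent's implementation: `ProposedAction`, `ApprovalGate`, and the `approver` callback are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedAction:
    """A remediation an agent would like to run (hypothetical shape)."""
    description: str
    run: Callable[[], str]  # performs the action and returns a short result


@dataclass
class ApprovalGate:
    """Runs a proposed action only after an explicit approval decision."""
    approver: Callable[[ProposedAction], bool]  # assumption: a human decision
    audit_log: List[str] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> bool:
        # Nothing is executed unless the approver says yes.
        if not self.approver(action):
            self.audit_log.append(f"rejected: {action.description}")
            return False
        result = action.run()
        self.audit_log.append(f"approved: {action.description} -> {result}")
        return True
```

In this sketch the action's `run` callable is never invoked on the rejection path, and every decision is recorded, which mirrors the article's mention of a daily list of actions taken on your behalf.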
+
+:::image type="content" source="media/sre-agent/sre-agent-knowldege-graph.png" alt-text="Screenshot of an SRE Agent knowledge graph.":::
+
+An SRE Agent works to proactively monitor and maintain your Azure services. Each day, your agent creates a resource report that provides insights into the health and status of your applications. Reports include:
+
+- **Actionable steps**: Measures you can take each day to reduce errors and harden security practices.
+
+- **Key insights**: Summaries of important details relevant to the health and maintenance of your Azure resources.
+
+- **Reasoning**: Summaries of the analysis done by your agent that help surface possible problem areas in your apps.
+
+- **Actions already taken by the agent**: A list of tasks the agent performed on your behalf that day.
+
+## Scenarios
+
+| Scenario | Possible cause | Agent mitigation |
+|---|---|---|
+| Application down | ▪ **Application code issues**: Bugs or errors in the application code can lead to crashes or unresponsiveness.<br><br>▪ **Bad deployment**: Incorrect configurations or failed deployments can cause the application to go down.<br><br>▪ **High CPU/memory/thread issues**: Resource exhaustion due to high CPU, memory, or thread usage can affect application performance. | The SRE Agent can detect these issues and provide actionable insights or fixes. For example, it can identify a decrease in web app availability that coincides with a recent slot swap and recommend swapping the slots back as a first step of mitigation. |
+| Virtual machine RDP issues | **NSG rules**: Misconfigured NSG rules on the NIC or subnet can block RDP access. | The SRE Agent can detect misconfigured NSG rules that block RDP access. Agents can also apply the correct NSG rules to restore access. |
+| Container image pull failures | ▪ **Image availability**: The requested image might not be available or could be missing.<br><br>▪ **Network connectivity**: Network issues can disrupt the connection to the container app.<br><br>▪ **Registry connectivity issues**: Problems with connecting to the container registry can prevent image pulls. | The SRE Agent can detect container image pull failures and provide detailed diagnostics. It can recommend solutions such as rolling back to the last known healthy revision and updating the image reference. |
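To make the "Virtual machine RDP issues" row concrete, here is a rough sketch of how a misconfigured NSG could be spotted. It relies on two facts about NSGs: rules are evaluated in priority order (lowest number first) with the first match deciding the outcome, and RDP uses TCP port 3389. Everything else is an assumption for illustration: the `InboundRule` shape and the function names are hypothetical, source and destination address filters are ignored, and this is not how the SRE Agent itself performs the check.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

RDP_PORT = 3389  # RDP listens on TCP 3389 by default


@dataclass(frozen=True)
class InboundRule:
    """Simplified inbound NSG rule (hypothetical shape, addresses omitted)."""
    name: str
    priority: int   # lower number is evaluated first
    access: str     # "Allow" or "Deny"
    protocol: str   # "Tcp", "Udp", or "*"
    port_low: int
    port_high: int


def first_matching_rule(rules: Iterable[InboundRule],
                        port: int = RDP_PORT,
                        protocol: str = "Tcp") -> Optional[InboundRule]:
    """Return the rule that decides traffic for the given port, if any."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol in ("*", protocol) and rule.port_low <= port <= rule.port_high:
            return rule  # first match wins; later rules are not consulted
    return None


def rdp_is_blocked(rules: Iterable[InboundRule]) -> bool:
    """True when the deciding rule denies RDP, or when nothing allows it."""
    rule = first_matching_rule(rules)
    return rule is None or rule.access == "Deny"
```

A typical misconfiguration this catches is a broad deny rule with a lower priority number than the rule that allows port 3389, so the allow rule is never reached.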
+
+An agent can provide detailed information about different aspects of your apps and resources. The following examples demonstrate the types of questions you could pose to your agent:
+
+- What can you assist me with?
+- Why isn't my application working?
+- What services is my resource connected to?
+- Can you provide best practices for my resource?
+- What's the CPU and memory utilization of my app?
+
+## Preview access
+
+Access to an SRE Agent is available only as a limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540).

articles/app-service/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -219,6 +219,8 @@ items:
 href: /azure/templates/microsoft.web/allversions
 - name: Logs and monitoring
 items:
+- name: SRE Agent overview
+href: sre-agent-overview.md
 - name: Monitor App Service
 href: monitor-app-service.md
 - name: Monitoring data reference

articles/azure-functions/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -729,6 +729,8 @@
 href: functions-hybrid-powershell.md
 - name: Troubleshoot
 items:
+- name: SRE Agent overview
+href: ../app-service/sre-agent-overview.md?toc=/azure/azure-functions/toc.json
 - name: Storage connections
 href: functions-recover-storage-account.md
 displayName: troubleshoot

articles/container-apps/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -198,6 +198,8 @@
 - name: Overview
 href: observability.md
 displayName: Observability overview
+- name: SRE Agent overview
+href: ../app-service/sre-agent-overview.md?toc=/azure/container-apps/toc.json
 - name: Application logging
 href: logging.md
 - name: Real time data
