Commit 568242d

Update sre-agent-overview.md
1 parent 4b72776 commit 568242d

1 file changed

+46
-15
lines changed

articles/app-service/sre-agent-overview.md

Lines changed: 46 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -3,48 +3,63 @@ title: Azure SRE Agent overview (preview)
33
description: Learn how AI-enabled agents help solve problems and support resilient and self-healing systems on your behalf.
44
author: craigshoemaker
55
ms.topic: conceptual
6-
ms.date: 05/16/2025
6+
ms.date: 06/13/2025
77
ms.author: cshoe
8-
ms.custom:
9-
- build-2025
8+
ms.service: azure
109
---
1110

1211
# Azure SRE Agent overview (preview)
1312

14-
Site Reliability Engineering (SRE) focuses on creating reliable, scalable systems through automation and proactive management. Azure SRE Agent brings these principles to your Azure hosted applications by providing AI-powered monitoring, troubleshooting, and remediation capabilities to your app environments. The agent automates routine operational tasks and provides reasoned insights to help you maintain application reliability while reducing manual intervention. Available as a chat interface, you can ask questions and give natural language commands to maintain your applications and services. To ensure accuracy and control, any agent action taken on your behalf requires your approval.
13+
Site Reliability Engineering (SRE) focuses on creating reliable, scalable systems through automation and proactive management. Azure SRE Agent brings these principles to your Azure-hosted applications by providing an AI-powered tool that helps sustain production cloud environments. SRE Agent helps you respond to incidents quickly and effectively, alleviating the toil of manually managing production environments. The agent uses the reasoning capabilities of large language models (LLMs) to identify the logs and metrics necessary for rapid root cause analysis and issue mitigation. Azure SRE Agent brings you better service uptime and reduced operational costs.
1514

1615
Agents have access to every resource inside the resource groups associated with the agent. Therefore, agents:
1716

1817
- Continuously evaluate resource activity, and monitor active resources
1918

2019
- Send proactive notifications about unhealthy or unstable apps
2120

22-
- Provide a natural language interface to issue commands
23-
24-
An SRE Agent also integrates with [PagerDuty](https://www.pagerduty.com/) to support advanced notification solutions.
21+
Azure SRE Agent also integrates with [Azure Monitor Alerts](/azure/azure-monitor/alerts/alerts-overview) and [PagerDuty](https://www.pagerduty.com/) to support advanced notification solutions.
2522

2623
> [!NOTE]
27-
> The SRE Agent feature is in limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540).
24+
> The SRE Agent feature is in public preview. To sign up for the wait list, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540).
2825
2926
By using an SRE Agent, you consent to the product-specific [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
3027

3128
## Key features
3229

33-
The SRE Agent offers several key features that enhance the reliability and performance of your Azure resources:
30+
Azure SRE Agent offers several key features that enhance the reliability and performance of your Azure resources:
31+
32+
- **Welcome thread**: When you first create your agent, a new thread is created which provides initial analysis of your services. The environment analysis creates a snapshot of all the resources managed by the agent. Additionally, the agent generates a list of applications found in the managed resource groups.
33+
34+
- **Daily threads**: Each day, the agent creates a resource report which summarizes the state and status of the services in your managed resource groups.
35+
36+
- **Tooling**: Querying and operations support via Azure CLI and Kubectl.
3437

35-
- **Proactive monitoring**: Continuous 24x7 resource monitoring with real-time alerts for potential issues and daily resource reports.
38+
- **Data sources**: Access to Azure Resource Manager APIs and Azure Monitor metrics data sources.
39+
40+
- **Incident management**: Diagnose incidents by chatting with the agent directly or by connecting an incident management platform to the agent. Automatically respond to Azure Monitor alerts or PagerDuty incidents with initial analysis.
41+
42+
- **Proactive monitoring**: Continuous 24x7 resource monitoring with real-time alerts for potential issues.
3643

3744
- **Automated mitigation:** Automatic detection and mitigation of common issues, reducing downtime and improving resource health. While agents attempt to work on your behalf, all automation requires your approval.
3845

3946
- **Infrastructure best practices:** Identify and remediate resources that don't follow security best practices, and help with updates.
4047

41-
- **Automates incident response:** Automatically respond to Azure Monitor alerts or PagerDuty incidents with initial analysis.
42-
4348
- **Accelerates root cause analysis:** Diagnose root causes of app issues by analyzing metrics and logs and suggest mitigations.
4449

4550
- **Resource visualization**: Comprehensive views of your resource dependencies and health status.
4651

47-
:::image type="content" source="media/sre-agent/sre-agent-knowldege-graph.png" alt-text="Screenshot of an SRE Agent knowledge graph.":::
52+
:::image type="content" source="media/overview/resources.png" alt-text="Screenshot of an SRE Agent knowledge graph." lightbox="media/overview/resources.png":::
53+
54+
- **Mitigation support**: SRE Agent can fix application configuration and dependent services. For code issues, the agent provides stack traces and can create a GitHub issue to help resolve them. The following list describes service-specific features of the agent:
55+
56+
- *Azure App Service*: Roll back deployments, scale resources up/down, and restart applications.
57+
58+
- *Azure Container Apps*: Roll back deployments, scale resources up/down, and restart applications.
59+
60+
- *Azure Kubernetes Service*: Restart pods/deployments, roll back deployments to previous revisions, scale resources up/down, and patch resource definitions.
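
The Azure Kubernetes Service actions listed above correspond to standard `kubectl` subcommands. The following Python sketch is illustrative only: the article doesn't describe how the agent invokes `kubectl`, so the helper name `aks_mitigation_command` and the sample deployment and namespace names are hypothetical. The helper only builds argument lists and never runs them.

```python
import json


def aks_mitigation_command(action, deployment, namespace="default", **options):
    """Build a kubectl argument list for one mitigation action.

    Hypothetical helper for illustration; not part of Azure SRE Agent.
    """
    base = ["kubectl", "--namespace", namespace]
    target = f"deployment/{deployment}"
    if action == "restart":
        # Rolling restart of the deployment's pods.
        return base + ["rollout", "restart", target]
    if action == "rollback":
        # kubectl treats revision 0 (its default) as "the previous revision".
        revision = options.get("revision", 0)
        return base + ["rollout", "undo", target, f"--to-revision={revision}"]
    if action == "scale":
        return base + ["scale", target, f"--replicas={options['replicas']}"]
    if action == "patch":
        # Merge patch applied to the deployment's resource definition.
        patch = json.dumps(options["patch"])
        return base + ["patch", "deployment", deployment,
                       "--type", "merge", "--patch", patch]
    raise ValueError(f"unsupported action: {action}")
```

For example, `aks_mitigation_command("scale", "web", namespace="prod", replicas=3)` produces the arguments for `kubectl --namespace prod scale deployment/web --replicas=3`. Because all agent automation requires your approval, a list like this could be presented for review before anything runs.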
61+
62+
## Reports
4863

4964
An SRE Agent works to proactively monitor and maintain your Azure services. Each day, your agent creates a resource report that provides insights into the health and status of your applications.
5065

@@ -60,8 +75,7 @@ Reports include:
6075

6176
| Scenario | Possible cause | Agent mitigation |
6277
|---|---|---|
63-
| Application down |**Application code issues**: Bugs or errors in the application code can lead to crashes or unresponsiveness.<br><br>▪ **Bad deployment**: Incorrect configurations or failed deployments can cause the application to go down.<br><br>▪ **High CPU/memory/thread issues**: Resource exhaustion due to high CPU, memory, or thread usage can affect application performance. | The SRE Agent can detect these issues and provide actionable insights or fixes. For example, it can identify a decrease in web app availability that coincides with a recent slot swap and recommend swapping back slots as first step of mitigation. |
64-
| Virtual machine RDP issues |**NSG rules**: Misconfigured NSG rules on the NIC or Subnet can block RDP access. | The SRE Agent can detect misconfigurations of NSG rules that block RDP access. Agents can also apply the correct NSG rules to restore access. |
78+
| Application down |**Application code issues**: Bugs or errors in the application code can lead to crashes or unresponsiveness.<br><br>▪ **Bad deployment**: Incorrect configurations or failed deployments can cause the application to go down.<br><br>▪ **High CPU/memory/thread issues**: Resource exhaustion due to high CPU, memory, or thread usage can affect application performance. | The SRE Agent can detect these issues and provide actionable insights or fixes. For example, it can identify a decrease in web app availability that coincides with a recent slot swap and recommend swapping back slots as the first step of mitigation. |
6579
| Container image pull failures |**Image availability**: The requested image might not be available or could be missing.<br><br>▪ **Network connectivity**: Network issues can disrupt the connection to the container app.<br><br>▪ **Registry connectivity issues**: Problems with connecting to the container registry can prevent image pulls. | The SRE Agent can detect container image pull failures and provide detailed diagnostics. It can recommend solutions such as rolling back to the last known healthy revision and updating the image reference. |
6680

6781
An agent can provide detailed information about different aspects of your apps and resources. The following examples demonstrate the types of questions you could pose to your agent:
@@ -72,6 +86,23 @@ An agent can provide detailed information about different aspects of your apps a
7286
- Can you provide best practices for my resource?
7387
- What's the CPU and memory utilization of my app?
7488

89+
Further, here are some prompts you can use to help you interact with your agent:
90+
91+
- Which apps have Dapr enabled?
92+
- List replicas for my container app
93+
- Which apps have diagnostic logging turned on?
94+
- Give me an individual heatmap for each storage account.
95+
- Which revision of my container app is currently active?
96+
- What are some best practices that my app should follow?
97+
- What is the ingress configuration for my container app?
98+
- Are there any staging slots configured for this web app?
99+
- What container images are used by each of my Container Apps?
100+
- List all resource groups that you’re managing across all subscriptions.
101+
- Draw heatmap of storage latencies over the last 14 days for storage accounts.
102+
- Show me a visualization of response times for Container Apps for last week.
103+
- List [Container Apps/Web Apps/etc.] that you’re managing across all subscriptions.
104+
- Visualize split of Container Apps vs Web Apps vs AKS clusters managed across all subscriptions as a pie chart.
105+
75106
## Preview access
76107

77108
Access to an SRE Agent is only available as a limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540).
