| 1 | +--- |
| 2 | +title: SRE Agent overview (preview) |
| 3 | +description: Learn how AI-enabled agents help solve problems and support resilient and self-healing systems on your behalf. |
| 4 | +services: container-apps |
| 5 | +author: craigshoemaker |
| 6 | +ms.service: azure-container-apps |
| 7 | +ms.topic: conceptual |
| 8 | +ms.date: 05/08/2025 |
| 9 | +ms.author: cshoe |
| 10 | +--- |
| 11 | + |
| 12 | +# SRE Agent overview (preview) |
| 13 | + |
| 14 | +Site Reliability Engineering (SRE) focuses on creating reliable, scalable systems through automation and proactive management. An SRE Agent brings these principles to your cloud environment by providing AI-powered monitoring, troubleshooting, and remediation capabilities. It automates routine operational tasks and provides reasoned insights to help you maintain application reliability while reducing manual intervention. The agent is available as a chatbot, so you can ask questions and give natural language commands to maintain your applications and services. To ensure accuracy and control, any action the agent takes on your behalf requires your approval. |
| 15 | + |
| 16 | +Agents have access to every resource inside the resource groups associated with the agent. Therefore, agents: |
| 17 | + |
| 18 | +- Continuously evaluate resource activity and monitor active resources |
| 19 | + |
| 20 | +- Send proactive notifications about unhealthy or unstable apps |
| 21 | + |
| 22 | +- Provide a natural language interface to issue commands |
| 23 | + |
| 24 | +An SRE Agent also integrates with [PagerDuty](https://www.pagerduty.com/) to support advanced notification solutions. |
| 25 | + |
| 26 | +> [!NOTE] |
| 27 | +> The SRE Agent feature is in limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540). |
| 28 | + |
| 29 | +## Key features |
| 30 | + |
| 31 | +The SRE Agent offers several key features that enhance the reliability and performance of your Azure resources: |
| 32 | + |
| 33 | +- **Proactive monitoring**: Continuous resource monitoring with real-time alerts for potential issues and daily resource reports. |
| 34 | + |
| 35 | +- **Automated mitigation**: Automatic detection and mitigation of common issues, reducing downtime and improving resource health. While agents attempt to work on your behalf, all automation requires your approval. |
| 36 | + |
| 37 | +- **Resource visualization**: Comprehensive views of your resource dependencies and health status. |
| 38 | + |
| 39 | + :::image type="content" source="media/sre-agent/sre-agent-knowldege-graph.png" alt-text="Screenshot of an SRE Agent knowledge graph."::: |
| 40 | + |
| 41 | +An SRE Agent works to proactively monitor and maintain your Azure services. Each day, your agent creates a resource report that provides insights into the health and status of your applications. Reports include: |
| 42 | + |
| 43 | +- **Actionable steps**: Measures you can take each day to reduce errors and harden security practices. |
| 44 | + |
| 45 | +- **Key insights**: Summaries of important details relevant to the health and maintenance of your Azure resources. |
| 46 | + |
| 47 | +- **Reasoning**: Summaries of the analysis your agent performed, which help surface possible problem areas in your apps. |
| 48 | + |
| 49 | +- **Actions already taken by the agent**: A list of tasks the agent performed on your behalf that day. |
| 50 | + |
| 51 | +## Scenarios |
| 52 | + |
| 53 | +| Scenario | Possible cause | Agent mitigation | |
| 54 | +|---|---|---| |
| 55 | +| Application down | ▪ **Application code issues**: Bugs or errors in the application code can lead to crashes or unresponsiveness.<br><br>▪ **Bad deployment**: Incorrect configurations or failed deployments can cause the application to go down.<br><br>▪ **High CPU/memory/thread issues**: Resource exhaustion due to high CPU, memory, or thread usage can affect application performance. | The SRE Agent can detect these issues and provide actionable insights or fixes. For example, it can identify a decrease in web app availability that coincides with a recent slot swap and recommend swapping the slots back as a first mitigation step. | |
| 56 | +| Virtual machine RDP issues | ▪ **NSG rules**: Misconfigured network security group (NSG) rules on the NIC or subnet can block RDP access. | The SRE Agent can detect misconfigured NSG rules that block RDP access. With your approval, agents can also apply the correct NSG rules to restore access. | |
| 57 | +| Container image pull failures | ▪ **Image availability**: The requested image might not be available or could be missing.<br><br>▪ **Network connectivity**: Network issues can disrupt the connection to the container app.<br><br>▪ **Registry connectivity issues**: Problems with connecting to the container registry can prevent image pulls. | The SRE Agent can detect container image pull failures and provide detailed diagnostics. It can recommend solutions such as rolling back to the last known healthy revision and updating the image reference. | |
| 58 | + |
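To make the "Virtual machine RDP issues" scenario more concrete, the following is a minimal sketch of how a blocking NSG rule might be identified. It isn't the SRE Agent's actual implementation. The `NsgRule` class and the `rdp_blocked_by` function are hypothetical names invented for this illustration, and the model is deliberately simplified: it assumes inbound traffic from the internet and ignores protocol and address prefixes. It relies only on two documented NSG behaviors: rules are evaluated in priority order (lowest number first) with the first match deciding the outcome, and unmatched inbound traffic falls through to the default `DenyAllInBound` rule.

```python
from dataclasses import dataclass

RDP_PORT = 3389  # default port for Remote Desktop Protocol


@dataclass(frozen=True)
class NsgRule:
    """Hypothetical, simplified model of an inbound NSG security rule."""
    name: str
    priority: int    # lower number is evaluated first
    access: str      # "Allow" or "Deny"
    port_start: int  # inclusive start of the destination port range
    port_end: int    # inclusive end of the destination port range

    def matches(self, port: int) -> bool:
        return self.port_start <= port <= self.port_end


def rdp_blocked_by(rules):
    """Return the name of the rule that blocks inbound RDP, or None if RDP is allowed."""
    # Evaluate rules in priority order; the first matching rule decides.
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(RDP_PORT):
            return rule.name if rule.access == "Deny" else None
    # No custom rule matched, so the default inbound deny rule applies.
    return "DenyAllInBound"


rules = [
    NsgRule("allow-rdp", 300, "Allow", 3389, 3389),
    NsgRule("deny-high-ports", 200, "Deny", 1024, 65535),
]
print(rdp_blocked_by(rules))  # deny-high-ports
```

In this example, the broad deny rule has a lower priority number than the RDP allow rule, so it's evaluated first and blocks access. A possible mitigation would be to give the allow rule a priority number lower than 200.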
| 59 | +An agent can provide detailed information about different aspects of your apps and resources. The following examples demonstrate the types of questions you could pose to your agent: |
| 60 | + |
| 61 | +- What can you assist me with? |
| 62 | +- Why isn't my application working? |
| 63 | +- What services is my resource connected to? |
| 64 | +- Can you provide best practices for my resource? |
| 65 | +- What's the CPU and memory utilization of my app? |
| 66 | + |
| 67 | +## Preview access |
| 68 | + |
| 69 | +The SRE Agent is available only as a limited preview. To sign up for access, fill out the [SRE Agent application](https://go.microsoft.com/fwlink/?linkid=2319540). |