
feat(02-samples): add SRE incident response multi-agent sample #228

Open
ajha8 wants to merge 10 commits into strands-agents:main from ajha8:feat/sre-incident-response-agent

Conversation

@ajha8

@ajha8 ajha8 commented Mar 1, 2026

PR: feat(02-samples): Add SRE Incident Response multi-agent sample

Summary

Issue #, if available:

Description of changes:
This is a proactive and original contribution adding a missing SRE/DevOps use case. No existing issue tracks this gap. This PR adds a new sample to 02-samples/ demonstrating a multi-agent SRE (Site Reliability Engineering) incident response workflow built with the Strands Agents SDK.

Why this sample?

After reviewing the existing samples, there is no example that covers:

  • Operations / SRE use cases (vs. finance, restaurant, JIRA, audit tools)
  • Multi-agent supervisor pattern applied to real-time incident detection
  • AWS ↔ Kubernetes bridge (CloudWatch alarms → kubectl/Helm remediation)
  • Red Hat / OpenShift compatibility (kubectl tools work with oc too)

This fills a gap in the existing sample set and is relevant to DevOps/SRE engineers
who run workloads on AWS with Kubernetes or OpenShift.

What this adds

02-samples/20-sre-incident-response-agent/
├── sre_agent.py          # Main agent (4 agents + 8 tools)
├── test_sre_agent.py     # Pytest unit tests (mocked AWS, 15 tests)
├── requirements.txt
├── .env.example
├── assets/
│   └── architecture.png
└── README.md

Strands SDK concepts demonstrated

| Concept | How |
| --- | --- |
| @tool decorator | 8 tools: CloudWatch, Logs, kubectl, Helm, Slack |
| Agents-as-tools pattern | 3 specialist sub-agents each wrapped as a @tool and passed to the supervisor via tools= |
| BedrockModel | Configurable model provider |
| Dry-run safety | All destructive actions gated by DRY_RUN=true |
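
The dry-run gate in the last row could look roughly like the sketch below. This is an illustration only, not the code in sre_agent.py: the helper name is_dry_run, the "[DRY RUN]" message format, and the choice to default to dry-run when the variable is unset are all assumptions. The @tool decorator is left off so the snippet runs without the SDK installed.

```python
import os
import subprocess


def is_dry_run() -> bool:
    """Treat anything other than an explicit 'false' as dry-run (fail safe)."""
    return os.environ.get("DRY_RUN", "true").strip().lower() != "false"


def kubectl_rollout_restart(deployment: str, namespace: str = "default") -> str:
    """Restart a deployment, or only describe the command when DRY_RUN is on."""
    cmd = ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]
    if is_dry_run():
        # Nothing touches the cluster; the caller just sees what would have run.
        return "[DRY RUN] would execute: " + " ".join(cmd)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return completed.stdout if completed.returncode == 0 else completed.stderr
```

Defaulting to dry-run means a missing or misspelled environment variable cannot trigger a restart by accident; the operator has to opt in with DRY_RUN=false.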

Agent architecture

supervisor_agent (Incident Commander)
    ├── cloudwatch_agent   → list_active_alarms, get_metric_statistics, fetch_log_events
    ├── rca_agent          → reasoning-only, no tools (pure LLM analysis)
    └── remediation_agent  → kubectl_get, kubectl_rollout_restart, helm_rollback, helm_scale
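
To make the tree above concrete, here is a framework-free sketch of the agents-as-tools wiring. It deliberately does not use the Strands SDK: SketchAgent, as_tool, and stub are hypothetical stand-ins, and a real agent would let the model decide which tool to call. Only the agent and tool names are taken from the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SketchAgent:
    """Stand-in for an LLM agent: a name plus the tools it may call."""
    name: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def __call__(self, request: str) -> str:
        # A real agent would ask the model which tool to use; here we only report them.
        available = ", ".join(sorted(self.tools)) or "no tools"
        return f"{self.name} handled '{request}' using: {available}"


def as_tool(agent: SketchAgent) -> Callable[[str], str]:
    """Expose an agent as a plain callable so another agent can list it as a tool."""
    def call(request: str) -> str:
        return agent(request)
    call.__name__ = agent.name
    return call


def stub(_: str) -> str:
    """Placeholder for the real tool bodies (CloudWatch, kubectl, Helm calls)."""
    return "stub"


cloudwatch_agent = SketchAgent(
    "cloudwatch_agent",
    {n: stub for n in ("list_active_alarms", "get_metric_statistics", "fetch_log_events")},
)
rca_agent = SketchAgent("rca_agent")  # reasoning-only, no tools
remediation_agent = SketchAgent(
    "remediation_agent",
    {n: stub for n in ("kubectl_get", "kubectl_rollout_restart", "helm_rollback", "helm_scale")},
)

# The supervisor's only tools are the three specialists, each wrapped as a callable.
supervisor_agent = SketchAgent(
    "supervisor_agent",
    {a.name: as_tool(a) for a in (cloudwatch_agent, rca_agent, remediation_agent)},
)
```

The point of the pattern is that the supervisor never sees kubectl or CloudWatch directly; it delegates to a specialist, which keeps each agent's tool list small.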

Testing

pip install -r requirements.txt
pytest test_sre_agent.py -v

All 15 tests pass without AWS credentials (mocked boto3). Includes a regression test for the AlarmNamePrefix namespace filtering bug.
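
For readers who want a feel for the mocked style, a test of this kind might look like the sketch below. It is not taken from test_sre_agent.py. The PR does not describe the AlarmNamePrefix bug in detail, so the sketch assumes one plausible reading: a metric namespace was being sent as AlarmNamePrefix, and the fix filters on each alarm's Namespace field instead. The function signature and the alarm names are hypothetical, and pagination is ignored for brevity.

```python
from unittest.mock import MagicMock


def list_active_alarms(cloudwatch, namespace=None):
    """Return names of alarms in ALARM state, optionally limited to one metric namespace."""
    # describe_alarms has no namespace filter, so filtering happens client-side.
    response = cloudwatch.describe_alarms(StateValue="ALARM")
    alarms = response.get("MetricAlarms", [])
    if namespace is not None:
        alarms = [a for a in alarms if a.get("Namespace") == namespace]
    return [a["AlarmName"] for a in alarms]


def test_namespace_is_filtered_client_side():
    # A MagicMock stands in for the boto3 CloudWatch client, so no credentials are needed.
    cloudwatch = MagicMock()
    cloudwatch.describe_alarms.return_value = {
        "MetricAlarms": [
            {"AlarmName": "ec2-high-cpu", "Namespace": "AWS/EC2"},
            {"AlarmName": "rds-low-storage", "Namespace": "AWS/RDS"},
        ]
    }

    assert list_active_alarms(cloudwatch, namespace="AWS/EC2") == ["ec2-high-cpu"]

    # Regression guard: the namespace must never be sent as an alarm-name prefix.
    _, kwargs = cloudwatch.describe_alarms.call_args
    assert "AlarmNamePrefix" not in kwargs
```

Passing the client in as an argument is a design choice made here for testability; patching boto3.client with unittest.mock.patch would work just as well.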

Related

  • Bridges AWS open source (Strands Agents, CloudWatch) with Red Hat/Kubernetes tooling
  • Works with OpenShift by swapping kubectl → oc in the remediation tools
  • Designed to be extended with PagerDuty, GitHub Issues, or custom runbooks
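
One way the kubectl-to-oc swap could be made configurable is sketched below. The KUBE_CLI environment variable and both helper names are assumptions introduced for illustration; the PR only says the binary is swapped in the remediation tools, not how.

```python
import os
from typing import List

# Only these binaries are accepted, so the variable cannot point at an arbitrary program.
ALLOWED_CLIS = {"kubectl", "oc"}


def cluster_cli() -> str:
    """Pick the cluster CLI from KUBE_CLI (hypothetical variable), defaulting to kubectl."""
    cli = os.environ.get("KUBE_CLI", "kubectl")
    if cli not in ALLOWED_CLIS:
        raise ValueError(f"unsupported cluster CLI: {cli!r}")
    return cli


def build_get_command(resource: str, namespace: str = "default") -> List[str]:
    """Build a read-only 'get' command that works the same way for kubectl and oc."""
    return [cluster_cli(), "get", resource, "-n", namespace, "-o", "json"]
```

Because oc accepts the same get, rollout, and scale verbs as kubectl, swapping the first element of the argument list is usually enough for commands like these.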

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

github-actions bot commented Mar 6, 2026

Latest scan for commit: 6b6e048 | Updated: 2026-03-19 19:26:13 UTC

✅ Security Scan Report (PR Files Only)

Scanned Files

  • 02-samples/20-sre-incident-response-agent/.env.example
  • 02-samples/20-sre-incident-response-agent/README.md
  • 02-samples/20-sre-incident-response-agent/assets/architecture.png
  • 02-samples/20-sre-incident-response-agent/requirements.txt
  • 02-samples/20-sre-incident-response-agent/sre_agent.py
  • 02-samples/20-sre-incident-response-agent/test_sre_agent.py
  • 02-samples/README.md

Security Scan Results

| Critical | High | Medium | Low | Info |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 |

Threshold: High

No security issues detected in your changes. Great job!

This scan only covers files changed in this PR.

@ajha8
Author

ajha8 commented Mar 12, 2026

@ryanycoleman could you please review my changes? Thanks

@manoj-selvakumar5
Collaborator

@ajha8 — Let me review your PR

@ajha8
Author

ajha8 commented Mar 12, 2026

@ajha8 — Let me review your PR

@manoj-selvakumar5 Thank you!

ajha8 force-pushed the feat/sre-incident-response-agent branch from 25fa18e to 6455a63 on March 12, 2026 22:58
@ajha8
Author

ajha8 commented Mar 12, 2026

@ajha8 — Let me review your PR

@manoj-selvakumar5 - I had to rebase onto the latest main, which put the initial workflow back into the waiting-for-approval state. Could you please approve the workflow run and review my PR?

ajha8 closed this Mar 18, 2026
ajha8 force-pushed the feat/sre-incident-response-agent branch from 6455a63 to 95e59d4 on March 18, 2026 18:44
ajha8 reopened this Mar 18, 2026
@ajha8
Author

ajha8 commented Mar 18, 2026

@ryanycoleman @clareliguori could you please review my PR with a simple new sample addition? It's been pending review for over two weeks now. Thanks!
