>NOTE: for sample code for this [AWS Blog](https://aws.amazon.com/blogs/architecture/architecting-conversational-observability-for-cloud-applications/) please use the [code branch](https://github.com/aws-samples/sample-eks-troubleshooting-rag-chatbot/tree/blog) from this repository.
2
+
This code branch is updated and is related to the AWS guidance below.
3
+
1
4
# Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS
2
5
3
-
This project provides an example of Agemtic AI approaches for troubleshooting EKS (Elastic Kubernetes Service) issues via ChatOps:
6
+
This guidance provides an example of a Platform Engineering approach to troubleshooting Amazon EKS (Elastic Kubernetes Service) issues using an Agentic AI workflow integrated with ChatOps via Slack.
4
7
5
-
**Strands-based AI Agentic workflow Troubleshooting**: An intelligent agent using AWS Strands Agent framework with EKS MCP server integration for real-time troubleshooting
8
+
**Strands-based AI Agentic workflow Troubleshooting**: An intelligent agent using AWS [Strands Agent framework](http://strandsagents.com/latest/) with [EKS MCP server](https://awslabs.github.io/mcp/servers/eks-mcp-server) integration for real-time troubleshooting
6
9
7
-
It can be deployed using Terraform, which provisions the necessary AWS resources including EKS cluster, monitoring tools, and application-specific infrastructure.
10
+
It can be deployed using [Terraform](https://developer.hashicorp.com/terraform), which provisions all necessary AWS resources, including an EKS cluster with its compute plane, required add-ons, monitoring tools, and application-specific infrastructure.
_Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Reference Architecture_
17
19
18
20
### Reference Architecture Steps
19
21
20
-
1.**Log Collection and Streaming**: Amazon EKS cluster generates logs from various components (pods, services, nodes) which are collected by Fluent Bit and streamed to Amazon Kinesis Data Streams for real-time processing.
22
+
1.**Log Collection and Streaming**: The Amazon EKS cluster generates logs from various components (pods, services, nodes, etc.), which are collected by Fluent Bit and streamed to Amazon Kinesis Data Streams for real-time processing.
21
23
22
-
2.**Log Processing and Indexing**: Amazon Kinesis Data Streams processes the incoming log data and forwards it to Amazon OpenSearch for indexing and storage, enabling fast search and retrieval capabilities.
24
+
2.**Log Processing and Indexing**: Amazon Kinesis Data Streams processes the incoming log data and forwards it to Amazon OpenSearch for indexing and "knowledge base" data storage, enabling fast search and retrieval capabilities.
23
25
24
26
3.**Vector Storage and Embeddings**: Log data is processed through Amazon Bedrock to generate embeddings, which are stored in Amazon S3 for semantic search capabilities and knowledge retrieval.
25
27
@@ -31,27 +33,27 @@ _Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow
31
33
32
34
7.**Security and Access Control**: AWS IAM and EKS Pod Identity ensure secure access to cluster resources while maintaining proper permissions and audit trails.
33
35
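The log collection and indexing steps above can be illustrated with a minimal Python sketch. It assumes a consumer that receives base64-encoded JSON records, as Kinesis delivers them to consumers such as AWS Lambda, and that Fluent Bit's Kubernetes filter has attached `log` and `kubernetes` fields. The function names and the shape of the output document are hypothetical and are not taken from this repository.

```python
import base64
import json
from typing import Any, Dict


def decode_log_record(encoded: str) -> Dict[str, Any]:
    """Decode one base64-encoded JSON log record into a dict."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


def to_index_document(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields a search index would need (field names are assumed)."""
    kube = record.get("kubernetes", {})
    return {
        "message": record.get("log", "").strip(),
        "namespace": kube.get("namespace_name", "unknown"),
        "pod": kube.get("pod_name", "unknown"),
    }


if __name__ == "__main__":
    # Hypothetical record, shaped like Fluent Bit output for a container log line.
    sample = {
        "log": "OOMKilled: container exceeded memory limit\n",
        "kubernetes": {"namespace_name": "prod", "pod_name": "api-7d9f"},
    }
    encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    print(to_index_document(decode_log_record(encoded)))
```

Reducing each record to a small, flat document before indexing keeps the search index focused on the fields used during troubleshooting. The actual field selection in this guidance may differ.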
34
-
### Agentic AI workflow Architecture
36
+
### Agentic AI Workflow Architecture
35
37
36
38

37
39
38
40
_Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Troubleshooting Workflow_
39
41
40
-
### Agentic AI workflow Architecture Steps
42
+
### Agentic AI Workflow Architecture Steps
41
43
42
44
1.**Setup** - Guidance workloads are deployed into an Amazon EKS cluster, configured for application readiness with compute plane managed by Karpenter auto-scaler.
43
45
44
-
2.**User Interaction** - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through designated Slack channel integrated with K8s Troubleshooting AI Agent. Its components are running as containers on the EKS deployed from previously built images hosted in Elastic Container registry (ECR) via Helm charts that reference the services-built images
46
+
2.**User Interaction** - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through a designated Slack channel integrated with the K8s Troubleshooting AI Agent. Its components run as containers on the EKS cluster, deployed via [Helm](https://helm.sh/) charts that reference previously built service images hosted in Amazon Elastic Container Registry (ECR).
45
47
46
-
3.**Message Reception & Slack Integration** - Slack receives user messages via AWS Elastic Load Balancer and establishes a WebSocket connection (Socket Mode) to the Orchestrator agent running in the EKS cluster.
48
+
3.**Message Reception & Slack Integration** - Slack receives user messages coming via AWS Elastic Load Balancer and establishes a WebSocket connection (Socket Mode) to the Orchestrator agent running in the EKS cluster.
47
49
48
-
4.**Intelligent Message Classification & Orchestration** - Orchestrator agent receives users’ message and calls Nova Micro model via Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If an issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining overall session context.
50
+
4.**Intelligent Message Classification & Orchestration** - Orchestrator agent receives the user's message and calls the [Nova Micro](https://docs.aws.amazon.com/ai/responsible-ai/nova-micro-lite-pro/overview.html) model via the Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If an issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining overall session context.
49
51
50
-
5.**Historical Knowledge Retrieval** - Orchestrator agent invokes the Memory agent, which connects to Amazon S3 Vectors based knowledge base to search for similar troubleshooting cases for precise issue classification
52
+
5.**Historical Knowledge Retrieval** - Orchestrator agent invokes the Memory agent, which connects to the Amazon S3 Vectors-based knowledge base to search for similar troubleshooting cases for precise issue classification.
51
53
52
-
6.**Semantic Vector Matching** - The Memory agent invokes Titan Embeddings model via Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base
54
+
6.**Semantic Vector Matching** - The Memory agent invokes the [Titan Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html) model via the Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base.
53
55
54
-
7.**Real-Time Cluster Intelligence** - Orchestrator agent invokes the K8s Specialist agent, which utilizes the hosted [AWS EKS Model Context Protocol (MCP) Server](https://docs.aws.amazon.com/eks/latest/userguide/eks-mcp-introduction.html) to execute commands against the EKS API Server. The MCP Server gathers real-time cluster state, pod logs, events, and resource metrics to better “understand” the current problem context.
56
+
7.**Real-Time Cluster Intelligence** - Orchestrator agent invokes the K8s Specialist agent, which utilizes the hosted [AWS EKS Model Context Protocol (MCP) Server](https://docs.aws.amazon.com/eks/latest/userguide/eks-mcp-introduction.html) to execute commands against the EKS API Server. The MCP client gathers real-time cluster state, pod logs, events, and resource metrics to better “understand” the current problem context.
55
57
56
58
8.**Intelligent Issue Analysis** - K8s Specialist agent sends the collected cluster data to an Anthropic Claude model via Amazon Bedrock for intelligent issue analysis and resolution generation.
57
59
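The classification, memory retrieval, and analysis steps above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the repository's Strands implementation: `is_k8s_issue`, `embed`, and `diagnose` are hypothetical callables standing in for the Nova Micro, Titan Embeddings, and Claude calls made through Amazon Bedrock, and the in-memory cosine search stands in for the similarity query that S3 Vectors would run on the service side.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class PastCase:
    """A previously resolved issue with its (assumed precomputed) embedding."""
    summary: str
    embedding: List[float]


def find_similar_cases(
    query: Sequence[float], cases: Sequence[PastCase], top_k: int = 3
) -> List[Tuple[float, PastCase]]:
    """Return the top_k past cases ranked by similarity to the query vector."""
    scored = [(cosine_similarity(query, case.embedding), case) for case in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def handle_message(
    text: str,
    is_k8s_issue: Callable[[str], bool],              # stand-in for the classifier call
    embed: Callable[[str], List[float]],              # stand-in for the embeddings call
    cases: Sequence[PastCase],                        # stand-in for the knowledge base
    diagnose: Callable[[str, List[str]], str],        # stand-in for the analysis call
) -> Optional[str]:
    """Route one chat message: classify, retrieve similar cases, then analyze."""
    if not is_k8s_issue(text):
        return None  # non-K8s messages do not start a troubleshooting workflow
    similar = [case.summary for _, case in find_similar_cases(embed(text), cases)]
    return diagnose(text, similar)
```

Passing the model calls in as callables keeps the routing logic testable without AWS credentials. In a real deployment each callable would wrap the corresponding Bedrock or MCP invocation.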
@@ -82,21 +84,19 @@ _Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow
82
84
83
85
Before running this project, make sure you have the following tools installed:
- Create a Slack app with the following **Bot Token Scopes**:
102
102
-`app_mentions:read` - View messages mentioning the bot
@@ -129,7 +129,7 @@ Before running this project, make sure you have the following tools installed:
129
129
130
130

131
131
132
-
_Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Channel_
132
+
_Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Slack Channel_
133
133
134
134
## Plan your Deployment
135
135
@@ -152,7 +152,6 @@ Before running this project, make sure you have the following tools installed:
152
152
|[AWS Lambda](https://aws.amazon.com/lambda/)| Optional Service | Provides serverless compute for processing Slack webhooks and handling event-driven troubleshooting workflows when Slack integration is enabled. |
153
153
154
154
155
-
156
155
### Cost
157
156
158
157
You are responsible for the cost of the AWS services used while running this guidance.
@@ -176,7 +175,7 @@ deployment as per the guidance. This **does not** factor any model deployments o
@@ -238,165 +237,15 @@ Workload Ready Cluster. Here are the key security components and considerations:
238
237
239
238
Please see the detailed [Implementation Guide](https://implementationguides.kits.eventoutfitters.aws.dev/tbst-eks-rag-1017/compute/troubleshooting-amazon-eks-using-rag-based-chatbot.html) for instructions on solution deployment, validation, basic troubleshooting, and uninstallation options.
240
239
241
-
<!--
242
-
243
-
### Option 1: Strands-based Agentic AI Workflow Troubleshooting Deployment
244
-
245
-
The agentic approach uses the AWS Strands Agent framework with EKS MCP server integration for intelligent, real-time troubleshooting.
246
-
247
-
#### Setup Steps
248
-
249
-
1. **Set your AWS region:**
250
-
```bash
251
-
export AWS_REGION="us-east-1" # Change to your preferred region
- **Access Denied (Bedrock)**: Ensure your AWS account has access to the specified Bedrock model
386
-
- **Image Pull Errors**: Verify ECR repository exists and credentials are correct
387
-
- **Slack Integration**: Check bot tokens and permissions
388
-
- **Pod Identity**: Ensure EKS Pod Identity Agent is enabled
389
-
390
-
-->
391
-
392
-
## Acknowledgments
240
+
## References
393
241
394
242
This project uses:
395
243
396
244
-[Terraform AWS EKS Blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) for infrastructure
397
245
-[AWS Strands Agent Framework](https://github.com/aws/strands) for multi-agent orchestration (Agentic deployment)
398
246
-[EKS MCP Server](https://github.com/aws/eks-mcp-server) for Kubernetes integration via Model Context Protocol (Agentic deployment)
399
-
-[Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) for semantic vector matching and solution content validation
247
+
-[Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) model hosting for semantic vector matching and solution content validation