Commit 8d8e544

Update README.md
Added note about the code branch for the published blog
1 parent a354056 commit 8d8e544

1 file changed

+24
-175
lines changed

README.md

Lines changed: 24 additions & 175 deletions
@@ -1,25 +1,27 @@
1+
>NOTE: For the sample code that accompanies this [AWS Blog](https://aws.amazon.com/blogs/architecture/architecting-conversational-observability-for-cloud-applications/), please use the [code branch](https://github.com/aws-samples/sample-eks-troubleshooting-rag-chatbot/tree/blog) of this repository.
2+
This code branch is updated and is related to the AWS guidance below.
3+
14
# Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS
25

3-
This project provides an example of Agentic AI approaches for troubleshooting EKS (Elastic Kubernetes Service) issues via ChatOps:
6+
This guidance provides an example of a Platform Engineering approach to troubleshooting Amazon EKS (Elastic Kubernetes Service) issues using an Agentic AI workflow integrated with ChatOps via Slack:
47

5-
**Strands-based AI Agentic workflow Troubleshooting**: An intelligent agent using AWS Strands Agent framework with EKS MCP server integration for real-time troubleshooting
8+
**Strands-based AI Agentic workflow Troubleshooting**: An intelligent agent using AWS [Strands Agent framework](http://strandsagents.com/latest/) with [EKS MCP server](https://awslabs.github.io/mcp/servers/eks-mcp-server) integration for real-time troubleshooting
69

7-
It can be deployed using Terraform, which provisions the necessary AWS resources including EKS cluster, monitoring tools, and application-specific infrastructure.
10+
It can be deployed using [Terraform](https://developer.hashicorp.com/terraform), which provisions all necessary AWS resources, including an EKS cluster with its compute plane, required add-ons, monitoring tools, and application-specific infrastructure.
811

912
## Architecture
1013

1114
### Reference Architecture - EKS Cluster
12-
<!--static/images/chatbot-architecture.jpg-->
1315

1416
![Reference Architecture Diagram](/static/images/EKS%20troubleshooting%20agentic%20AI%20diagram%201.png)
1517

1618
_Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Reference Architecture_
1719

1820
### Reference Architecture Steps
1921

20-
1. **Log Collection and Streaming**: Amazon EKS cluster generates logs from various components (pods, services, nodes) which are collected by Fluent Bit and streamed to Amazon Kinesis Data Streams for real-time processing.
22+
1. **Log Collection and Streaming**: The Amazon EKS cluster generates logs from various components (pods, services, nodes, etc.), which are collected by Fluent Bit and streamed to Amazon Kinesis Data Streams for real-time processing.
2123

22-
2. **Log Processing and Indexing**: Amazon Kinesis Data Streams processes the incoming log data and forwards it to Amazon OpenSearch for indexing and storage, enabling fast search and retrieval capabilities.
24+
2. **Log Processing and Indexing**: Amazon Kinesis Data Streams processes the incoming log data and forwards it to Amazon OpenSearch for indexing and "knowledge base" data storage, enabling fast search and retrieval capabilities.
2325

2426
3. **Vector Storage and Embeddings**: Log data is processed through AWS Bedrock to generate embeddings, which are stored in Amazon S3 for semantic search capabilities and knowledge retrieval.
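
The semantic search mentioned in step 3 comes down to comparing embedding vectors. The following is a minimal sketch of that comparison, not the repository's implementation: it does not call Amazon Bedrock or S3 Vectors, and the case names and 3-dimensional vectors are hypothetical stand-ins for real embeddings. It uses cosine similarity, the distance metric that the setup commands removed later in this diff pass to the index (`--distance-metric cosine`).

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no match"
    return dot / (norm_a * norm_b)


def top_matches(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Rank stored vectors by similarity to the query and keep the best k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from an embeddings model.
index = {
    "oom-killed-pod": [0.9, 0.1, 0.0],
    "image-pull-error": [0.0, 1.0, 0.0],
    "dns-timeout": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

for name, score in top_matches(query, index):
    print(f"{name}: {score:.3f}")
```

In this toy data the query vector points almost in the same direction as `oom-killed-pod`, so that case ranks first. A managed vector index performs the same kind of ranking at scale.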
2527

@@ -31,27 +33,27 @@ _Figure 1: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow
3133

3234
7. **Security and Access Control**: AWS IAM and EKS Pod Identity ensure secure access to cluster resources while maintaining proper permissions and audit trails.
3335

34-
### Agentic AI workflow Architecture
36+
### Agentic AI Workflow Architecture
3537

3638
![Agentic AI workflow Architecture Diagram](/static/images/EKS%20troubleshooting%20agentic%20AI%20diagram%202b.png)
3739

3840
_Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Troubleshooting Workflow_
3941

40-
### Agentic AI workflow Architecture Steps
42+
### Agentic AI Workflow Architecture Steps
4143

4244
1. **Setup** - Guidance workloads are deployed into an Amazon EKS cluster, configured for application readiness with compute plane managed by Karpenter auto-scaler.
4345

44-
2. **User Interaction** - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through designated Slack channel integrated with K8s Troubleshooting AI Agent. Its components are running as containers on the EKS deployed from previously built images hosted in Elastic Container registry (ECR) via Helm charts that reference the services-built images
46+
2. **User Interaction** - Users (DevOps engineers, SREs, developers) who encounter Kubernetes (K8s) issues send troubleshooting requests through a designated Slack channel integrated with the K8s Troubleshooting AI Agent. The agent's components run as containers on the EKS cluster, deployed via [Helm](https://helm.sh/) charts that reference previously built images hosted in Amazon Elastic Container Registry (ECR).
4547

46-
3. **Message Reception & Slack Integration** - Slack receives user messages via AWS Elastic Load Balancer and establishes a WebSocket connection (Socket Mode) to the Orchestrator agent running in the EKS cluster.
48+
3. **Message Reception & Slack Integration** - Slack receives user messages coming via AWS Elastic Load Balancer and establishes a WebSocket connection (Socket Mode) to the Orchestrator agent running in the EKS cluster.
4749

48-
4. **Intelligent Message Classification & Orchestration** - Orchestrator agent receives users’ message and calls Nova Micro model via Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If an issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining overall session context.
50+
4. **Intelligent Message Classification & Orchestration** - Orchestrator agent receives users’ message and calls [Nova Micro](https://docs.aws.amazon.com/ai/responsible-ai/nova-micro-lite-pro/overview.html) model via Amazon Bedrock API to determine whether the message requires K8s troubleshooting. If an issue is classified as K8s-related, the Orchestrator agent initiates a workflow by delegating tasks to specialized agents while maintaining overall session context.
4951

50-
5. **Historical Knowledge Retrieval** - Orchestrator agent invokes the Memory agent, which connects to Amazon S3 Vectors based knowledge base to search for similar troubleshooting cases for precise issue classification
52+
5. **Historical Knowledge Retrieval** - Orchestrator agent invokes the Memory agent, which connects to Amazon S3 Vectors based knowledge base to search for similar troubleshooting cases for precise issue classification
5153

52-
6. **Semantic Vector Matching** - The Memory agent invokes Titan Embeddings model via Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base
54+
6. **Semantic Vector Matching** - The Memory agent invokes [Titan Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html) model via Amazon Bedrock API to generate semantic embeddings and perform vector similarity matching against the shared S3 Vectors knowledge base
5355

54-
7. **Real-Time Cluster Intelligence** - Orchestrator agent invokes the K8s Specialist agent, which utilizes the hosted [AWS EKS Model Context Protocol (MCP) Server](https://docs.aws.amazon.com/eks/latest/userguide/eks-mcp-introduction.html) to execute commands against the EKS API Server. The MCP Server gathers real-time cluster state, pod logs, events, and resource metrics to better “understand” the current problem context.
56+
7. **Real-Time Cluster Intelligence** - Orchestrator agent invokes the K8s Specialist agent, which utilizes the hosted [AWS EKS Model Context Protocol (MCP) Server](https://docs.aws.amazon.com/eks/latest/userguide/eks-mcp-introduction.html) to execute commands against the EKS API Server. The MCP Server client gathers real-time cluster state, pod logs, events, and resource metrics to better “understand” the current problem context.
5557

5658
8. **Intelligent Issue Analysis** - K8s Specialist agent sends the collected cluster data to Anthropic Claude model via Amazon Bedrock for intelligent issue analysis and resolution generation.
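
The control flow in steps 4 through 8 can be sketched as a small orchestration function. This is an illustrative outline under stated assumptions, not the repository's code and not the Strands API. The `Session` class, `orchestrate`, and the three stand-in callables are hypothetical names, and the keyword check only takes the place of the model-based classification described in step 4.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Session:
    """Context the orchestrator carries across agent calls (hypothetical)."""
    message: str
    similar_cases: List[str] = field(default_factory=list)
    analysis: Optional[str] = None


def orchestrate(
    message: str,
    classify: Callable[[str], bool],
    recall: Callable[[str], List[str]],
    diagnose: Callable[[str, List[str]], str],
) -> Optional[Session]:
    """Classify the message, then delegate to the memory and specialist agents."""
    if not classify(message):  # step 4: ignore messages that are not K8s-related
        return None
    session = Session(message=message)
    session.similar_cases = recall(message)  # steps 5-6: historical knowledge
    session.analysis = diagnose(message, session.similar_cases)  # steps 7-8
    return session


# Stand-ins for the real agents; each would call a model or the MCP server.
def keyword_classifier(message: str) -> bool:
    keywords = ("pod", "node", "kubectl", "crashloopbackoff")
    text = message.lower()
    return any(word in text for word in keywords)


def fake_recall(message: str) -> List[str]:
    return ["pod restarted after hitting memory limit"]


def fake_diagnose(message: str, cases: List[str]) -> str:
    return f"checked cluster state; {len(cases)} similar case(s) found"


result = orchestrate(
    "My pod is stuck in CrashLoopBackOff",
    keyword_classifier, fake_recall, fake_diagnose,
)
ignored = orchestrate(
    "What's for lunch?", keyword_classifier, fake_recall, fake_diagnose,
)
```

Passing the agents in as callables keeps the orchestrator independent of any particular model or framework, so each stand-in can be replaced without changing the flow.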
5759

@@ -82,21 +84,19 @@ _Figure 2: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow
8284

8385
Before running this project, make sure you have the following tools installed:
8486

85-
- [Terraform](https://www.terraform.io/downloads.html)
87+
- [Terraform CLI](https://www.terraform.io/downloads.html)
8688
- [AWS CLI](https://aws.amazon.com/cli/)
8789
- [Python 3.8+](https://www.python.org/downloads/)
88-
- [Docker](https://www.docker.com/) (for agentic deployment)
89-
- [Helm](https://helm.sh/) (for agentic deployment)
90+
- [Docker](https://www.docker.com/) (for agentic application deployment)
91+
- [Helm](https://helm.sh/) (for agentic application deployment)
9092
- [Kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) (for K8s CLI commands)
9193

9294
### Slack Configuration (Required)
9395

94-
#### For Both Deployments:
9596
1. **Slack Webhook** (Alert Manager notifications):
9697
- Create incoming webhook in your Slack workspace
9798
- Note the webhook URL and target channel name
9899

99-
#### For Strands Agentic Deployment Only:
100100
2. **Slack Bot Configuration**:
101101
- Create a Slack app with the following **Bot Token Scopes**:
102102
- `app_mentions:read` - View messages mentioning the bot
@@ -129,7 +129,7 @@ Before running this project, make sure you have the following tools installed:
129129

130130
![Sample Slack Application adding to Channel](static/images/slack_adding_app_to_channel.png)
131131

132-
_Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Channel_
132+
_Figure 5: Guidance for Troubleshooting of Amazon EKS using Agentic AI workflow on AWS - Adding Sample app to Slack Channel_
133133

134134
## Plan your Deployment
135135

@@ -152,7 +152,6 @@ Before running this project, make sure you have the following tools installed:
152152
| [AWS Lambda](https://aws.amazon.com/lambda/) | Optional Service | Provides serverless compute for processing Slack webhooks and handling event-driven troubleshooting workflows when Slack integration is enabled. |
153153

154154

155-
156155
### Cost
157156

158157
You are responsible for the cost of the AWS services used while running this guidance.
@@ -176,7 +175,7 @@ deployment as per the guidance. This **does not** factor any model deployments o
176175
| Elastic Load Balancer | 1 NLB for workloads | $16.46 |
177176
| Amazon VPC | Public IP addresses | $3.65 |
178177
| AWS Key Management Service (KMS) | Keys and requests | $6.00 |
179-
| AWS Bedrock (Claude) | 1M input tokens, 100K output tokens | $25.00 |
178+
| AWS Bedrock (Titan) | 1M input tokens, 100K output tokens | $25.00 |
180179
| Amazon OpenSearch Service | 3 m5.large.search instances | $95.00 |
181180
| Amazon Kinesis Data Streams | 2 shards, 10GB data ingestion | $30.00 |
182181
| Amazon S3 | 500GB storage, 10K requests | $11.50 |
@@ -238,165 +237,15 @@ Workload Ready Cluster. Here are the key security components and considerations:
238237

239238
Please see the detailed [Implementation Guide](https://implementationguides.kits.eventoutfitters.aws.dev/tbst-eks-rag-1017/compute/troubleshooting-amazon-eks-using-rag-based-chatbot.html) for instructions on solution deployment, validation, basic troubleshooting, and uninstallation options.
240239

241-
<!--
242-
243-
### Option 1: Strands-based Agentic AI Workflow Troubleshooting Deployment
244-
245-
The agentic approach uses the AWS Strands Agent framework with EKS MCP server integration for intelligent, real-time troubleshooting.
246-
247-
#### Setup Steps
248-
249-
1. **Set your AWS region:**
250-
```bash
251-
export AWS_REGION="us-east-1" # Change to your preferred region
252-
```
253-
254-
2. **Create ECR repository manually:**
255-
```bash
256-
# Create ECR repository
257-
aws ecr create-repository --repository-name eks-llm-troubleshooting-agentic-agent --region $AWS_REGION
258-
259-
# Get the repository URI
260-
export ECR_REPO_URL=$(aws ecr describe-repositories --repository-names eks-llm-troubleshooting-agentic-agent --region $AWS_REGION --query 'repositories[0].repositoryUri' --output text)
261-
echo "ECR Repository URL: $ECR_REPO_URL"
262-
```
263-
264-
3. **Create S3 vector bucket and index:**
265-
```bash
266-
# Create S3 vector bucket with unique name
267-
export VECTOR_BUCKET="eks-llm-troubleshooting-vector-storage-$(date +%s)"
268-
aws s3vectors create-vector-bucket \
269-
--vector-bucket-name $VECTOR_BUCKET \
270-
--region $AWS_REGION
271-
272-
# Create S3 Vectors index with 1024 dimensions
273-
aws s3vectors create-index \
274-
--vector-bucket-name $VECTOR_BUCKET \
275-
--index-name "k8s-troubleshooting" \
276-
--dimension 1024 \
277-
--data-type float32 \
278-
--distance-metric cosine \
279-
--region $AWS_REGION
280-
281-
echo "Vector bucket: $VECTOR_BUCKET"
282-
echo "Index name: k8s-troubleshooting"
283-
```
284-
285-
4. **Build and push the Docker images for Agents :**
286-
```bash
287-
cd apps/agentic-troubleshooting/
288-
289-
# Login to ECR
290-
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_REPO_URL
291-
292-
# Build and tag the image
293-
docker build --platform linux/amd64 -t $ECR_REPO_URL .
294-
295-
# Push to ECR
296-
docker push $ECR_REPO_URL
297-
```
298-
299-
5. **Configure Terraform variables:**
300-
Create `terraform/terraform.tfvars` file (replace with your actual values):
301-
```hcl
302-
deployment_type = "agentic"
303-
agentic_image_repository = "your-account.dkr.ecr.us-east-1.amazonaws.com/eks-llm-troubleshooting-agentic-agent"
304-
agentic_image_tag = "latest"
305-
slack_webhook_url = "https://hooks.slack.com/services/[YOUR-WEBHOOK]"
306-
slack_channel_name = "alert-manager-alerts"
307-
slack_bot_token = "xoxb-your-bot-token"
308-
slack_app_token = "xapp-your-app-token"
309-
slack_signing_secret = "your-signing-secret"
310-
bedrock_model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
311-
vector_bucket_name = "eks-llm-troubleshooting-vector-storage-1234567890" # Use the bucket created above
312-
vector_index_name = "k8s-troubleshooting"
313-
```
314-
315-
6. **Deploy infrastructure:**
316-
```bash
317-
cd terraform/
318-
terraform init
319-
terraform apply -auto-approve
320-
```
321-
322-
The agentic deployment will automatically:
323-
- Create IAM roles with EKS MCP permissions
324-
- Set up Pod Identity associations
325-
- Deploy the Helm chart with the troubleshooting agent
326-
- Configure Slack integration
327-
328-
329-
## Key Features
330-
331-
### Strands-based Agentic AI workflow Troubleshooting
332-
- Multi-agent orchestration with EKS MCP integration
333-
- S3 Vectors storage for tribal knowledge
334-
- Slack bot integration with Pod Identity security
335-
- Real-time cluster monitoring and troubleshooting
336-
337-
## Configuration
338-
339-
### Terraform Variables
340-
- **deployment_type**: `"agentic"` (default) or `"rag"`
341-
- **name**: Project name (default: `"eks-llm-troubleshooting"`)
342-
- **slack_webhook_url**: Slack webhook for alerts (both deployments)
343-
- **slack_channel_name**: Slack channel name (both deployments)
344-
- **agentic_image_repository**: ECR repository for agent image (Agentic only)
345-
- **slack_bot_token**: Slack bot token (Agentic only)
346-
- **bedrock_model_id**: Bedrock model identifier (Agentic only)
347-
- **vector_bucket_name**: S3 vector bucket name (Agentic only)
348-
349-
## Testing
350-
351-
352-
### Strands Agentic AI Workflow
353-
See [Demo EKS Troubleshooting Script](/demo/demo-script.md) for complete testing instructions and example scenarios.
354-
355-
<TODO> Add instructions for testing Slack based ChatOps scenario with Agentic AI workflow.
356-
357-
## Cleanup
358-
359-
1. **Destroy infrastructure:**
360-
```bash
361-
cd terraform/
362-
terraform destroy --auto-approve
363-
```
364-
365-
2. **Clean up additional resources** (Agentic only):
366-
```bash
367-
# Delete ECR repository
368-
aws ecr delete-repository --repository-name eks-llm-troubleshooting-agentic-agent --force --region $AWS_REGION
369-
370-
# Delete S3 vector bucket (if created)
371-
aws s3vectors delete-index --vector-bucket-name $VECTOR_BUCKET --index-name k8s-troubleshooting --region $AWS_REGION
372-
aws s3vectors delete-vector-bucket --vector-bucket-name $VECTOR_BUCKET --region $AWS_REGION
373-
```
374-
375-
## Architecture
376-
377-
### Strands-based Agentic Architecture
378-
- Multi-agent system with EKS MCP integration
379-
- S3 Vectors for knowledge storage
380-
- Slack bot with Pod Identity security
381-
382-
## Troubleshooting
383-
384-
### Common Issues
385-
- **Access Denied (Bedrock)**: Ensure your AWS account has access to the specified Bedrock model
386-
- **Image Pull Errors**: Verify ECR repository exists and credentials are correct
387-
- **Slack Integration**: Check bot tokens and permissions
388-
- **Pod Identity**: Ensure EKS Pod Identity Agent is enabled
389-
390-
-->
391-
392-
## Acknowledgments
240+
## References
393241

394242
This project uses:
395243

396244
- [Terraform AWS EKS Blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) for infrastructure
397245
- [AWS Strands Agent Framework](https://github.com/aws/strands) for multi-agent orchestration (Agentic deployment)
398246
- [EKS MCP Server](https://github.com/aws/eks-mcp-server) for Kubernetes integration via Model Context Protocol (Agentic deployment)
399-
- [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) for semantic vector matching and solution content validation
247+
- [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html) model hosting for semantic vector matching and solution content validation
248+
- [Slack](https://slack.com/) communications platform
400249

401250
## Security
402251
