## Contributing

We welcome contributions from the community! See our [How to Contribute](./how-to-contribute.md) for detailed setup instructions, development workflow, and pull request process, and [Repository Conventions](./repository-conventions.md) for content and file organization guidelines.

---
This final post covers how I used an AI “incident commander” to delegate, resolve, and document a complex, multi-faceted outage, bringing the whole journey to a powerful conclusion.
## My Strategy: The AI-Powered Response Team

With the crisis established in the introduction, my approach was to fully leverage the AI-powered system I had developed throughout this course. My role shifted from a hands-on engineer to that of a strategic commander, directing a purpose-built AI team with the following structure:
- **_Qwen as the Orchestration Engine_**: At the center of the operation, Qwen was responsible for interpreting my high-level commands and delegating the tactical execution to the appropriate specialist.
- **_Plane as the System of Record_**: Integrated via MCP, Plane provided real-time visibility into the active incidents and served as the platform for our automated resolution updates.
- **_The Expert Subagents_**: The core of the response team were the two specialists we built and validated on Day 4:
  - **K8s**: The kubernetes-specialist, tasked with methodically diagnosing the CrashLoopBackOff errors and restoring service.
  - **TF**: The cloud-architect, responsible for identifying the source of the Terraform drift and reconciling our production state.
This structure allowed me to manage the incident strategically, focusing on the resolution path rather than getting lost in the tactical details of any single issue.
## Step 1: Assembling the Crisis Team
Before diving into the production fire, the first step was to ensure my AI team was online and ready. I ran a quick check to list the installed agents and verify the connection to our Plane ticketing system.
```shell
# Verify available agents
qwen --prompt "List installed agents" 2>/dev/null

# Test Plane MCP Integration
qwen -y 2>/dev/null
# Then, at the interactive prompt:
How many open issues do I have in my plane instance?
```
The system confirmed three agents were active (cloud-architect, kubernetes-specialist, and general-purpose) and that it was connected to Plane, immediately reporting four open issues. The crisis team was ready for its assignments.
## Step 2: Addressing the Kubernetes Outage
The most critical issue was the application downtime caused by the crashing pods. I navigated to the relevant directory and delegated the problem to our Kubernetes expert.
```shell
cd /root/k8s-incident
qwen -y 2>/dev/null
```

**Prompt to the Kubernetes Specialist:**

```
Use the kubernetes-specialist agent to investigate and resolve pod failures.
- Analyze the pods in the default namespace for CrashLoopBackOff issues, resource constraints, and configuration errors. Pod manifest files can be found at '/root/k8s-incident'.
```

The agent immediately began its investigation, reading the YAML manifests for all three pods.
It quickly diagnosed the root cause:
- kk-pod1 had an invalid command causing it to crash,
- kk-pod2 and kk-pod3 had resource or configuration issues preventing them from running properly.

The agent formulated a plan to fix all three.
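The agent's triage loop is worth sketching in plain shell. Since no live cluster is available here, the `kubectl get pods` output below is mocked from the symptoms described above (the exact statuses are assumptions); on a real cluster the commented `kubectl` commands would drive the same diagnosis:

```shell
# Mock of `kubectl get pods` output, reconstructed from the incident (statuses assumed)
pods='NAME      READY   STATUS             RESTARTS
kk-pod1   0/1     CrashLoopBackOff   7
kk-pod2   0/1     CrashLoopBackOff   5
kk-pod3   0/1     Pending            0'

# Flag every pod that is not Running -- these need investigation
echo "$pods" | awk 'NR > 1 && $3 != "Running" {print $1, "->", $3}'

# On a live cluster, the follow-up for each failing pod would be:
#   kubectl describe pod kk-pod1        # events: bad command, OOMKilled, failing probes
#   kubectl logs kk-pod1 --previous     # output of the last crashed container
```

The key habit the specialist codifies is checking the *previous* container's logs, since a pod in CrashLoopBackOff has usually already restarted past the evidence.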
After applying the fixes, the agent re-checked the pod status. Success!!! All pods were now in a stable running state, and the application was back online. The agent then completed its final task: documenting the entire process in a detailed incident report.
## Step 3: Resolving Infrastructure Drift
With the application back online, I assigned the next issue to the cloud-architect agent: reconciling the infrastructure drift.
_First, What is Infrastructure Drift?_
Before diving into the fix, it’s important to clarify what “infrastructure drift” actually means. In the world of Infrastructure as Code (IaC), your Terraform files are your single source of truth — they represent the desired state of your environment. Infrastructure drift occurs when the actual state of your live production environment no longer matches the state defined in your code.
Think of Terraform code as the official architectural blueprint for a house. Drift is what happens when someone makes a change on-site — like moving a wall or adding a window — without updating the blueprint. The blueprint is now wrong, and any future work based on it is at risk of causing serious problems.
This is precisely the problem our cloud-architect agent was designed to solve: to programmatically detect this drift, report on it, and bring our infrastructure back into alignment with our code.
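The mechanics of detecting drift deserve a quick sketch. With Terraform, `terraform plan -detailed-exitcode` compares desired and live state and encodes the result in its exit status (0 = in sync, 2 = changes pending). In the sketch below the plan call is stubbed with a shell function, since no AWS environment is assumed here:

```shell
# Stub standing in for `terraform plan -detailed-exitcode`; a real run would
# query the cloud provider. Returning 2 simulates detected drift.
terraform_plan() { return 2; }

if terraform_plan; then rc=0; else rc=$?; fi

case $rc in
  0) echo "No drift: live infrastructure matches the code." ;;
  2) echo "Drift detected: review the plan, then run terraform apply to reconcile." ;;
  *) echo "Plan failed: investigate before applying." ;;
esac
```

This exit-code contract is what makes drift detection easy to automate in a cron job or CI pipeline.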
```shell
cd /root/terraform-static-site
qwen -y 2>/dev/null
```

**Prompt to the Cloud Architect:**

```
Use the cloud-architect agent to:
Save this RCA report as /root/terraform-static-site/terraform-drift-rca.md.
```
This is a classic example of dangerous drift — our infrastructure was in a state where it could not properly serve error pages to users, and our code was blind to the problem.
The agent’s solution was simple and direct: it ran terraform apply to create the missing error.html object, instantly bringing our live infrastructure back into alignment with our code.
## Step 4: Closing the Loop with Automated Documentation
With both incidents fully resolved, it was time for the final, and often forgotten, step of any incident: closing the loop. Manually writing ticket updates and post-mortem summaries is tedious, error-prone, and often gets skipped in the rush to move on.
This is where the general-purpose agent shines. I gave it one final task:
**Prompt for Final Reporting:**

```
Update our Plane tickets with concise, professional comments.
Summarize the Kubernetes resolution from /root/k8s-incident/incident-report.md
and the Terraform drift resolution from /root/terraform-static-site/terraform-drift-rca.md.
After updating the tickets, concatenate both RCA markdown files into a single, comprehensive executive-summary.md file for the CTO.
```
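The concatenation step at the end of that prompt is simple enough to sketch directly. The demo below uses stand-in report files in a temp directory; the real inputs are the two RCA files named in the prompt:

```shell
# Stand-in RCA files (the real ones live under /root/k8s-incident
# and /root/terraform-static-site)
dir=$(mktemp -d)
echo '## Kubernetes Incident: all pods restored'      > "$dir/incident-report.md"
echo '## Terraform Drift: state reconciled'           > "$dir/terraform-drift-rca.md"

# Concatenate both RCAs under a single executive summary heading
{
  echo '# Executive Summary'
  cat "$dir/incident-report.md"
  cat "$dir/terraform-drift-rca.md"
} > "$dir/executive-summary.md"

grep -c '^#' "$dir/executive-summary.md"   # three headings: summary + two RCA sections
```

The value of delegating this to the agent is not the `cat` itself but the summarization and ticket updates that surround it.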
## Final Thoughts: A New Operating Model for DevOps
This five-day journey through the KodeKloud AI course has fundamentally reshaped my perspective on managing complex cloud environments. I began this blog series exploring AI as a clever assistant. I’m ending it with the conviction that AI is the platform on which we will build the next generation of resilient, automated, and self-healing systems.
This series charted a clear path of that evolution. I progressed from using AI for smarter diagnostics (Day 1) and organizing documentation with RAG (Day 2), to integrating it with live cloud services for security audits (Day 3). From there, I learned to build a scalable team of specialized AI agents (Day 4), which all culminated in the final capstone: leading an AI-powered incident response.
The capstone lab was the ultimate proof of this new model.
The true impact here isn’t just about speed; it’s a move away from a reliance on siloed human expertise for incident response. By codifying knowledge into autonomous, reusable agents, we create a system where best practices are applied consistently, and every resolution makes the entire system more reliable.
This journey has made one thing crystal clear: _the future of DevOps is not about simply using AI. It’s about building with it._
---
The challenge was a classic DevOps bottleneck: a bloated 2GB Docker image was driving up ECR storage costs, and the security team had flagged critical vulnerabilities in our Terraform code. Our human team was swamped. Could we build an AI team to handle it?
The answer is _Qwen Subagents_.

## What are Qwen Subagents?
The analogy of a “virtual DevOps team” is spot on. Subagents are specialized, independent AI assistants that you can create to handle specific tasks.
_But what exactly is a subagent? Is it an app, a container, or something else?_
In the Qwen framework, a subagent is fundamentally a simple text file — specifically, a Markdown file (.md) with a special configuration header. This file does two things:
This means we are not building a complex application or spinning up a new container. Instead, we define specialists in plain Markdown:

- Docker-optimizer: An expert in shrinking container images and applying security best practices.
- Terraform Security: A specialist that scans Infrastructure-as-Code for vulnerabilities and suggests fixes.

Each of these agents operates with its own isolated context, just like a real team member focusing on their specific job without getting distracted.
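For a concrete picture, a subagent file is just front matter plus instructions. The sketch below is illustrative only: the exact front-matter keys are defined by Qwen Code, and the body text is my paraphrase, not the lab's actual file.

```markdown
---
name: docker-optimizer
description: Expert in shrinking container images and applying ECR security best practices.
---

You are a Docker optimization specialist. When asked to optimize a Dockerfile:
1. Propose a multi-stage build on a slim base image.
2. Remove build-time dependencies and unnecessary layers from the final stage.
3. Add a non-root user and flag missing security hardening.
4. Estimate the size reduction and write a report when one is requested.
```

The front matter is what lets the orchestrator decide when to route a task to this agent; the body becomes the agent's system prompt.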
## Task 1: Assembling the Virtual DevOps Team
The first step was to “hire” my new team members by installing their agent files. A simple setup script handled the installation:
```shell
bash /root/setup-agents.sh
```
This copied the docker-optimizer.md and terraform-security.md agent files into the ~/.qwen/agents directory.
The script successfully installed both the Docker Optimizer and Terraform Security agents. To verify my new team was ready, I navigated to the production issues directory and ran /agents manage in Qwen:
Looking at the docker-optimizer agent file revealed how simple yet powerful these files are.
This subagent is configured to be an expert in Docker optimization for ECR deployment, with a focus on reducing image sizes and implementing security best practices.
> **How to Use Subagents**
> The most impressive part is that Qwen handles the delegation automatically. You don’t need to explicitly call an agent. You just describe the problem, and Qwen intelligently selects the right specialist for the job. For example, a prompt about Docker optimization triggers the docker-optimizer, while a prompt about Terraform security routes to the terraform-security expert.

## First Challenge: Optimizing a Bloated Docker Image
Our first production issue was a massive 2GB Docker image that was costing us significantly on AWS ECR storage (the lab estimated this single image was costing $150/month) and slowing down deployments. I tasked the docker-optimizer with fixing it using a simple prompt in Qwen:
```
"Use the docker-optimizer agent to analyze and
optimize the Dockerfile in /root/production-issues/bad-docker/ for pushing
to ECR. The current image is 2GB and we need to reduce it significantly.
Save your optimization report to /root/production-issues/bad-docker/docker-optimization-report.md"
```
The agent immediately went to work. It read the existing Dockerfile, analyzed the structure, generated an optimized version, and prepared to create a comprehensive report. The execution summary showed 2 tool uses and took about 8 seconds to complete the analysis.
The generated report was impressive: the agent predicted a dramatic size reduction.
The new Dockerfile used a multi-stage build, switched to a slim Python base image, copied only necessary dependencies, created a non-root user for security, and optimized the layer structure. These are all industry best practices for production Docker images.
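Those practices are easiest to see in a concrete Dockerfile. The one below is my reconstruction of the pattern, not the agent's actual Dockerfile.optimized; the app and file names are assumptions.

```dockerfile
# Stage 1: install dependencies in a throwaway build layer
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: the runtime image carries only what the app needs
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY app/ ./app/

# Run as a non-root user for security
RUN useradd --create-home appuser
USER appuser

CMD ["python", "-m", "app"]
```

The size win comes from the second `FROM`: compilers, caches, and build tooling stay behind in the builder stage and never reach ECR.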
## Verifying the Fix: Building the New Image
With the docker-optimizer agent's work complete, it was time for the moment of truth. I built the new, optimized Docker image using the agent-generated Dockerfile.optimized.
The impact of the optimization was immediately clear: the image size plummeted from 2GB to 531MB. A reduction of roughly 75% has a direct, positive effect on the bottom line by cutting ECR storage costs and making our deployment pipeline significantly faster.
## Second Challenge: Securing the Infrastructure
With the Docker image optimized and the deployment pipeline faster, my virtual team’s next assignment was to address the security vulnerabilities flagged in our Terraform code. This is where the terraform-security agent, our Infrastructure-as-Code specialist, stepped in.
```
"Use the terraform-security agent to scan /root/production-issues/bad-terraform/
for security violations and ECR misconfigurations.
We need to ensure our infrastructure is secure before deployment.
Save your security scan report to /root/production-issues/bad-terraform/terraform-security-report.md"
```
The agent performed a comprehensive static analysis of the code, a practice often called “shifting left” because it moves security checks to the earliest stages of development. In about 90 seconds, it identified 20 security violations across 6 resources. The issues ranged from critical problems, like overly permissive security group rules, to high-risk misconfigurations, such as S3 buckets without encryption and ECR repositories with image scanning disabled. The agent’s detailed report included specific remediation steps for each vulnerability, providing a clear path to a more secure infrastructure.
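Two of the flagged categories are easy to illustrate in Terraform. The resources below are hypothetical reconstructions of the kind of misconfiguration the report described, not the lab's actual code:

```hcl
# Flagged: ECR repository with image scanning disabled
resource "aws_ecr_repository" "app" {
  name = "app"

  image_scanning_configuration {
    scan_on_push = false # fix: set to true so every pushed image is scanned
  }
}

# Flagged: S3 bucket without encryption; the fix is a server-side
# encryption configuration attached to the bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "site" {
  bucket = aws_s3_bucket.site.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

Static analysis catches these because both are pure configuration mistakes, visible in the code long before anything is deployed.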
## A Virtual Team, A Real-World Impact
Day 4 was a profound lesson in scaling expertise. Instead of being the sole expert trying to master every domain, I learned how to build and delegate to a team of specialized AI assistants. By encapsulating domain knowledge into reusable agents, we can automate the enforcement of best practices and ensure a consistent level of quality and security across all projects.
This hands-on lab demonstrated a clear and immediate impact:
- Enhanced Security: Proactive, automated scanning caught critical vulnerabilities before they could ever reach production.
108
122
- Increased Speed: What would have taken hours of manual analysis and remediation was accomplished in minutes.
109
123
- Reusable Expertise: The docker-optimizer and terraform-security agents are now part of my toolkit, ready to be deployed on any future project.
## Key Learnings from Building an AI Team
- Subagents are Specialists: They excel by focusing on one domain.
- Expertise is Code: Best practices can be codified into simple Markdown files.
- Automation is Delegation: Qwen intelligently routes tasks to the right AI expert, streamlining complex workflows.
- Independent Context is Power: Agents work without interfering, allowing for parallel, focused problem-solving.
This journey has progressed from using AI as a helper to truly orchestrating it as a team. We’ve built specialized agents to proactively improve our systems. Now, it’s time for the ultimate test. The final post in this series will tackle a live production crisis, demonstrating how an AI-powered team performs when the pressure is on.