Commit fbbc04f

Merge pull request #27 from chmodshubham/main
reformat doc structure and add convention guidelines
2 parents a296f9c + 3ff1070 commit fbbc04f

8 files changed

+129
-43
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ NgKore (ngkore.org) is an open-source community driving innovation across Post-Q
44

55
## Contributing
66

7-
We welcome contributions from the community! See our [How to Contribute](./how-to-contribute.md) for detailed setup instructions, development workflow, and pull request process.
7+
We welcome contributions from the community! See our [How to Contribute](./how-to-contribute.md) guide for detailed setup instructions, the development workflow, and the pull request process, and [Repository Conventions](./repository-conventions.md) for content and file organization guidelines.
88

99
## Need Help?
1010

Lines changed: 29 additions & 14 deletions
@@ -1,4 +1,5 @@
1-
# AI DevOps — The AI Incident Commander
1+
# The AI Incident Commander
2+
23
**Author:** [Megha](https://www.linkedin.com/in/megha-7aa3a0203/)
34

45
**Published:** Oct 27, 2025
@@ -9,38 +10,45 @@ The final day of the AI course was designed for exactly that. The scenario was a
910

1011
This final post covers how I used an AI “incident commander” to delegate, resolve, and document a complex, multi-faceted outage, bringing the whole journey to a powerful conclusion.
1112

12-
## **My Strategy: The AI-Powered Response Team**
13+
## My Strategy: The AI-Powered Response Team
14+
1315
With the crisis established in the introduction, my approach was to fully leverage the AI-powered system I had developed throughout this course. My role shifted from a hands-on engineer to that of a strategic commander, directing a purpose-built AI team with the following structure:
1416

15-
- ***Qwen as the Orchestration Engine***: At the center of the operation, Qwen was responsible for interpreting my high-level commands and delegating the tactical execution to the appropriate specialist.
16-
- ***Plane as the System of Record***: Integrated via MCP, Plane provided real-time visibility into the active incidents and served as the platform for our automated resolution updates.
17-
- ***The Expert Subagents***: The core of the response team were the two specialists we built and validated on Day 4:
17+
- **_Qwen as the Orchestration Engine_**: At the center of the operation, Qwen was responsible for interpreting my high-level commands and delegating the tactical execution to the appropriate specialist.
18+
- **_Plane as the System of Record_**: Integrated via MCP, Plane provided real-time visibility into the active incidents and served as the platform for our automated resolution updates.
19+
- **_The Expert Subagents_**: The core of the response team were the two specialists we built and validated on Day 4:
1820
<ul>
1921
<li>K8s: The kubernetes-specialist, tasked with methodically diagnosing the CrashLoopBackOff errors and restoring service.</li>
2022
<li>TF: The cloud-architect, responsible for identifying the source of the Terraform drift and reconciling our production state.</li>
2123
</ul>
2224
This structure allowed me to manage the incident strategically, focusing on the resolution path rather than getting lost in the tactical details of any single issue.
2325

24-
## **Step 1: Assembling the Crisis Team**
26+
## Step 1: Assembling the Crisis Team
27+
2528
Before diving into the production fire, the first step was to ensure my AI team was online and ready. I ran a quick check to list the installed agents and verify the connection to our Plane ticketing system.
29+
2630
```shell
2731
# Verify available agents
2832
qwen --prompt "List installed agents" 2>/dev/null
2933
# Test Plane MCP Integration
3034
qwen -y 2>/dev/null
3135
# Then, at the interactive Qwen prompt, ask:
# How many open issues do I have in my Plane instance?
3236
```
37+
3338
The system confirmed three agents were active (cloud-architect, kubernetes-specialist, and general-purpose) and that it was connected to Plane, immediately reporting four open issues. The crisis team was ready for its assignments.
3439
![4 issues](./images/plane_instance.webp)
3540

36-
## **Step 2: Addressing the Kubernetes Outage**
41+
## Step 2: Addressing the Kubernetes Outage
42+
3743
The most critical issue was the application downtime caused by the crashing pods. I navigated to the relevant directory and delegated the problem to our Kubernetes expert.
3844

3945
```shell
4046
cd /root/k8s-incident
4147
qwen -y 2>/dev/null
4248
```
49+
4350
**Prompt to the Kubernetes Specialist:**
51+
4452
```
4553
Use the kubernetes-specialist agent to investigate and resolve pod failures.
4654
- Analyze the pods in the default namespace for CrashLoopBackOff issues, resource constraints, and configuration errors. Pod manifest files can be found at '/root/k8s-incident'.
@@ -52,25 +60,28 @@ The agent immediately began its investigation, reading the YAML manifests for al
5260

5361
![read 3 pods ](./images/read3pods.webp)
5462
It quickly diagnosed the root cause:
63+
5564
- kk-pod1 had an invalid command causing it to crash,
5665
- kk-pod2 and kk-pod3 had resource or configuration issues preventing them from running properly.
57-
The agent formulated a plan to fix all three.
66+
The agent formulated a plan to fix all three.
5867

5968
![fixed 3 pods ](./images/fix3pods.webp)
6069
After applying the fixes, the agent re-checked the pod status. Success! All pods were now in a stable running state, and the application was back online. The agent then completed its final task: documenting the entire process in a detailed incident report.
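A status re-check of this kind can be sketched in shell. This is a minimal sketch and not the agent's actual commands: the helper name `unhealthy_pods` is hypothetical, and it assumes `kubectl` is already configured for the cluster and that the default `kubectl get pods` column layout applies (STATUS in the third column).

```shell
# Print the names of pods in a namespace whose STATUS is not "Running".
# unhealthy_pods is a hypothetical helper name used only for illustration.
unhealthy_pods() {
  local namespace=${1:-default}
  # --no-headers drops the header row; STATUS is the third column
  # of the default `kubectl get pods` output.
  kubectl get pods -n "$namespace" --no-headers | awk '$3 != "Running" { print $1 }'
}

# Usage: an empty result means every pod reports Running.
# unhealthy_pods default
```

Pods that have finished normally (status `Completed`) would also be listed by this filter, so it suits long-running workloads like the ones in this lab better than batch jobs.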
6170

6271
![fixed 3 pods ](./images/complete.webp)
6372

64-
## **Step 3: Resolving Infrastructure Drift**
73+
## Step 3: Resolving Infrastructure Drift
74+
6575
With the application back online, I assigned the next issue to the cloud-architect agent: reconciling the infrastructure drift.
6676

67-
*First, What is Infrastructure Drift?*
77+
_First, What is Infrastructure Drift?_
6878

6979
Before diving into the fix, it’s important to clarify what “infrastructure drift” actually means. In the world of Infrastructure as Code (IaC), your Terraform files are your single source of truth: they represent the desired state of your environment. Infrastructure drift occurs when the actual state of your live production environment no longer matches the state defined in your code.
7080

7181
Think of Terraform code as the official architectural blueprint for a house. Drift is what happens when someone makes a change on-site — like moving a wall or adding a window — without updating the blueprint. The blueprint is now wrong, and any future work based on it is at risk of causing serious problems.
7282

7383
This is precisely the problem our cloud-architect agent was designed to solve: to programmatically detect this drift, report on it, and bring our infrastructure back into alignment with our code.
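Drift detection of this kind can be approximated by hand with `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 2 when changes are pending, and 1 on error. The sketch below is an assumption about how such a check might look, not the cloud-architect agent's actual logic, and `check_drift` is a hypothetical helper name.

```shell
# Report whether the live infrastructure matches the Terraform code in a directory.
# check_drift is a hypothetical helper name used only for illustration.
check_drift() {
  local dir=$1 status
  # -detailed-exitcode: 0 = empty plan, 1 = error, 2 = changes pending
  terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >/dev/null 2>&1
  status=$?
  case $status in
    0) echo "in sync" ;;
    2) echo "drift or unapplied changes" ;;
    *) echo "plan failed (exit $status)" ;;
  esac
}

# Usage:
# check_drift /root/terraform-static-site
```

Exit code 2 only says the plan is non-empty. It cannot tell an out-of-band change to the live environment apart from a code change that was never applied, so the plan output still needs to be read before running `terraform apply`.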
84+
7485
```shell
7586
cd /root/terraform-static-site
7687
qwen -y 2>/dev/null
@@ -93,30 +104,34 @@ Use the cloud-architect agent to:
93104
Save this RCA report as /root/terraform-static-site/terraform-drift-rca.md.
95106
```
107+
96108
![cloud-architect ](./images/cloud_architect.webp)
97109

98110
This is a classic example of dangerous drift — our infrastructure was in a state where it could not properly serve error pages to users, and our code was blind to the problem.
99111

100112
The agent’s solution was simple and direct: it ran terraform apply to create the missing error.html object, instantly bringing our live infrastructure back into alignment with our code.
101113

102-
## **Step 4: Closing the Loop with Automated Documentation**
114+
## Step 4: Closing the Loop with Automated Documentation
115+
104117

105118
With both incidents fully resolved, it was time for the final, and often forgotten, step of any incident: closing the loop. Manually writing ticket updates and post-mortem summaries is tedious, error-prone, and often gets skipped in the rush to move on.
106119

107120
This is where the general-purpose agent shines. I gave it one final task:
108121

109122
Prompt for Final Reporting:
123+
110124
```
111125
Update our Plane tickets with concise, professional comments.
112-
Summarize the Kubernetes resolution from /root/k8s-incident/incident-report.md
126+
Summarize the Kubernetes resolution from /root/k8s-incident/incident-report.md
113127
and the Terraform drift resolution from /root/terraform-static-site/terraform-drift-rca.md.
114128
After updating the tickets, concatenate both RCA markdown files into a single, comprehensive executive-summary.md file for the CTO.
115129
```
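The concatenation step in that prompt is simple enough to reproduce by hand. The sketch below shows one way it might be done. `build_summary` is a hypothetical helper name, and the horizontal-rule separator is an illustrative choice rather than something the agent is known to produce.

```shell
# Concatenate Markdown reports into one summary file, separated by a horizontal rule.
# build_summary is a hypothetical helper name used only for illustration.
build_summary() {
  local out=$1; shift
  local report first=1
  : > "$out"                      # start from an empty output file
  for report in "$@"; do
    # add a Markdown horizontal rule between reports, but not before the first
    if [ "$first" -eq 0 ]; then
      printf '\n---\n\n' >> "$out"
    fi
    cat "$report" >> "$out"
    first=0
  done
}

# Usage:
# build_summary executive-summary.md \
#   /root/k8s-incident/incident-report.md \
#   /root/terraform-static-site/terraform-drift-rca.md
```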
116130

117131
![executive-summary](./images/executive_summary.webp)
118132

119-
## **Final Thoughts: A New Operating Model for DevOps**
133+
## Final Thoughts: A New Operating Model for DevOps
134+
120135
This five-day journey through the KodeKloud AI course has fundamentally reshaped my perspective on managing complex cloud environments. I began this blog series exploring AI as a clever assistant. I’m ending it with the conviction that AI is the platform on which we will build the next generation of resilient, automated, and self-healing systems.
121136

122137
This series charted a clear path of that evolution. I progressed from using AI for smarter diagnostics (Day 1) and organizing documentation with RAG (Day 2), to integrating it with live cloud services for security audits (Day 3). From there, I learned to build a scalable team of specialized AI agents (Day 4), which all culminated in the final capstone: leading an AI-powered incident response.
@@ -125,4 +140,4 @@ The capstone lab was the ultimate proof of this new model. A complex production
125140

126141
The true impact here isn’t just about speed; it’s a move away from a reliance on siloed human expertise for incident response. By codifying knowledge into autonomous, reusable agents, we create a system where best practices are applied consistently, and every resolution makes the entire system more reliable.
127142

128-
This journey has made one thing crystal clear: *the future of DevOps is not about simply using AI. It’s about building with it.*
143+
This journey has made one thing crystal clear: _the future of DevOps is not about simply using AI. It’s about building with it._

ai-ml/devops/Building_a_Virtual_DevOps_Team_with_Qwen_Subagents.md renamed to ai-ml/devops/building-virtual-devops-team-with-qwen-subagents.md

Lines changed: 33 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
1-
# AI Devops - Building a Virtual DevOps Team with Qwen Subagents
1+
# Building a Virtual DevOps Team with Qwen Subagents
2+
23
**Author:** [Megha](https://www.linkedin.com/in/megha-7aa3a0203/)
34

45
**Published:** Oct 25, 2025
@@ -7,12 +8,13 @@ After three intensive days of firefighting and automation, my perspective on AI
78

89
The challenge was a classic DevOps bottleneck: a bloated 2GB Docker image was driving up ECR storage costs, and the security team had flagged critical vulnerabilities in our Terraform code. Our human team was swamped. Could we build an AI team to handle it?
910

10-
The answer is *Qwen Subagents*.​
11+
The answer is _Qwen Subagents_.​
12+
13+
## What are Qwen Subagents?
1114

12-
## **What are Qwen Subagents?**
1315
The analogy of a “virtual DevOps team” is spot on. Subagents are specialized, independent AI assistants that you can create to handle specific tasks.
1416

15-
*But what exactly is a subagent? Is it an app, a container, or something else?*
17+
_But what exactly is a subagent? Is it an app, a container, or something else?_
1618

1719
In the Qwen framework, a subagent is fundamentally a simple text file — specifically, a Markdown file (.md) with a special configuration header. This file does two things:​
1820

@@ -23,14 +25,16 @@ This means we are not building a complex application or spinning up a new contai
2325

2426
- Docker-optimizer: An expert in shrinking container images and applying security best practices.​
2527
- Terraform Security: A specialist that scans Infrastructure-as-Code for vulnerabilities and suggests fixes.
26-
Each of these agents operates with its own isolated context, just like a real team member focusing on their specific job without getting distracted
28+
Each of these agents operates with its own isolated context, just like a real team member focusing on their specific job without getting distracted
29+
30+
## Task 1: Assembling the Virtual DevOps Team
2731

28-
## **Task 1: Assembling the Virtual DevOps Team**
2932
The first step was to “hire” my new team members by installing their agent files. A simple setup script handled the installation:
3033

3134
```shell
3235
bash /root/setup-agents.sh
3336
```
37+
3438
This copied the docker-optimizer.md and terraform-security.md agent files into the ~/.qwen/agents directory.
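Because each agent is just a Markdown file in that directory, the installation can also be checked from the shell. This is a minimal sketch under that assumption, and `check_agents` is a hypothetical helper name, not a Qwen command.

```shell
# Report which expected agent files exist in an agents directory.
# check_agents is a hypothetical helper name used only for illustration.
check_agents() {
  local agents_dir=$1; shift
  local name
  for name in "$@"; do
    if [ -f "$agents_dir/$name.md" ]; then
      echo "ok: $name"
    else
      echo "missing: $name"
    fi
  done
}

# Usage:
# check_agents "$HOME/.qwen/agents" docker-optimizer terraform-security
```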
3539
![installation complete](./images/subagent_setup.webp)
3640
The script successfully installed both the Docker Optimizer and Terraform Security agents. To verify my new team was ready, I navigated to the production issues directory and ran /agents manage in Qwen:
@@ -43,17 +47,19 @@ Looking at the docker-optimizer agent file revealed how simple yet powerful thes
4347
This subagent is configured to be an expert in Docker optimization for ECR deployment, with a focus on reducing image sizes and implementing security best practices.
4448

4549
> **How to Use Subagents**
46-
The most impressive part is that Qwen handles the delegation automatically. You don’t need to explicitly call an agent. You just describe the problem, and Qwen intelligently selects the right specialist for the job. For example, a prompt about Docker optimization triggers the docker-optimizer, while a prompt about Terraform security routes to the terraform-security expert.
50+
> The most impressive part is that Qwen handles the delegation automatically. You don’t need to explicitly call an agent. You just describe the problem, and Qwen intelligently selects the right specialist for the job. For example, a prompt about Docker optimization triggers the docker-optimizer, while a prompt about Terraform security routes to the terraform-security expert.
51+
52+
## First Challenge: Optimizing a Bloated Docker Image
4753

48-
## **First Challenge: Optimizing a Bloated Docker Image**
4954
Our first production issue was a massive 2GB Docker image that was costing us significantly in AWS ECR storage (the lab estimated this single image at $150/month) and slowing down deployments. I tasked the docker-optimizer with fixing it using a simple prompt in Qwen:
5055

5156
```
52-
"Use the docker-optimizer agent to analyze and
53-
optimize the Dockerfile in /root/production-issues/bad-docker/ for pushing
54-
to ECR. The current image is 2GB and we need to reduce it significantly.
57+
"Use the docker-optimizer agent to analyze and
58+
optimize the Dockerfile in /root/production-issues/bad-docker/ for pushing
59+
to ECR. The current image is 2GB and we need to reduce it significantly.
5560
Save your optimization report to /root/production-issues/bad-docker/docker-optimization-report.md"
5661
```
62+
5763
![docker-optimizer execution](./images/dockeroptz_execute.webp)
5864

5965
The agent immediately went to work. It read the existing Dockerfile, analyzed the structure, generated an optimized version, and prepared to create a comprehensive report. The execution summary showed 2 tool uses and took about 8 seconds to complete the analysis.
@@ -64,41 +70,49 @@ The generated report was impressive. The agent predicted a size reduction from a
6470
![optimized Dockerfile report](./images/optimized_docfile.webp)
6571
The new Dockerfile used a multi-stage build, switched to a slim Python base image, copied only necessary dependencies, created a non-root user for security, and optimized the layer structure. These are all industry best practices for production Docker images.​
6672

67-
## **Verifying the Fix: Building the New Image**
73+
## Verifying the Fix: Building the New Image
74+
6875
With the docker-optimizer agent's work complete, it was time for the moment of truth. I built the new, optimized Docker image using the agent-generated Dockerfile.optimized:
6976

7077
```shell
71-
docker build -f /root/production-issues/bad-docker/Dockerfile.optimized
78+
docker build -f /root/production-issues/bad-docker/Dockerfile.optimized \
7279
-t my-app:optimized /root/production-issues/bad-docker/
7481

7582
```
83+
7684
![Docker build process​](./images/docker_build.webp)
7785

7886
The build completed successfully, installing only the necessary dependencies and creating a much leaner image.
7987

8088
When I checked the final image size:
89+
8190
```shell
8291
docker images | grep my-app
8392

8493
```
94+
8595
![Image size comparison​](./images/imgsize_compare.webp)
8696

8797
The impact of the optimization was immediately clear: the image size plummeted from 2GB to 531MB. A reduction of roughly 75% like this has a direct, positive effect on the bottom line by cutting ECR storage costs and making our deployment pipeline significantly faster.
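The size reduction can be checked with a line of arithmetic. The sketch below assumes "2GB" means 2048 MB, since the exact original size is not given, and `percent_reduction` is a hypothetical helper name.

```shell
# Percentage reduction between two sizes given in MB.
# percent_reduction is a hypothetical helper name used only for illustration.
percent_reduction() {
  local before_mb=$1 after_mb=$2
  # awk handles the floating-point arithmetic that plain shell lacks
  awk -v b="$before_mb" -v a="$after_mb" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

# Usage (assumes 2GB = 2048 MB):
# percent_reduction 2048 531   # prints 74.1
```

Under that assumption the reduction works out to about 74%, in line with the rounded figure quoted above.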
8898

89-
## **Second Challenge : Securing the Infrastructure**
99+
## Second Challenge : Securing the Infrastructure
100+
90101
With the Docker image optimized and the deployment pipeline faster, my virtual team’s next assignment was to address the security vulnerabilities flagged in our Terraform code. This is where the terraform-security agent, our Infrastructure-as-Code specialist, stepped in.
102+
91103
```
92104
"Use the terraform-security agent to scan /root/production-issues
93105
/bad-terraform/ for security violations and ECR misconfigurations.
94-
We need to ensure our infrastructure is secure before deployment.
106+
We need to ensure our infrastructure is secure before deployment.
95107
Save your security scan report to /root/production-issues/bad-terraform/terraform-security-report.md"
96108
```
109+
97110
![Terraform security scan​](./images/TR_security_scan.webp)
98111

99112
The agent performed a comprehensive static analysis of the code, a practice often called “shifting left” because it moves security checks to the earliest stages of development. In about 90 seconds, it identified 20 security violations across 6 resources. The issues ranged from critical problems, like overly permissive security group rules, to high-risk misconfigurations, such as S3 buckets without encryption and ECR repositories with image scanning disabled. The agent’s detailed report included specific remediation steps for each vulnerability, providing a clear path to a more secure infrastructure.​
100113

101-
## **A Virtual Team, A Real-World Impact**
114+
## A Virtual Team, A Real-World Impact
115+
102116
Day 4 was a profound lesson in scaling expertise. Instead of being the sole expert trying to master every domain, I learned how to build and delegate to a team of specialized AI assistants. By encapsulating domain knowledge into reusable agents, we can automate the enforcement of best practices and ensure a consistent level of quality and security across all projects.
103117

104118
This hands-on lab demonstrated a clear and immediate impact:
@@ -107,13 +121,12 @@ This hands-on lab demonstrated a clear and immediate impact:
107121
- Enhanced Security: Proactive, automated scanning caught critical vulnerabilities before they could ever reach production.
108122
- Increased Speed: What would have taken hours of manual analysis and remediation was accomplished in minutes.
109123
- Reusable Expertise: The docker-optimizer and terraform-security agents are now part of my toolkit, ready to be deployed on any future project.
110-
## **Key Learnings from Building an AI Team**
124+
125+
## Key Learnings from Building an AI Team
126+
111127
- Subagents are Specialists: They excel by focusing on one domain.
112128
- Expertise is Code: Best practices can be codified into simple Markdown files.
113129
- Automation is Delegation: Qwen intelligently routes tasks to the right AI expert, streamlining complex workflows.
114130
- Independent Context is Power: Agents work without interfering, allowing for parallel, focused problem-solving.
115131

116132
This journey has progressed from using AI as a helper to truly orchestrating it as a team. We’ve built specialized agents to proactively improve our systems. Now, it’s time for the ultimate test. The final post in this series will tackle a live production crisis, demonstrating how an AI-powered team performs when the pressure is on.
117-
118-
119-

ai-ml/devops/index.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,4 +6,6 @@
66
broken-pod-to-actionable-prompts
77
turning-45percent-rag-into-audit-system
88
ai-automation-in-aws-with-mcp
9+
ai-incident-commander
10+
building-virtual-devops-team-with-qwen-subagents
911
```

ai-ml/index.md

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5,11 +5,6 @@
55
66
ai-training-at-scale
77
ai-driven-call-routing
8-
openai-120B-model
9-
harbor-setup
10-
grok2-onprem
11-
k2-think-onprem
12-
aipos
138
openai-120b-model-deployment
149
harbor-setup-for-proxy-mirror
1510
grok2-deployment-via-sglang

ebpf/index.md

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,10 +26,9 @@ data-processing-with-ebpf-in-ssd
2626
telecom/index
2727
```
2828

29-
````{toctree}
29+
```{toctree}
3030
:maxdepth: 1
3131
ebpf-for-gpu-acceleration
3232
building-ebpf-uprobes-for-gpu-monitoring-in-cuda
3333
shared-socket-for-k8s-pods
3434
```
35-
````
