## Contributing

We welcome contributions from the community! See our [How to Contribute](./how-to-contribute.md) for detailed setup instructions, development workflow, and pull request process, and [Repository Conventions](./repository-conventions.md) for content and file organization guidelines.

---
This final post covers how I used an AI “incident commander” to delegate, resolve, and document a complex, multi-faceted outage, bringing the whole journey to a powerful conclusion.
## My Strategy: The AI-Powered Response Team

With the crisis established in the introduction, my approach was to fully leverage the AI-powered system I had developed throughout this course. My role shifted from a hands-on engineer to that of a strategic commander, directing a purpose-built AI team with the following structure:
- **_Qwen as the Orchestration Engine_**: At the center of the operation, Qwen was responsible for interpreting my high-level commands and delegating the tactical execution to the appropriate specialist.
- **_Plane as the System of Record_**: Integrated via MCP, Plane provided real-time visibility into the active incidents and served as the platform for our automated resolution updates.
- **_The Expert Subagents_**: The core of the response team were the two specialists we built and validated on Day 4:
  - **K8s**: The kubernetes-specialist, tasked with methodically diagnosing the CrashLoopBackOff errors and restoring service.
  - **TF**: The cloud-architect, responsible for identifying the source of the Terraform drift and reconciling our production state.
This structure allowed me to manage the incident strategically, focusing on the resolution path rather than getting lost in the tactical details of any single issue.
## Step 1: Assembling the Crisis Team
Before diving into the production fire, the first step was to ensure my AI team was online and ready. I ran a quick check to list the installed agents and verify the connection to our Plane ticketing system.
```shell
# Verify available agents
qwen --prompt "List installed agents" 2>/dev/null

# Test Plane MCP Integration
qwen -y 2>/dev/null
# Then, at the interactive prompt:
How many open issues do I have in my plane instance?
```
The system confirmed three agents were active (cloud-architect, kubernetes-specialist, and general-purpose) and that it was connected to Plane, immediately reporting four open issues. The crisis team was ready for its assignments.
## Step 2: Addressing the Kubernetes Outage
The most critical issue was the application downtime caused by the crashing pods. I navigated to the relevant directory and delegated the problem to our Kubernetes expert.
```shell
cd /root/k8s-incident
qwen -y 2>/dev/null
```

**Prompt to the Kubernetes Specialist:**

```
Use the kubernetes-specialist agent to investigate and resolve pod failures.
- Analyze the pods in the default namespace for CrashLoopBackOff issues, resource constraints, and configuration errors. Pod manifest files can be found at '/root/k8s-incident'.
```

The agent immediately began its investigation, reading the YAML manifests for all three pods.
It quickly diagnosed the root cause:
- kk-pod1 had an invalid command causing it to crash,
- kk-pod2 and kk-pod3 had resource or configuration issues preventing them from running properly.

The agent formulated a plan to fix all three.
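The agent's triage loop is worth sketching in plain shell. Since no live cluster is available here, the `kubectl get pods` output below is mocked from the symptoms described above (the exact statuses are assumptions); on a real cluster the commented `kubectl` commands would drive the same diagnosis:

```shell
# Mock of `kubectl get pods` output, reconstructed from the incident (statuses assumed)
pods='NAME      READY   STATUS             RESTARTS
kk-pod1   0/1     CrashLoopBackOff   7
kk-pod2   0/1     CrashLoopBackOff   5
kk-pod3   0/1     Pending            0'

# Flag every pod that is not Running -- these need investigation
echo "$pods" | awk 'NR > 1 && $3 != "Running" {print $1, "->", $3}'

# On a live cluster, the follow-up for each failing pod would be:
#   kubectl describe pod kk-pod1        # events: bad command, OOMKilled, failing probes
#   kubectl logs kk-pod1 --previous     # output of the last crashed container
```

The key habit the specialist codifies is checking the *previous* container's logs, since a pod in CrashLoopBackOff has usually already restarted past the evidence.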
After applying the fixes, the agent re-checked the pod status. Success!!! All pods were now in a stable running state, and the application was back online. The agent then completed its final task: documenting the entire process in a detailed incident report.
## Step 3: Resolving Infrastructure Drift
With the application back online, I assigned the next issue to the cloud-architect agent: reconciling the infrastructure drift.
_First, What is Infrastructure Drift?_
Before diving into the fix, it’s important to clarify what “infrastructure drift” actually means. In the world of Infrastructure as Code (IaC), your Terraform files are your single source of truth — they represent the desired state of your environment. Infrastructure drift occurs when the actual state of your live production environment no longer matches the state defined in your code.
Think of Terraform code as the official architectural blueprint for a house. Drift is what happens when someone makes a change on-site — like moving a wall or adding a window — without updating the blueprint. The blueprint is now wrong, and any future work based on it is at risk of causing serious problems.
This is precisely the problem our cloud-architect agent was designed to solve: to programmatically detect this drift, report on it, and bring our infrastructure back into alignment with our code.
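The mechanics of detecting drift deserve a quick sketch. With Terraform, `terraform plan -detailed-exitcode` compares desired and live state and encodes the result in its exit status (0 = in sync, 2 = changes pending). In the sketch below the plan call is stubbed with a shell function, since no AWS environment is assumed here:

```shell
# Stub standing in for `terraform plan -detailed-exitcode`; a real run would
# query the cloud provider. Returning 2 simulates detected drift.
terraform_plan() { return 2; }

if terraform_plan; then rc=0; else rc=$?; fi

case $rc in
  0) echo "No drift: live infrastructure matches the code." ;;
  2) echo "Drift detected: review the plan, then run terraform apply to reconcile." ;;
  *) echo "Plan failed: investigate before applying." ;;
esac
```

This exit-code contract is what makes drift detection easy to automate in a cron job or CI pipeline.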
```shell
cd /root/terraform-static-site
qwen -y 2>/dev/null
```

**Prompt to the Cloud Architect:**

```
Use the cloud-architect agent to:
Save this RCA report as /root/terraform-static-site/terraform-drift-rca.md.
```
This is a classic example of dangerous drift — our infrastructure was in a state where it could not properly serve error pages to users, and our code was blind to the problem.
The agent’s solution was simple and direct: it ran terraform apply to create the missing error.html object, instantly bringing our live infrastructure back into alignment with our code.
## Step 4: Closing the Loop with Automated Documentation
With both incidents fully resolved, it was time for the final, and often forgotten, step of any incident: closing the loop. Manually writing ticket updates and post-mortem summaries is tedious, error-prone, and often gets skipped in the rush to move on.
This is where the general-purpose agent shines. I gave it one final task:
**Prompt for Final Reporting:**

```
Update our Plane tickets with concise, professional comments.
Summarize the Kubernetes resolution from /root/k8s-incident/incident-report.md
and the Terraform drift resolution from /root/terraform-static-site/terraform-drift-rca.md.
After updating the tickets, concatenate both RCA markdown files into a single, comprehensive executive-summary.md file for the CTO.
```
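The concatenation step at the end of that prompt is simple enough to sketch directly. The demo below uses stand-in report files in a temp directory; the real inputs are the two RCA files named in the prompt:

```shell
# Stand-in RCA files (the real ones live under /root/k8s-incident
# and /root/terraform-static-site)
dir=$(mktemp -d)
echo '## Kubernetes Incident: all pods restored'      > "$dir/incident-report.md"
echo '## Terraform Drift: state reconciled'           > "$dir/terraform-drift-rca.md"

# Concatenate both RCAs under a single executive summary heading
{
  echo '# Executive Summary'
  cat "$dir/incident-report.md"
  cat "$dir/terraform-drift-rca.md"
} > "$dir/executive-summary.md"

grep -c '^#' "$dir/executive-summary.md"   # three headings: summary + two RCA sections
```

The value of delegating this to the agent is not the `cat` itself but the summarization and ticket updates that surround it.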
## Final Thoughts: A New Operating Model for DevOps
This five-day journey through the KodeKloud AI course has fundamentally reshaped my perspective on managing complex cloud environments. I began this blog series exploring AI as a clever assistant. I’m ending it with the conviction that AI is the platform on which we will build the next generation of resilient, automated, and self-healing systems.
This series charted a clear path of that evolution. I progressed from using AI for smarter diagnostics (Day 1) and organizing documentation with RAG (Day 2), to integrating it with live cloud services for security audits (Day 3). From there, I learned to build a scalable team of specialized AI agents (Day 4), which all culminated in the final capstone: leading an AI-powered incident response.
The capstone lab was the ultimate proof of this new model.
The true impact here isn’t just about speed; it’s a move away from a reliance on siloed human expertise for incident response. By codifying knowledge into autonomous, reusable agents, we create a system where best practices are applied consistently, and every resolution makes the entire system more reliable.
This journey has made one thing crystal clear: _the future of DevOps is not about simply using AI. It’s about building with it._
---
The challenge was a classic DevOps bottleneck: a bloated 2GB Docker image was driving up ECR storage costs, and the security team had flagged critical vulnerabilities in our Terraform code. Our human team was swamped. Could we build an AI team to handle it?
The answer is _Qwen Subagents_.

## What are Qwen Subagents?
The analogy of a “virtual DevOps team” is spot on. Subagents are specialized, independent AI assistants that you can create to handle specific tasks.
_But what exactly is a subagent? Is it an app, a container, or something else?_
In the Qwen framework, a subagent is fundamentally a simple text file — specifically, a Markdown file (.md) with a special configuration header. This file does two things:
This means we are not building a complex application or spinning up a new container. Instead, we define specialists in plain Markdown:

- Docker-optimizer: An expert in shrinking container images and applying security best practices.
- Terraform Security: A specialist that scans Infrastructure-as-Code for vulnerabilities and suggests fixes.

Each of these agents operates with its own isolated context, just like a real team member focusing on their specific job without getting distracted.
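For a concrete picture, a subagent file is just front matter plus instructions. The sketch below is illustrative only: the exact front-matter keys are defined by Qwen Code, and the body text is my paraphrase, not the lab's actual file.

```markdown
---
name: docker-optimizer
description: Expert in shrinking container images and applying ECR security best practices.
---

You are a Docker optimization specialist. When asked to optimize a Dockerfile:
1. Propose a multi-stage build on a slim base image.
2. Remove build-time dependencies and unnecessary layers from the final stage.
3. Add a non-root user and flag missing security hardening.
4. Estimate the size reduction and write a report when one is requested.
```

The front matter is what lets the orchestrator decide when to route a task to this agent; the body becomes the agent's system prompt.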
## Task 1: Assembling the Virtual DevOps Team
The first step was to “hire” my new team members by installing their agent files. A simple setup script handled the installation:
```shell
bash /root/setup-agents.sh
```
This copied the docker-optimizer.md and terraform-security.md agent files into the ~/.qwen/agents directory.
The script successfully installed both the Docker Optimizer and Terraform Security agents. To verify my new team was ready, I navigated to the production issues directory and ran /agents manage in Qwen:
Looking at the docker-optimizer agent file revealed how simple yet powerful these files are.
This subagent is configured to be an expert in Docker optimization for ECR deployment, with a focus on reducing image sizes and implementing security best practices.
> **How to Use Subagents**
> The most impressive part is that Qwen handles the delegation automatically. You don’t need to explicitly call an agent. You just describe the problem, and Qwen intelligently selects the right specialist for the job. For example, a prompt about Docker optimization triggers the docker-optimizer, while a prompt about Terraform security routes to the terraform-security expert.

## First Challenge: Optimizing a Bloated Docker Image
Our first production issue was a massive 2GB Docker image that was costing us significantly on AWS ECR storage (the lab estimated this single image was costing $150/month) and slowing down deployments. I tasked the docker-optimizer with fixing it using a simple prompt in Qwen:
```
"Use the docker-optimizer agent to analyze and
optimize the Dockerfile in /root/production-issues/bad-docker/ for pushing
to ECR. The current image is 2GB and we need to reduce it significantly.
Save your optimization report to /root/production-issues/bad-docker/docker-optimization-report.md"
```
The agent immediately went to work. It read the existing Dockerfile, analyzed the structure, generated an optimized version, and prepared to create a comprehensive report. The execution summary showed 2 tool uses and took about 8 seconds to complete the analysis.
The generated report was impressive: the agent predicted a dramatic size reduction.
The new Dockerfile used a multi-stage build, switched to a slim Python base image, copied only necessary dependencies, created a non-root user for security, and optimized the layer structure. These are all industry best practices for production Docker images.
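Those practices are easiest to see in a concrete Dockerfile. The one below is my reconstruction of the pattern, not the agent's actual Dockerfile.optimized; the app and file names are assumptions.

```dockerfile
# Stage 1: install dependencies in a throwaway build layer
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: the runtime image carries only what the app needs
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY app/ ./app/

# Run as a non-root user for security
RUN useradd --create-home appuser
USER appuser

CMD ["python", "-m", "app"]
```

The size win comes from the second `FROM`: compilers, caches, and build tooling stay behind in the builder stage and never reach ECR.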
## Verifying the Fix: Building the New Image
With the docker-optimizer agent's work complete, it was time for the moment of truth. I built the new, optimized Docker image using the agent-generated Dockerfile.optimized.
The impact of the optimization was immediately clear: the image size plummeted from 2GB to 531MB. A reduction of roughly 75% has a direct, positive effect on the bottom line by cutting ECR storage costs and making our deployment pipeline significantly faster.
## Second Challenge: Securing the Infrastructure
With the Docker image optimized and the deployment pipeline faster, my virtual team’s next assignment was to address the security vulnerabilities flagged in our Terraform code. This is where the terraform-security agent, our Infrastructure-as-Code specialist, stepped in.
```
"Use the terraform-security agent to scan /root/production-issues/bad-terraform/
for security violations and ECR misconfigurations.
We need to ensure our infrastructure is secure before deployment.
Save your security scan report to /root/production-issues/bad-terraform/terraform-security-report.md"
```
The agent performed a comprehensive static analysis of the code, a practice often called “shifting left” because it moves security checks to the earliest stages of development. In about 90 seconds, it identified 20 security violations across 6 resources. The issues ranged from critical problems, like overly permissive security group rules, to high-risk misconfigurations, such as S3 buckets without encryption and ECR repositories with image scanning disabled. The agent’s detailed report included specific remediation steps for each vulnerability, providing a clear path to a more secure infrastructure.
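Two of the flagged categories are easy to illustrate in Terraform. The resources below are hypothetical reconstructions of the kind of misconfiguration the report described, not the lab's actual code:

```hcl
# Flagged: ECR repository with image scanning disabled
resource "aws_ecr_repository" "app" {
  name = "app"

  image_scanning_configuration {
    scan_on_push = false # fix: set to true so every pushed image is scanned
  }
}

# Flagged: S3 bucket without encryption; the fix is a server-side
# encryption configuration attached to the bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "site" {
  bucket = aws_s3_bucket.site.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

Static analysis catches these because both are pure configuration mistakes, visible in the code long before anything is deployed.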
## A Virtual Team, A Real-World Impact
Day 4 was a profound lesson in scaling expertise. Instead of being the sole expert trying to master every domain, I learned how to build and delegate to a team of specialized AI assistants. By encapsulating domain knowledge into reusable agents, we can automate the enforcement of best practices and ensure a consistent level of quality and security across all projects.
This hands-on lab demonstrated a clear and immediate impact:
- Enhanced Security: Proactive, automated scanning caught critical vulnerabilities before they could ever reach production.
108
122
- Increased Speed: What would have taken hours of manual analysis and remediation was accomplished in minutes.
109
123
- Reusable Expertise: The docker-optimizer and terraform-security agents are now part of my toolkit, ready to be deployed on any future project.
## Key Learnings from Building an AI Team
- Subagents are Specialists: They excel by focusing on one domain.
- Expertise is Code: Best practices can be codified into simple Markdown files.
- Automation is Delegation: Qwen intelligently routes tasks to the right AI expert, streamlining complex workflows.
- Independent Context is Power: Agents work without interfering, allowing for parallel, focused problem-solving.
This journey has progressed from using AI as a helper to truly orchestrating it as a team. We’ve built specialized agents to proactively improve our systems. Now, it’s time for the ultimate test. The final post in this series will tackle a live production crisis, demonstrating how an AI-powered team performs when the pressure is on.