
Commit 467f2f3 (0 parents)
Douglas Williams and claude committed

Add IT Runbook Agent RAG system

RAG pipeline that retrieves relevant runbook sections and generates
step-by-step IT incident guidance using sentence-transformers embeddings,
ChromaDB vector search, and Ollama (Llama 3.1 8B). Includes 25 synthetic
runbooks across 10 categories, a Streamlit dashboard, and a retrieval
evaluation pipeline (Recall@5 = 1.0, MRR = 0.95).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

21 files changed, +2307 −0 lines

.github/workflows/ci.yml

Lines changed: 30 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.14"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Generate runbooks
        run: cd src && python generate_runbooks.py

      - name: Build index
        run: cd src && python index_runbooks.py

      - name: Run tests
        run: python -m pytest tests/ -v -m "not ollama"
```

.gitignore

Lines changed: 6 additions & 0 deletions
```
venv/
__pycache__/
*.pyc
.DS_Store
runbooks/
data/
```

CLAUDE.md

Lines changed: 26 additions & 0 deletions
# Runbook Agent — CLAUDE.md

## Tech Stack
Python 3.14, sentence-transformers (all-MiniLM-L6-v2), ChromaDB,
Ollama (Llama 3.1 8B), Streamlit, pytest.

## Structure
    src/       All source modules (run from here with relative paths)
    tests/     pytest tests (conftest.py adds src/ to sys.path)
    runbooks/  Generated .md files (gitignored)
    data/      ChromaDB store + eval JSON (gitignored)

## How to Run
    cd src && python generate_runbooks.py            # Generate 25 runbooks
    cd src && python index_runbooks.py               # Embed + index into ChromaDB
    cd src && python query_engine.py                 # Test RAG query (needs Ollama)
    cd src && python -m streamlit run dashboard.py   # Launch UI
    python -m pytest tests/ -v -m "not ollama"       # Run tests (no Ollama needed)
    make all                                         # runbooks + index + test

## Key Conventions
All scripts run from src/ with relative paths (../runbooks/, ../data/).
constants.py defines all paths, model names, and categories.
Ollama tests are marked @pytest.mark.ollama and skipped in CI.
Index is idempotent — deletes and recreates the collection each run.
CLAUDE.md is in .gitignore (not checked in).

Dockerfile

Lines changed: 14 additions & 0 deletions
```dockerfile
FROM python:3.14-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download the embedding model during build
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

COPY src/ src/
COPY tests/ tests/

CMD ["sh", "-c", "cd src && python generate_runbooks.py && python index_runbooks.py"]
```

Makefile

Lines changed: 24 additions & 0 deletions
```makefile
# Absolute path so the interpreter is still found after recipes cd into src/
PYTHON ?= $(CURDIR)/venv/bin/python

.PHONY: runbooks index test run eval clean all

runbooks:
	cd src && $(PYTHON) generate_runbooks.py

index:
	cd src && $(PYTHON) index_runbooks.py

test:
	$(PYTHON) -m pytest tests/ -v -m "not ollama"

run:
	cd src && $(PYTHON) -m streamlit run dashboard.py

eval:
	cd src && $(PYTHON) generate_eval_questions.py
	cd src && $(PYTHON) evaluate_retrieval.py

clean:
	rm -rf runbooks/ data/

all: runbooks index test
```

NOTES.md

Lines changed: 139 additions & 0 deletions
# IT Runbook Agent — Interview Notes

## 30-Second Pitch

I built a RAG system that helps IT operations staff resolve incidents
faster. It takes 25 enterprise-style runbooks, embeds them by section
using sentence-transformers, stores them in ChromaDB, and uses cosine
similarity to retrieve the most relevant sections for any natural
language question. An Ollama-hosted Llama 3.1 model then generates a
grounded answer citing specific runbook IDs. The whole pipeline is
testable and runs in CI without needing a GPU or LLM server.
## 60-Second Pitch

This is a retrieval-augmented generation system for IT incident
resolution. The core challenge is that help desk teams have dozens of
runbooks, but finding the right section under time pressure is slow and
error-prone.

The pipeline has three stages. First, I generate 25 realistic runbooks
across 10 IT categories — printers, networking, VPN, Active Directory,
and so on. Each runbook is split into semantic sections: symptoms,
resolution steps, escalation criteria. Second, every section is
embedded with all-MiniLM-L6-v2 and stored in ChromaDB with full
metadata. Third, when a user asks a question, it's embedded with the
same model, the top-5 most similar chunks are retrieved, and Llama 3.1
generates an answer constrained to cite only from those chunks.

I built an evaluation pipeline with 53 test questions that measures
Recall@K and MRR without needing Ollama, so retrieval quality is
validated in CI. The Streamlit dashboard shows both the answer and
full retrieval diagnostics.
## Component Walkthrough

**constants.py**
Central definition of all paths, model names, and categories.
Same pattern as a config module — change one file, everything
updates.

**generate_runbooks.py**
25 runbooks with realistic IT content: specific commands, error
codes, escalation paths. Each runbook follows a consistent
markdown structure for reliable parsing.

**index_runbooks.py**
Three functions: load_runbooks reads and parses the markdown,
chunk_by_section splits by ## headers, build_index embeds all
chunks and stores them in ChromaDB. The index is idempotent —
it deletes and recreates the collection every time.
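The section-splitting step can be sketched roughly as follows. The function name comes from the notes above, but the body here is an illustrative reconstruction, not the repo's actual parsing code:

```python
import re

def chunk_by_section(markdown_text: str, runbook_id: str) -> list[dict]:
    """Split a runbook's markdown into one chunk per ## section."""
    chunks = []
    # Split on level-2 headers; the header line stays attached to its body.
    parts = re.split(r"(?m)^## ", markdown_text)
    for part in parts[1:]:  # parts[0] is any preamble before the first ##
        header, _, body = part.partition("\n")
        chunks.append({
            "id": f"{runbook_id}::{header.strip()}",
            "section": header.strip(),
            "text": body.strip(),
            "runbook_id": runbook_id,
        })
    return chunks
```

Each chunk keeps its runbook ID and section title as metadata, which is what lets the generator cite sources later.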
**ollama_client.py**
Thin wrapper around Ollama's /api/chat endpoint. System prompt
enforces grounding: answer only from context, cite runbook IDs,
say clearly when information is insufficient. Temperature 0.0
for deterministic output.

**query_engine.py**
Orchestrates the RAG pipeline. retrieve_chunks embeds the
question and queries ChromaDB. generate_answer formats the
context and calls the LLM. ask is the end-to-end entry point.
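The grounded-prompt assembly that generate_answer performs might look something like this minimal sketch. The helper name build_prompt and the exact prompt wording are illustrative assumptions, not the repo's actual code:

```python
# Illustrative system prompt — the real wording lives in ollama_client.py.
SYSTEM_PROMPT = (
    "Answer ONLY from the provided runbook context. "
    "Cite runbook IDs for every recommendation. "
    "If the context is insufficient, say so explicitly."
)

def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Format retrieved chunks into a grounded chat payload for /api/chat."""
    context = "\n\n".join(
        f"[{c['runbook_id']}] {c['section']}\n{c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Because every chunk is prefixed with its runbook ID, the model can cite sources without any extra bookkeeping.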
**evaluate_retrieval.py**
Measures Recall@K (did the expected runbook appear in the top K?)
and MRR (how high did it rank?). Runs without Ollama, so it
works in CI.
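Both metrics are simple to compute. A dependency-free sketch (function names assumed for illustration, not taken from evaluate_retrieval.py):

```python
def recall_at_k(results: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Fraction of questions whose expected runbook appears in the top-k results."""
    hits = sum(exp in ranked[:k] for ranked, exp in zip(results, expected))
    return hits / len(expected)

def mrr(results: list[list[str]], expected: list[str]) -> float:
    """Mean reciprocal rank of the expected runbook (contributes 0 if absent)."""
    total = 0.0
    for ranked, exp in zip(results, expected):
        if exp in ranked:
            total += 1.0 / (ranked.index(exp) + 1)
    return total / len(expected)
```

Recall@5 = 1.0 means every expected runbook showed up in the top five; MRR = 0.95 means it was almost always ranked first.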
## Technical Decisions

**Why section-based chunking?**
Each section (Symptoms, Resolution Steps) is semantically
coherent. Fixed-size windows would split mid-step, mixing
symptoms with resolution content and hurting retrieval precision.

**Why explicit embeddings instead of ChromaDB's built-in embedding?**
Using sentence-transformers directly makes the embedding step
visible, testable, and explainable. I can show the embedding
dimension (384), verify it in tests, and swap models without
changing the storage layer.

**Why cosine similarity?**
Sentence-transformer models are trained with cosine similarity
as the objective. Using a different distance metric would
misalign with the model's training.

**Why Ollama instead of an API?**
Fully local inference means no API keys, no cost, and no data
leaving the machine. For a portfolio project, this also means
anyone can clone and run it without an API account.

**Why temperature 0.0?**
IT runbook guidance should be deterministic and reproducible.
Creative variation in troubleshooting steps would be harmful.
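For reference, cosine similarity and top-k selection in plain Python (illustrative only; in practice ChromaDB computes this internally when the collection is configured for cosine space, e.g. via `metadata={"hnsw:space": "cosine"}`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k chunks most similar to the query, best first."""
    sims = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
```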
## RAG Explained (for non-technical interviewers)

Imagine you're a librarian. Someone asks a question, and instead of
writing an answer from memory, you first search the library for the
most relevant book passages, then write your answer using only those
passages. That's RAG — Retrieval-Augmented Generation.

The "retrieval" part finds the right runbook sections. The
"generation" part writes a human-readable answer from those sections.
The model is explicitly told not to make things up — it can only use
what was retrieved.
## Potential Follow-Up Questions

**How would you handle runbook updates?**
Re-run the indexing pipeline. It's idempotent — it deletes the old
collection and rebuilds from whatever's in the runbooks folder.
In production, you'd trigger this from a CI/CD pipeline when
runbooks are updated in the repo.

**How would you improve retrieval accuracy?**
Add a cross-encoder re-ranking step. The initial retrieval uses
bi-encoder similarity (fast but approximate). A cross-encoder
scores each candidate against the query jointly (slower but more
accurate). You retrieve the top 20 with the bi-encoder, then
re-rank to the top 5 with the cross-encoder.
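That re-ranking idea can be sketched with the scorer abstracted away. In practice score_fn would be something like a sentence-transformers CrossEncoder's predict call; the helper below is hypothetical and dependency-free:

```python
from typing import Callable

def rerank(
    question: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-rank bi-encoder candidates with a joint (cross-encoder style) scorer."""
    scored = [(score_fn(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]
```

The bi-encoder keeps the candidate pool cheap to produce; the expensive joint scorer only ever sees those 20 candidates.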
**How would you handle multi-turn conversations?**
Add conversation memory to the query engine. Append the last N
exchanges to the context window so the model can reference
previous answers. For retrieval, combine the current question
with conversation context before embedding.

**What if the runbook corpus grows to thousands of documents?**
ChromaDB handles moderate scale well. For tens of thousands of
chunks, consider a dedicated vector database like Weaviate or
Pinecone. Also add metadata filtering (by category) to narrow
the search space before similarity search.

**How do you prevent hallucination?**
Three layers: the system prompt explicitly forbids answering
outside the provided context, temperature is set to 0.0, and
the context chunks include specific runbook IDs so the model
can cite sources. The evaluation pipeline measures whether
retrieved chunks actually match expected runbooks.
