# IT Runbook Agent — Interview Notes

## 30-Second Pitch

I built a RAG system that helps IT operations staff resolve incidents faster. It takes 25 enterprise-style runbooks, embeds them by section using sentence-transformers, stores them in ChromaDB, and uses cosine similarity to retrieve the most relevant sections for any natural-language question. An Ollama-hosted Llama 3.1 model then generates a grounded answer citing specific runbook IDs. The whole pipeline is testable and runs in CI without a GPU or LLM server.

## 60-Second Pitch

This is a retrieval-augmented generation system for IT incident resolution. The core problem: help desk teams have dozens of runbooks, and finding the right section under time pressure is slow and error-prone.

The pipeline has three stages. First, I generate 25 realistic runbooks across 10 IT categories — printers, networking, VPN, Active Directory, and so on. Each runbook is split into semantic sections: symptoms, resolution steps, escalation criteria. Second, every section is embedded with all-MiniLM-L6-v2 and stored in ChromaDB with full metadata. Third, when a user asks a question, it is embedded with the same model, the top-5 most similar chunks are retrieved, and Llama 3.1 generates an answer constrained to cite only those chunks.

I built an evaluation pipeline with 53 test questions that measures Recall@K and MRR without needing Ollama, so retrieval quality is validated in CI. The Streamlit dashboard shows both the answer and full retrieval diagnostics.

## Component Walkthrough

**constants.py**
Central definition of all paths, model names, and categories. Same pattern as a config module — change one file, everything updates.

**generate_runbooks.py**
Generates 25 runbooks with realistic IT content: specific commands, error codes, escalation paths. Each runbook follows a consistent markdown structure for reliable parsing.

**index_runbooks.py**
Three functions: `load_runbooks` reads and parses the markdown, `chunk_by_section` splits on `##` headers, and `build_index` embeds all chunks and stores them in ChromaDB. Indexing is idempotent — it deletes and recreates the collection every time.

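The section-splitting step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_by_section` comes from the notes above, but the exact signature and metadata fields are assumptions.

```python
def chunk_by_section(markdown_text: str, runbook_id: str) -> list[dict]:
    """Split a runbook into one chunk per '## ' section.

    Each chunk carries the metadata stored alongside its embedding:
    the runbook ID and the section title.
    """
    chunks = []
    current_title = None
    current_lines = []

    def flush():
        # Emit the accumulated section, if any, as one chunk.
        if current_title is not None:
            chunks.append({
                "runbook_id": runbook_id,
                "section": current_title,
                "text": "\n".join(current_lines).strip(),
            })

    for line in markdown_text.splitlines():
        if line.startswith("## "):
            flush()
            current_title = line[3:].strip()
            current_lines = []
        elif current_title is not None:
            current_lines.append(line)
    flush()
    return chunks
```

Because each chunk maps one-to-one to a semantic section, retrieval hits come back with a citable runbook ID and section title for free.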
**ollama_client.py**
Thin wrapper around Ollama's `/api/chat` endpoint. The system prompt enforces grounding: answer only from context, cite runbook IDs, and say clearly when information is insufficient. Temperature 0.0 for deterministic output.

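The request body sent to `/api/chat` might look like the sketch below. The prompt wording and model tag are illustrative, not copied from the project; the payload shape follows Ollama's chat API (`model`, `messages`, `stream`, and an `options` block for sampling parameters).

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the provided runbook context. "
    "Cite runbook IDs for every claim. "
    "If the context is insufficient, say so explicitly."
)

def build_chat_payload(question: str, context: str, model: str = "llama3.1") -> dict:
    """Build the JSON body for a POST to Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        "stream": False,                  # one complete response, not a token stream
        "options": {"temperature": 0.0},  # deterministic output
    }
```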
**query_engine.py**
Orchestrates the RAG pipeline. `retrieve_chunks` embeds the question and queries ChromaDB, `generate_answer` formats the context and calls the LLM, and `ask` is the end-to-end entry point.

**evaluate_retrieval.py**
Measures Recall@K (did the expected runbook appear in the top-K?) and MRR (how high did it rank?). Runs without Ollama, so it works in CI.

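Both metrics are simple to state in code. A minimal sketch, assuming each test question has one expected runbook ID and the retriever returns a ranked list of IDs:

```python
def recall_at_k(results: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of questions whose expected runbook appears in the top-k results."""
    hits = sum(exp in ranked[:k] for ranked, exp in zip(results, expected))
    return hits / len(expected)

def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average of 1/rank of the expected runbook; a miss contributes 0."""
    total = 0.0
    for ranked, exp in zip(results, expected):
        if exp in ranked:
            total += 1.0 / (ranked.index(exp) + 1)  # ranks are 1-based
    return total / len(expected)
```

Recall@K answers "did we find it at all?" while MRR rewards putting the right runbook near the top, which matters when an operator only reads the first result.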
## Technical Decisions

**Why section-based chunking?**
Each section (Symptoms, Resolution Steps) is semantically coherent. Fixed-size windows would split mid-step, mixing symptoms with resolution content and hurting retrieval precision.

**Why explicit embeddings instead of ChromaDB's built-in embedding?**
Using sentence-transformers directly makes the embedding step visible, testable, and explainable. I can show the embedding dimension (384), verify it in tests, and swap models without changing the storage layer.

**Why cosine similarity?**
Sentence-transformer models are trained with a cosine-similarity objective, so scoring with a different distance metric would misalign retrieval with the model's training.

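Concretely, cosine similarity is the dot product of two vectors divided by the product of their norms, so it compares direction rather than magnitude:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice the vector store computes this internally; the sketch just makes explicit what "most similar" means in this pipeline.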
**Why Ollama instead of an API?**
Fully local inference means no API keys, no cost, and no data leaving the machine. For a portfolio project, it also means anyone can clone and run it without an API account.

**Why temperature 0.0?**
IT runbook guidance should be deterministic and reproducible. Creative variation in troubleshooting steps would be harmful.

## RAG Explained (for non-technical interviewers)

Imagine you're a librarian. Someone asks a question, and instead of writing an answer from memory, you first search the library for the most relevant book passages, then write your answer using only those passages. That's RAG — Retrieval-Augmented Generation.

The "retrieval" part finds the right runbook sections. The "generation" part writes a human-readable answer from those sections. The model is explicitly told not to make things up — it can use only what was retrieved.

## Potential Follow-Up Questions

**How would you handle runbook updates?**
Re-run the indexing pipeline. It's idempotent — it deletes the old collection and rebuilds from whatever is in the runbooks folder. In production, you'd trigger this from a CI/CD pipeline whenever runbooks change in the repo.

**How would you improve retrieval accuracy?**
Add a cross-encoder re-ranking step. The initial retrieval uses bi-encoder similarity (fast but approximate). A cross-encoder scores each candidate against the query jointly (slower but more accurate). Retrieve the top-20 with the bi-encoder, then re-rank down to the top-5 with the cross-encoder.

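The retrieve-then-rerank pattern can be sketched with the cross-encoder abstracted as a pluggable scoring function. In a real pipeline that callable would wrap something like sentence-transformers' `CrossEncoder.predict`; here the `rerank` helper and its dummy scorer are my own illustrative names.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score each (question, candidate) pair jointly and keep the best top_n.

    `score` stands in for a cross-encoder: unlike the bi-encoder, it sees
    question and candidate together, so it can judge relevance precisely.
    """
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:top_n]
```

The two-stage design keeps latency acceptable: the expensive joint scoring runs on 20 candidates, not the whole corpus.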
**How would you handle multi-turn conversations?**
Add conversation memory to the query engine. Append the last N exchanges to the context window so the model can reference previous answers. For retrieval, combine the current question with conversation context before embedding.

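The retrieval side of that idea is just query expansion. A minimal sketch (function name and history format are assumptions, not the project's code):

```python
def build_retrieval_query(question: str,
                          history: list[tuple[str, str]], n: int = 2) -> str:
    """Prepend the last n (question, answer) exchanges to the current question.

    A follow-up like "what about macOS?" embeds poorly on its own;
    with the prior exchange attached, the combined text carries the topic.
    """
    recent = history[-n:]
    context = " ".join(f"{q} {a}" for q, a in recent)
    return f"{context} {question}".strip()
```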
**What if the runbook corpus grows to thousands of documents?**
ChromaDB handles moderate scale well. For tens of thousands of chunks, consider a dedicated vector database like Weaviate or Pinecone. Also add metadata filtering (by category) to narrow the search space before similarity search.

**How do you prevent hallucination?**
Three layers: the system prompt explicitly forbids answering outside the provided context, temperature is set to 0.0, and the context chunks include specific runbook IDs so the model can cite sources. The evaluation pipeline measures whether retrieved chunks actually match the expected runbooks.