I wanted to explore a systematic approach to transforming naive Retrieval-Augmented Generation (RAG) into production-grade systems.
The main focus was on evaluations: end-to-end evals based on gold answers, and component-specific evals targeting retrieval.
I used the Tesla Model 3 Owner's Manual as a sample PDF (https://www.tesla.com/ownersmanual/model3/en_us/Owners_Manual.pdf), chunked it into a ChromaDB vector store, and built a retrieval/RAG system around it.
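To illustrate the ingestion step, a minimal character-window chunker might look like the sketch below. This is illustrative only: the repo's actual chunker, PDF text extraction, and collection name are not shown here.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character windows."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# The resulting chunks would then be embedded and stored, e.g. with ChromaDB
# (collection name "tesla_manual" is an assumption, not the repo's):
#   collection = chromadb.Client().get_or_create_collection("tesla_manual")
#   collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])
```

Consecutive chunks share `overlap` characters, so a sentence cut at one boundary still appears whole in an adjacent chunk.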
I started by inspecting the manual and curating a handful of gold answers (see evals/end-to-end/end_to_end_eval_questions.json)
for end-to-end evaluation.
Then I built a quick-and-dirty (but functional) RAG pipeline with all components in place: ingestion, retrieval, and answer generation.
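The pipeline's shape can be sketched as follows; `retrieve` and `generate` stand in for the actual ChromaDB query and LLM call, and the names and prompt format are mine, not the repo's real API:

```python
from typing import Callable

def answer_question(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # e.g. wraps collection.query(...)
    generate: Callable[[str], str],             # e.g. wraps an LLM chat completion
    top_k: int = 4,
) -> str:
    """Retrieve the top-k chunks for the question and generate a grounded answer."""
    contexts = retrieve(question, top_k)
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Keeping retrieval and generation behind plain function boundaries like this makes it easy to evaluate each component in isolation later.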
Next, I set up an LLM judge to score the RAG-generated answers against the ground-truth answers.
The judge scores answers using a 3-level rubric (see src/prompts.py):
- Score 0 (REJECT): The answer is factually incorrect, hallucinated, contradicts the expected answer, or is missing critical information.
- Score 1 (REVIEW): The answer is mostly accurate and complete but has minor gaps in grounding or minor omissions from the expected answer.
- Score 2 (ACCEPT): The answer is accurate, complete, and properly grounded in the retrieved context.
Pass criterion: 80% of answers must score 2.
Running the evals against the initial prototype produced an ACCEPT score for only 20% of answers. Rather than blindly adjusting hyperparameters, I evaluated pipeline components individually to keep all moving parts under control and identify which components were bottlenecks.
The RAG system has three main components: chunker, retriever, and answer-generator. I started by evaluating the retriever.
I put together another set of questions (see evals/component-evals/retrieval_eval_questions.json) and implemented
an LLM judge to score the retrieved chunks. This judge uses the following rubric (see src/prompts.py):
- Score 2: The chunk directly answers the question with sufficient detail for a user.
- Score 1: The chunk is on-topic but incomplete, ambiguous, or missing steps/details the user needs.
- Score 0: The chunk is off-topic or unhelpful for answering the question.
I then created 6 different chunking and retrieval strategies (see strategies.json) that vary chunk size,
chunk overlap, and the number of retrieved documents (top-k). The judge scored each strategy and produced Precision@k metrics.
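Precision@k can be derived from the judge scores, for instance by counting a chunk as relevant when it is not off-topic (score >= 1). Whether score-1 chunks count as relevant is an assumption of this sketch, not necessarily the repo's choice:

```python
def precision_at_k(chunk_scores: list[int], relevant_min: int = 1) -> float:
    """Precision@k for one query: fraction of its top-k chunks judged on-topic."""
    return sum(1 for s in chunk_scores if s >= relevant_min) / len(chunk_scores)

def mean_precision_at_k(per_query_scores: list[list[int]]) -> float:
    """Average Precision@k over all eval questions."""
    return sum(precision_at_k(s) for s in per_query_scores) / len(per_query_scores)
```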
Precision@k scores for the different strategies varied between 0.46 and 0.56, indicating that roughly half of retrieved documents were off-topic—a clear opportunity for improvement.
The next step would be to improve the chunking strategy and measure improvements against these eval metrics.
```shell
# Ingest PDFs into ChromaDB for a specific strategy
python scripts/ingest.py --strategy baseline
```

Loads the sample Tesla Owner's Manual, chunks it according to the strategy's settings, and stores embeddings in ChromaDB.
End-to-End Evaluation - Evaluates complete RAG answers against gold answers:

```shell
python evals/end-to-end/run_end_to_end_eval.py --strategy baseline
```

Outputs: pass/fail verdict, score (0-2), and judge reasoning for each question.
Retrieval Evaluation - Evaluates document retrieval quality (Precision@k):

```shell
python evals/component-evals/run_retrieval_eval.py --strategy baseline
```

Outputs: judge scores for each retrieved document and precision metrics.
```shell
# Interactive mode
python scripts/query.py --strategy baseline

# Single query with streaming
python scripts/query.py -q "How do I restart the touchscreen?"
```

Open the Jupyter notebooks to inspect evaluation results:
- notebooks/inspect_end_to_end_eval_results.ipynb - View end-to-end results and generate summary reports
- notebooks/inspect_retrieval_eval_results.ipynb - Analyze retrieval performance metrics