juhapellotsalo/tesla-rag
Tesla RAG System

Overview

I wanted to explore a systematic approach to transforming a naive Retrieval-Augmented Generation (RAG) pipeline into a production-grade system.

The main focus was on evaluations: end-to-end evals based on gold answers, and component-specific evals targeting retrieval quality.

I used the Tesla Model 3 Owner's Manual as a sample PDF (https://www.tesla.com/ownersmanual/model3/en_us/Owners_Manual.pdf), chunked it into a ChromaDB vector store, and built a retrieval/RAG system around it.
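The chunking step can be sketched with a minimal fixed-size character chunker. This is an illustrative sketch only; the chunk_size and overlap values here are hypothetical, not the repo's actual strategy settings, and the real pipeline stores the resulting chunks as embeddings in ChromaDB.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Illustrative sketch only; the values of chunk_size and overlap
    are hypothetical, not the repo's actual strategy settings.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Overlap between consecutive chunks helps avoid cutting an answer-bearing passage in half at a chunk boundary.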

Two-Level Evaluation Framework

End-to-end

I started by inspecting the manual and curating a handful of gold answers (see evals/end-to-end/end_to_end_eval_questions.json) for end-to-end evaluation. Then I built a quick and dirty (but functional) RAG pipeline with all components in place: ingestion, retrieval, answer generation. Next, I set up an LLM judge to score the RAG-generated answers against the ground-truth answers.

The judge scores answers using a 3-level rubric (see src/prompts.py):

  • Score 0 (REJECT): The answer is factually incorrect, hallucinated, contradicts the expected answer, or is missing critical information.
  • Score 1 (REVIEW): The answer is mostly accurate and complete but has minor gaps in grounding or minor omissions from the expected answer.
  • Score 2 (ACCEPT): The answer is accurate, complete, and properly grounded in the retrieved context.

Pass criterion: 80% of answers must score 2.

Running the evals against the initial prototype scored only 20% of answers as ACCEPT. Rather than blindly adjusting hyperparameters, I evaluated pipeline components individually to keep all moving parts under control and to identify which components were bottlenecks.

The RAG system has three main components: chunker, retriever, and answer-generator. I started by evaluating the retriever.

Retrieval Quality

I put together another set of questions (see evals/component-evals/retrieval_eval_questions.json) and implemented an LLM judge to score the retrieved chunks. This judge uses the following rubric (see src/prompts.py):

  • Score 2: The chunk directly answers the question with sufficient detail for a user.
  • Score 1: The chunk is on-topic but incomplete, ambiguous, or missing steps/details the user needs.
  • Score 0: The chunk is off-topic or unhelpful for answering the question.

I then created 6 different chunking and retrieval strategies (see strategies.json) that vary chunk size, chunk overlap, and the number of retrieved documents (top-k). The judge scored each strategy and produced Precision@k metrics.
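As a rough illustration of what such a strategy grid could look like, here is a hypothetical set of strategies varying the three knobs. The names and values below are invented for illustration and are not the actual contents of strategies.json:

```python
# Hypothetical strategy grid; names and values are illustrative only,
# not the repo's actual strategies.json contents.
STRATEGIES = {
    "baseline":         {"chunk_size": 500,  "chunk_overlap": 50,  "top_k": 4},
    "small_chunks":     {"chunk_size": 250,  "chunk_overlap": 50,  "top_k": 4},
    "large_chunks":     {"chunk_size": 1000, "chunk_overlap": 100, "top_k": 4},
    "high_overlap":     {"chunk_size": 500,  "chunk_overlap": 200, "top_k": 4},
    "wide_retrieval":   {"chunk_size": 500,  "chunk_overlap": 50,  "top_k": 8},
    "narrow_retrieval": {"chunk_size": 500,  "chunk_overlap": 50,  "top_k": 2},
}
```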

Precision@k scores for the different strategies ranged from 0.46 to 0.56, indicating that roughly half of the retrieved documents were off-topic, which leaves clear room for improvement.
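A Precision@k computation over the judge's per-chunk scores might look like the sketch below. It assumes a chunk counts as relevant whenever the judge scores it above 0 (i.e., score 0 means off-topic); the repo's exact relevance cutoff may differ.

```python
def precision_at_k(judge_scores: list[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks judged relevant (score > 0).

    Illustrative sketch of Precision@k over the judge's 0/1/2 scores;
    the relevance cutoff (score > 0) is an assumption.
    """
    top_k = judge_scores[:k]
    if not top_k:
        return 0.0
    return sum(1 for score in top_k if score > 0) / len(top_k)
```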

The next step would be to improve the chunking strategy and measure improvements against these eval metrics.

Scripts and Evaluators

Ingest Documents

# Ingest PDFs into ChromaDB for a specific strategy
python scripts/ingest.py --strategy baseline

Loads a sample Tesla Owner's Manual, chunks it according to strategy settings, and stores embeddings in ChromaDB.

Run Evaluations

End-to-End Evaluation - Evaluates complete RAG answers against gold answers:

python evals/end-to-end/run_end_to_end_eval.py --strategy baseline

Outputs: Pass/fail verdict, score (0-2), and judge reasoning for each question.

Retrieval Evaluation - Evaluates document retrieval quality (Precision@k):

python evals/component-evals/run_retrieval_eval.py --strategy baseline

Outputs: Judge scores for each retrieved document and precision metrics.

Interactive Queries

# Interactive mode
python scripts/query.py --strategy baseline

# Single query with streaming
python scripts/query.py -q "How do I restart the touchscreen?"
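Under the hood, the query path retrieves the top-k chunks for a question and feeds them to the answer generator. As a toy stand-in for the ChromaDB vector search (lexical word overlap instead of embedding similarity; purely illustrative, not the repo's retriever):

```python
import re

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question and return the top-k.

    A lexical toy retriever standing in for real embedding similarity;
    illustrative only.
    """
    question_words = set(re.findall(r"\w+", question.lower()))

    def overlap(chunk: str) -> int:
        return len(question_words & set(re.findall(r"\w+", chunk.lower())))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]
```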

Analyze Results

Open the Jupyter notebooks to inspect evaluation results:

  • notebooks/inspect_end_to_end_eval_results.ipynb - View end-to-end results and generate summary reports
  • notebooks/inspect_retrieval_eval_results.ipynb - Analyze retrieval performance metrics

About

How to use evals to improve RAG quality
