I wanted to explore a systematic approach to transforming naive Retrieval-Augmented Generation (RAG) into production-grade systems.
The main focus was on evaluations: end-to-end evals based on gold answers, and component-specific evals targeting retrieval.
I used the Tesla Model 3 Owner's Manual as a sample PDF (https://www.tesla.com/ownersmanual/model3/en_us/Owners_Manual.pdf), chunked it into a ChromaDB vector store, and built a retrieval/RAG system around it.
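To illustrate the ingestion step, a minimal character-window chunker might look like the sketch below. This is illustrative only: the repo's actual chunker, PDF text extraction, and collection name are not shown here.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character windows."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# The resulting chunks would then be embedded and stored, e.g. with ChromaDB
# (collection name "tesla_manual" is an assumption, not the repo's):
#   collection = chromadb.Client().get_or_create_collection("tesla_manual")
#   collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])
```

Consecutive chunks share `overlap` characters, so a sentence cut at one boundary still appears whole in an adjacent chunk.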
I started by inspecting the manual and curating a handful of gold answers (see evals/end-to-end/end_to_end_eval_questions.json)
for end-to-end evaluation.
Then I built a quick-and-dirty (but functional) RAG pipeline with all components in place: ingestion, retrieval, and answer generation.
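The pipeline's shape can be sketched as follows; `retrieve` and `generate` stand in for the actual ChromaDB query and LLM call, and the names and prompt format are mine, not the repo's real API:

```python
from typing import Callable

def answer_question(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # e.g. wraps collection.query(...)
    generate: Callable[[str], str],             # e.g. wraps an LLM chat completion
    top_k: int = 4,
) -> str:
    """Retrieve the top-k chunks for the question and generate a grounded answer."""
    contexts = retrieve(question, top_k)
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Keeping retrieval and generation behind plain function boundaries like this makes it easy to evaluate each component in isolation later.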
Next, I set up an LLM judge to score the RAG-generated answers against the ground-truth answers.
The judge scores answers using a 3-level rubric (see src/prompts.py):
- Score 0 (REJECT): The answer is factually incorrect, hallucinated, contradicts the expected answer, or is missing critical information.
- Score 1 (REVIEW): The answer is mostly accurate and complete but has minor gaps in grounding or minor omissions from the expected answer.
- Score 2 (ACCEPT): The answer is accurate, complete, and properly grounded in the retrieved context.
Pass criterion: 80% of answers must score 2.
Running the evals against the initial prototype produced an ACCEPT score for only 20% of answers. Rather than blindly adjusting hyperparameters, I evaluated pipeline components individually to keep all moving parts under control and identify which components were bottlenecks.
The RAG system has three main components: chunker, retriever, and answer-generator. I started by evaluating the retriever.
I put together another set of questions (see evals/component-evals/retrieval_eval_questions.json) and implemented
an LLM judge to score the retrieved chunks. This judge uses the following rubric (see src/prompts.py):
- Score 2: The chunk directly answers the question with sufficient detail for a user.
- Score 1: The chunk is on-topic but incomplete, ambiguous, or missing steps/details the user needs.
- Score 0: The chunk is off-topic or unhelpful for answering the question.
I then created 6 different chunking and retrieval strategies (see strategies.json) that vary chunk size,
chunk overlap, and the number of retrieved documents (top-k). The judge scored each strategy and produced Precision@k metrics.
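Precision@k can be derived from the judge scores, for instance by counting a chunk as relevant when it is not off-topic (score >= 1). Whether score-1 chunks count as relevant is an assumption of this sketch, not necessarily the repo's choice:

```python
def precision_at_k(chunk_scores: list[int], relevant_min: int = 1) -> float:
    """Precision@k for one query: fraction of its top-k chunks judged on-topic."""
    return sum(1 for s in chunk_scores if s >= relevant_min) / len(chunk_scores)

def mean_precision_at_k(per_query_scores: list[list[int]]) -> float:
    """Average Precision@k over all eval questions."""
    return sum(precision_at_k(s) for s in per_query_scores) / len(per_query_scores)
```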
Precision@k scores for the different strategies varied between 0.46 and 0.56, indicating that roughly half of retrieved documents were off-topic—a clear opportunity for improvement.
The next step would be to improve the chunking strategy and measure improvements against these eval metrics.
```shell
# Ingest PDFs into ChromaDB for a specific strategy
python scripts/ingest.py --strategy baseline
```

Loads the sample Tesla Owner's Manual, chunks it according to the strategy's settings, and stores embeddings in ChromaDB.
End-to-End Evaluation - Evaluates complete RAG answers against gold answers:

```shell
python evals/end-to-end/run_end_to_end_eval.py --strategy baseline
```

Outputs: pass/fail verdict, score (0-2), and judge reasoning for each question.
Retrieval Evaluation - Evaluates document retrieval quality (Precision@k):

```shell
python evals/component-evals/run_retrieval_eval.py --strategy baseline
```

Outputs: judge scores for each retrieved document and precision metrics.
```shell
# Interactive mode
python scripts/query.py --strategy baseline

# Single query with streaming
python scripts/query.py -q "How do I restart the touchscreen?"
```

Open the Jupyter notebooks to inspect evaluation results:
- notebooks/inspect_end_to_end_eval_results.ipynb - View end-to-end results and generate summary reports
- notebooks/inspect_retrieval_eval_results.ipynb - Analyze retrieval performance metrics