This repository showcases a Retrieval-Augmented Generation (RAG) system for interacting with documentation: natural language queries are used to retrieve and summarize the relevant information.
(See `interactive-demo.webm` in the repository for a demo of the interactive query interface.)
- Creates a Qdrant vector database for embeddings from the given CSV file(s)
- The vector database is used for fast similarity search to find relevant documentation
- We use a CSV based on Hugging Face documentation as an example
- Uses OpenAI's embeddings for similarity search and GPT models for high-quality responses
- Provides an interactive interface for querying the documentation using natural language
- Each query retrieves the most relevant documentation snippets for context (see the sketch after this list)
- Answers include source links for reference
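Conceptually, each query is embedded, the closest documentation chunks are fetched from Qdrant, and the chunks plus their source links are handed to the chat model. The sketch below illustrates that flow under stated assumptions: the collection name, payload fields, and prompt wording are illustrative rather than the repository's exact implementation, which lives in `src/rag_doctor/query.py`.

```python
# Illustrative retrieve-then-generate flow; collection name and payload fields are assumptions.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()                   # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(path="qdrant_data")  # assumed location of the local vector database

def answer(question: str, top_k: int = 5) -> str:
    # 1) Embed the question with the same model used to build the database
    embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding

    # 2) Similarity search for the most relevant documentation chunks
    hits = qdrant.search(collection_name="documentation", query_vector=embedding, limit=top_k)

    # 3) Ask the chat model to answer using the chunks and cite their source links
    context = "\n\n".join(f"{hit.payload['content']}\nSource: {hit.payload['link']}" for hit in hits)
    chat = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided documentation and cite the sources."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content
```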
- Valohai account to run the pipelines
- OpenAI account to use their APIs
- Less than $5 in OpenAI credits
If you can't find this project in your Valohai Templates, you can set it up manually:
- Create a new project on Valohai
- Set the project repository to: https://github.com/valohai/rag-doc-example
- Save the settings and click "Fetch Repository"
- Create an OpenAI API key for this project
  - We will need the API key in the next step, so take note of it
- Assign the API key to this project as the `OPENAI_API_KEY` environment variable

You will see "✅ Changes to OPENAI_API_KEY saved" if everything went correctly.
And now you are ready to run the pipelines!
- Navigate to the "Pipelines" tab
- Click the "Create Pipeline" button
- Select the "assistant-pipeline" pipeline template
- Click the "Create pipeline from template" button
- Feel free to look around and finally click the "Create pipeline" button
This will start the pipeline:

Feel free to explore while it runs.
When it finishes, the last step will contain qualitative results to review:

This manual evaluation is a simplified way of validating the quality of the generated
responses. "LLM evals" is a large topic that falls outside the scope of this particular example.
Now you have a mini-pipeline that maintains a RAG vector database and allows you to ask questions about the documentation. You can ask your own questions by creating new executions based on the "do-query" step.
The repository also contains a pipeline, "assistant-pipeline-with-deployment", which deploys the RAG system to an HTTP endpoint after human validation in the "manual-evaluation" pipeline step.
🤩 Show Me!
- Create a Valohai Deployment to specify where the HTTP endpoint should be hosted
  - You can use Valohai Public Cloud and valohai.cloud as the target when trying this out. Make sure to name the deployment `public`.
- Create a pipeline as we did before, but use the "assistant-pipeline-with-deployment" template
- The pipeline will halt in a "⏳️ Pending Approval" state, where you can click the "Approve" button to proceed
- After approval, the pipeline will build and deploy the endpoint
- You can use the "Test Deployment" button to run test queries against the endpoint
This example uses OpenAI for both the embedding and query models.
Either could be changed to a different provider or a local model.
🤩 Show Me!
Changing models inside the OpenAI ecosystem is a matter of changing constants in
`src/rag_doctor/consts.py`:

```python
EMBEDDING_MODEL = "text-embedding-ada-002"
EMBEDDING_LENGTH = 1_536  # the dimensions of a "text-embedding-ada-002" embedding vector

PROMPT_MODEL = "gpt-4o-mini"
PROMPT_MAX_TOKENS = 128_000  # model "context window" from https://platform.openai.com/docs/models
```

Further modifying the chat model involves reimplementing the query logic in
`src/rag_doctor/query.py`.

Similarly, modifying the embedding model is a matter of reimplementing the embedding logic in both
`src/rag_doctor/database.py` and `src/rag_doctor/query.py`.
If you decide to change the embedding model, remember to recreate the vector database.
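As an illustration of what reimplementing the embedding logic could look like, here is a hedged sketch that swaps the OpenAI embeddings for a local sentence-transformers model. The function name and wiring are assumptions rather than the repository's API; the real integration points are `src/rag_doctor/database.py` and `src/rag_doctor/query.py`, and `EMBEDDING_LENGTH` has to match the new model's output dimension.

```python
# Sketch only: replacing the OpenAI embedding calls with a local sentence-transformers model.
# The function name is an assumption; wire it into database.py and query.py as appropriate.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

EMBEDDING_LENGTH = 384  # must match the model above when recreating the Qdrant collection

def embed_texts(texts: list[str]) -> list[list[float]]:
    # One embedding vector per input text
    return model.encode(texts, normalize_embeddings=True).tolist()
```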
This repository includes a comprehensive evaluation system that measures RAG performance across three key dimensions: retrieval quality, generation accuracy, and operational efficiency.
🤩 Show Me!
Retrieval Metrics:
- Context Coverage: Uses LLM-as-a-judge to assess whether retrieved documents contain the information needed to answer the question correctly
- Response Rate: Percentage of questions that receive valid responses
Generation Metrics:
- Factuality Score: LLM-based evaluation of answer accuracy (1-5 scale)
- Response Quality: Average length and substantive response rate
Operational Metrics:
- Latency: Estimated response time per query
- Cost: Token-based cost estimation for embeddings and LLM calls (see the sketch after this list)
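For a sense of how a token-based cost estimate works, the sketch below multiplies token counts by per-token prices. The prices and the breakdown are illustrative assumptions, not the values used by the evaluation step; check current provider pricing before trusting the output.

```python
# Back-of-the-envelope cost estimate; the prices (USD per 1M tokens) are illustrative assumptions.
PRICE_PER_1M_TOKENS = {
    "embedding_input": 0.10,  # e.g. an OpenAI embedding model
    "prompt_input": 0.15,     # e.g. gpt-4o-mini input tokens
    "prompt_output": 0.60,    # e.g. gpt-4o-mini output tokens
}

def estimate_cost_usd(embedding_tokens: int, input_tokens: int, output_tokens: int) -> float:
    return (
        embedding_tokens / 1_000_000 * PRICE_PER_1M_TOKENS["embedding_input"]
        + input_tokens / 1_000_000 * PRICE_PER_1M_TOKENS["prompt_input"]
        + output_tokens / 1_000_000 * PRICE_PER_1M_TOKENS["prompt_output"]
    )

print(f"Estimated cost: ${estimate_cost_usd(10_000, 50_000, 5_000):.4f}")
```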
Depending on the provider you evaluate, set the corresponding API key for the project:
- For OpenAI models: `OPENAI_API_KEY` (already set up during the initial setup)
- For Anthropic models: add `ANTHROPIC_API_KEY` if using `provider: anthropic`
- For other providers: add the corresponding API key as needed
- Navigate to the "Pipelines" tab and create a new pipeline
- Select the "rag-evaluation-pipeline" template
- Select which model provider you would like to evaluate (default: OpenAI) and the questions to test the knowledge base on.

- The pipeline will generate responses to the evaluation questions and score them against the gold standard answers
The evaluation step produces detailed metrics logged to Valohai's metadata system:
```json
{
    "response_rate": 1.0,
    "context_coverage": 0.85,
    "factuality_score": 4.2,
    "avg_response_length": 841.25,
    "substantive_rate": 0.9,
    "estimated_latency_seconds": 2.041,
    "estimated_cost_usd": 0.0021
}
```

These metrics help you:
- Monitor system performance over time
- Compare different models or configurations
- Validate changes before deploying to production
- Understand cost implications of your RAG system
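Valohai picks these up from JSON printed to standard output during an execution, so extending the evaluation with metrics of your own boils down to printing another JSON object; a minimal sketch:

```python
# Valohai collects JSON printed to stdout as execution metadata.
import json

metrics = {
    "response_rate": 1.0,
    "context_coverage": 0.85,
    "factuality_score": 4.2,
    "my_custom_metric": 0.5,  # any additional metric you want to track
}
print(json.dumps(metrics))
```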
The pipeline includes gold standard questions with ground truth answers for evaluation. You can customize these by:
- Creating your own evaluation dataset with columns `question` and `ground_truth_answer` (see the sketch after this list)
- Updating the `gold_standards` input in the `evaluate-rag` step in `valohai.yaml`
- Modifying the questions in the pipeline configuration
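For example, a minimal custom gold-standard CSV with the expected `question` and `ground_truth_answer` columns could be produced like this (the questions and answers below are placeholders):

```python
# Write a small gold-standard CSV with the columns the evaluate-rag step expects.
import csv

rows = [
    {"question": "How do I load a dataset?",
     "ground_truth_answer": "Use load_dataset() from the datasets library."},
    {"question": "How do I push a model to the Hub?",
     "ground_truth_answer": "Call push_to_hub() on the model or use the huggingface-cli."},
]

with open("gold_standards.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth_answer"])
    writer.writeheader()
    writer.writerows(rows)
```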
This evaluation framework follows MLOps best practices, providing the metrics needed to maintain and improve your RAG system in production.
By leveraging the Task feature in Valohai, you can compare different LLM providers (OpenAI vs. Anthropic) side by side to understand their performance characteristics and make an informed decision about which model works best for your use case.
- Navigate to the "Pipelines" tab and create a new pipeline
- Select the "rag-evaluation-pipeline" template
- Select the `generate-responses` node and convert it to a Task

This will automatically create executions for the model providers available in the `provider` parameter.
You can take a look at the input file given to the "embedding" node, create a similar CSV from your own documentation, and replace the input with that CSV.
You can also run the individual pieces locally by following the instructions in the DEVELOPMENT file.