diff --git a/Makefile b/Makefile index 3e4b4fb19..f00c9a6be 100644 --- a/Makefile +++ b/Makefile @@ -161,7 +161,7 @@ build-docs: ## Build all documentation @echo "Converting ipynb notebooks to md files..." $(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py @echo "Building ragas documentation..." - $(Q)uv run --group docs mkdocs build + $(Q)MKDOCS_CI=false uv run --group docs mkdocs build serve-docs: ## Build and serve documentation locally - $(Q)uv run --group docs mkdocs serve --dirtyreload + $(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload diff --git a/README.md b/README.md index e13b5bdc4..592fbf473 100644 --- a/README.md +++ b/README.md @@ -97,21 +97,39 @@ Available templates: ### Evaluate your LLM App -This is 5 main lines: +This is a simple example evaluating a summary for accuracy: ```python -from ragas import SingleTurnSample -from ragas.metrics import AspectCritic +import asyncio +from ragas.metrics.collections import AspectCritic +from ragas.llms import llm_factory +# Setup your LLM +llm = llm_factory("gpt-4o") + +# Create a metric +metric = AspectCritic( + name="summary_accuracy", + definition="Verify if the summary is accurate and captures key information.", + llm=llm +) + +# Evaluate test_data = { "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.", "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.", } -evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")) -metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.") -await metric.single_turn_ascore(SingleTurnSample(**test_data)) + +score = await metric.ascore( + user_input=test_data["user_input"], + response=test_data["response"] +) +print(f"Score: {score.value}") +print(f"Reason: {score.reason}") ``` +> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set. + Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals) ## Want help in improving your AI application using evals? diff --git a/docs/_static/imgs/results/rag_eval_result.png b/docs/_static/imgs/results/rag_eval_result.png new file mode 100644 index 000000000..5c86fab60 Binary files /dev/null and b/docs/_static/imgs/results/rag_eval_result.png differ diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md index 08de6b4cc..5179d67ee 100644 --- a/docs/getstarted/evals.md +++ b/docs/getstarted/evals.md @@ -2,183 +2,229 @@ The purpose of this guide is to illustrate a simple workflow for testing and evaluating an LLM application with `ragas`. It assumes minimum knowledge in AI application building and evaluation. Please refer to our [installation instruction](./install.md) for installing `ragas` +!!! tip "Get a Working Example" + The fastest way to see these concepts in action is to create a project using the quickstart command: -## Evaluation + === "uvx (Recommended)" + ```sh + uvx ragas quickstart rag_eval + cd rag_eval + uv sync + ``` -In this guide, you will evaluate a **text summarization pipeline**. 
The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information. + === "Install Ragas First" + ```sh + pip install ragas + ragas quickstart rag_eval + cd rag_eval + uv sync + ``` -`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/available_metrics/index.md). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. + This generates a complete project with sample code. Follow along with this guide to understand what's happening in your generated code. Let's get started! -### Evaluating using a Non-LLM Metric +## Project Structure -Here is a simple example that uses `BleuScore` to score a summary: +Here's what gets created for you: -```python -from ragas import SingleTurnSample -from ragas.metrics import BleuScore - -test_data = { - "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.", - "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.", - "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter." -} -metric = BleuScore() -test_data = SingleTurnSample(**test_data) -metric.single_turn_score(test_data) +```sh +rag_eval/ +├── README.md # Project documentation and setup instructions +├── pyproject.toml # Project configuration for uv and pip +├── evals.py # Your evaluation workflow +├── rag.py # Your RAG/LLM application +├── __init__.py # Makes this a Python package +└── evals/ # Evaluation artifacts + ├── datasets/ # Test data files (optional) + ├── experiments/ # Results from running evaluations (CSV files saved here) + └── logs/ # Evaluation execution logs ``` -Output -``` -0.137 -``` +**Key files to focus on:** -Here we used: +- **`evals.py`** - Your evaluation workflow with dataset loading and evaluation logic +- **`rag.py`** - Your RAG/LLM application code (query engine, retrieval, etc.) -- A test sample containing `user_input`, `response` (the output from the LLM), and `reference` (the expected output from the LLM) as data points to evaluate the summary. -- A non-LLM metric called [BleuScore](../concepts/metrics/available_metrics/traditional.md#bleu-score) +## Understanding the Code +In your generated project's `evals.py` file, you'll see the main workflow pattern: -As you may observe, this approach has two key limitations: +1. **Load Dataset** - Define your test cases with `SingleTurnSample` +2. **Query RAG System** - Get responses from your application +3. **Evaluate Responses** - Validate responses against ground truth +4. **Display Results** - Show evaluation summary in console +5. 
**Save Results** - Automatically saved to CSV in `evals/experiments/` directory -- **Time-consuming preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging. +The template provides modular functions you can customize: -- **Inaccurate scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`. +```python +from ragas.dataset_schema import SingleTurnSample +from ragas import EvaluationDataset +def load_dataset(): + """Load test dataset for evaluation.""" + data_samples = [ + SingleTurnSample( + user_input="What is Ragas?", + response="", # Will be filled by querying RAG + reference="Ragas is an evaluation framework for LLM applications", + retrieved_contexts=[], + ), + # Add more test cases... + ] + return EvaluationDataset(samples=data_samples) +``` -!!! info - A **non-LLM metric** refers to a metric that does not rely on an LLM for evaluation. +You can extend this with [metrics](../concepts/metrics/available_metrics/index.md) and more sophisticated evaluation logic. Learn more about [evaluation in Ragas](../concepts/evaluation/index.md). -To address these issues, let's try an LLM-based metric. +### Choosing Your LLM Provider +Your quickstart project initializes the OpenAI LLM by default in the `_init_clients()` function. You can easily swap to any provider through the `llm_factory`: -### Evaluating using a LLM-based Metric +=== "OpenAI" + Set your OpenAI API key: + ```sh + export OPENAI_API_KEY="your-openai-key" + ``` -**Choose your LLM** ---8<-- -choose_evaluator_llm.md ---8<-- + In your `evals.py` `_init_clients()` function: -**Evaluation** + ```python + from ragas.llms import llm_factory -Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which is an LLM based metric that outputs pass/fail given the evaluation criteria. + llm = llm_factory("gpt-4o") + ``` -```python -from ragas import SingleTurnSample -from ragas.metrics import AspectCritic + This is already set up in your quickstart project! -test_data = { - "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.", - "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.", -} +=== "Anthropic Claude" + Set your Anthropic API key: -metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.") -test_data = SingleTurnSample(**test_data) -await metric.single_turn_ascore(test_data) + ```sh + export ANTHROPIC_API_KEY="your-anthropic-key" + ``` -``` + In your `evals.py` `_init_clients()` function: -Output -``` -1 -``` + ```python + from ragas.llms import llm_factory -Success! Here 1 means pass and 0 means fail + llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic") + ``` -!!! info - There are many other types of metrics that are available in `ragas` (with and without `reference`), and you may also create your own metrics if none of those fits your case. 
To explore this more checkout [more on metrics](../concepts/metrics/index.md). +=== "Google Gemini" + Set up your Google credentials: -### Evaluating on a Dataset + ```sh + export GOOGLE_API_KEY="your-google-api-key" + ``` -In the examples above, we used only a single sample to evaluate our application. However, evaluating on just one sample is not robust enough to trust the results. To ensure the evaluation is reliable, you should add more test samples to your test data. + In your `evals.py` `_init_clients()` function: -Here, we’ll load a dataset from Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just ensure that each sample includes all the required attributes for the chosen metric. + ```python + from ragas.llms import llm_factory -In our case, the required attributes are: -- **`user_input`**: The input provided to the application (here the input text report). -- **`response`**: The output generated by the application (here the generated summary). + llm = llm_factory("gemini-1.5-pro", provider="google") + ``` -For example +=== "Local Models (Ollama)" + Install and run Ollama locally, then in your `evals.py` `_init_clients()` function: -```python -[ - # Sample 1 - { - "user_input": "summarise given text\nThe Q2 earnings report revealed a significant 15% increase in revenue, ...", - "response": "The Q2 earnings report showed a 15% revenue increase, ...", - }, - # Additional samples in the dataset - ...., - # Sample N - { - "user_input": "summarise given text\nIn 2023, North American sales experienced a 5% decline, ...", - "response": "Companies are strategizing to adapt to market challenges and ...", - } -] -``` + ```python + from ragas.llms import llm_factory -```python -from datasets import load_dataset -from ragas import EvaluationDataset -eval_dataset = load_dataset("explodinggradients/earning_report_summary",split="train") -eval_dataset = EvaluationDataset.from_hf_dataset(eval_dataset) -print("Features in dataset:", eval_dataset.features()) -print("Total samples in dataset:", len(eval_dataset)) -``` + llm = llm_factory( + "mistral", + provider="ollama", + base_url="http://localhost:11434" # Default Ollama URL + ) + ``` -Output -``` -Features in dataset: ['user_input', 'response'] -Total samples in dataset: 50 -``` +=== "Custom / Other Providers" + For any LLM with OpenAI-compatible API: -Evaluate using dataset + ```python + from ragas.llms import llm_factory -```python -from ragas import evaluate + llm = llm_factory( + "model-name", + api_key="your-api-key", + base_url="https://your-api-endpoint" + ) + ``` -results = evaluate(eval_dataset, metrics=[metric]) -results -``` + For more details, learn about [LLM integrations](../concepts/metrics/index.md). -!!! tip "Async Usage" - For production async applications, use `aevaluate()` to avoid event loop conflicts: - ```python - from ragas import aevaluate +### Using Pre-Built Metrics - # In an async function - results = await aevaluate(eval_dataset, metrics=[metric]) - ``` +`ragas` comes with pre-built metrics for common evaluation tasks. 
For example, [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md) evaluates any aspect of your output: - Or disable nest_asyncio in sync code: - ```python - results = evaluate(eval_dataset, metrics=[metric], allow_nest_asyncio=False) - ``` +```python +from ragas.metrics.collections import AspectCritic +from ragas.llms import llm_factory -Output -``` -{'summary_accuracy': 0.84} +# Setup your evaluator LLM +evaluator_llm = llm_factory("gpt-4o") + +# Use a pre-built metric +metric = AspectCritic( + name="summary_accuracy", + definition="Verify if the summary is accurate and captures key information.", + llm=evaluator_llm +) + +# Score your application's output +score = await metric.ascore( + user_input="Summarize this text: ...", + response="The summary of the text is..." +) +print(f"Score: {score.value}") # 1 = pass, 0 = fail +print(f"Reason: {score.reason}") ``` -This score shows that out of all the samples in our test data, only 84% of summaries passes the given evaluation criteria. Now, **It's -important to see why is this the case**. +Pre-built metrics like this save you from defining evaluation logic from scratch. Explore [all available metrics](../concepts/metrics/available_metrics/index.md). -Export the sample level scores to pandas dataframe +!!! info + There are many other types of metrics that are available in `ragas` (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md). + +### Evaluating on a Dataset + +In your quickstart project, you'll see in the `load_dataset()` function, which creates test data with multiple samples: ```python -results.to_pandas() -``` +from ragas import Dataset -Output -``` - user_input response summary_accuracy -0 summarise given text\nThe Q2 earnings report r... The Q2 earnings report showed a 15% revenue in... 1 -1 summarise given text\nIn 2023, North American ... Companies are strategizing to adapt to market ... 1 -2 summarise given text\nIn 2022, European expans... Many companies experienced a notable 15% growt... 1 -3 summarise given text\nSupply chain challenges ... Supply chain challenges in North America, caus... 1 +# Create a dataset with multiple test samples +dataset = Dataset( + name="test_dataset", + backend="local/csv", # Can also use JSONL, Google Drive, or in-memory + root_dir=".", +) + +# Add samples to the dataset +data_samples = [ + { + "user_input": "What is ragas?", + "response": "Ragas is an evaluation framework...", + "expected": "Ragas provides objective metrics..." + }, + { + "user_input": "How do metrics work?", + "response": "Metrics score your application...", + "expected": "Metrics evaluate performance..." + }, +] + +for sample in data_samples: + dataset.append(sample) + +# Save to disk +dataset.save() ``` -Viewing the sample-level results in a CSV file, as shown above, is fine for quick checks but not ideal for detailed analysis or comparing results across evaluation runs. +This gives you multiple test cases instead of evaluating one example at a time. Learn more about [datasets and experiments](../concepts/components/eval_dataset.md). + +Your generated project includes sample data in the `evals/datasets/` folder - you can edit those files to add more test cases. ### Want help in improving your AI application using evals? 
diff --git a/docs/getstarted/index.md b/docs/getstarted/index.md index 26ae9923a..0931d1ea7 100644 --- a/docs/getstarted/index.md +++ b/docs/getstarted/index.md @@ -11,7 +11,8 @@ If you have any questions about Ragas, feel free to join and ask in the `#questi Let's get started! +- [Quick Start: Get Running in 5 Minutes](./quickstart.md) - [Evaluate your first AI app](./evals.md) - [Run ragas metrics for evaluating RAG](rag_eval.md) - [Generate test data for evaluating RAG](rag_testset_generation.md) -- [Run your first experiment](experiments_quickstart.md) \ No newline at end of file +- [Run your first experiment](experiments_quickstart.md) diff --git a/docs/getstarted/quickstart.md b/docs/getstarted/quickstart.md new file mode 100644 index 000000000..a81bf12e0 --- /dev/null +++ b/docs/getstarted/quickstart.md @@ -0,0 +1,176 @@ +# Quick Start: Get Evaluations Running in a Flash + +Get started with Ragas in minutes. Create a complete evaluation project with just a few commands. + +## Step 1: Create Your Project + +Choose one of the following methods: + +=== "uvx (Recommended)" + No installation required. `uvx` automatically downloads and runs ragas: + + ```sh + uvx ragas quickstart rag_eval + cd rag_eval + ``` + +=== "Install Ragas First" + Install ragas first, then create the project: + + ```sh + pip install ragas + ragas quickstart rag_eval + cd rag_eval + ``` + +## Step 2: Install Dependencies + +Install the project dependencies: + +```sh +uv sync +``` + +Or if you prefer `pip`: + +```sh +pip install -e . +``` + +## Step 3: Set Your API Key + +Choose your LLM provider and set the environment variable: + +```sh +# OpenAI (default) +export OPENAI_API_KEY="your-openai-key" + +# Or use Anthropic Claude +export ANTHROPIC_API_KEY="your-anthropic-key" + +# Or use Google Gemini +export GOOGLE_API_KEY="your-google-key" +``` + +## Project Structure + +Your generated project includes: + +```sh +rag_eval/ +├── README.md # Project documentation +├── pyproject.toml # Project configuration +├── rag.py # Your RAG application +├── evals.py # Evaluation workflow +├── __init__.py # Makes this a Python package +└── evals/ + ├── datasets/ # Test data files + ├── experiments/ # Evaluation results + └── logs/ # Execution logs +``` + +## Step 4: Run Your Evaluation + +Run the evaluation script: + +```sh +uv run python evals.py +``` + +Or if you installed with `pip`: + +```sh +python evals.py +``` + +The evaluation will: +- Load test data from the `load_dataset()` function in `evals.py` +- Query your RAG application with test questions +- Evaluate responses +- Display results in the console +- Save results to CSV in the `evals/experiments/` directory + +![](../_static/imgs/results/rag_eval_result.png) + +--- + +## Customize Your Evaluation + +### Add More Test Cases + +Edit the `load_dataset()` function in `evals.py` to add more test questions: + +```python +from ragas.dataset_schema import SingleTurnSample + +def load_dataset(): + """Load test dataset for evaluation.""" + data_samples = [ + SingleTurnSample( + user_input="What is Ragas?", + response="", # Will be filled by querying RAG + reference="Ragas is an evaluation framework for LLM applications", + retrieved_contexts=[], + ), + SingleTurnSample( + user_input="How do metrics work?", + response="", + reference="Metrics evaluate the quality and performance of LLM responses", + retrieved_contexts=[], + ), + # Add more test cases here + ] + + dataset = EvaluationDataset(samples=data_samples) + return dataset +``` + +### Change the LLM Provider + +In the 
`_init_clients()` function in `evals.py`, update the LLM factory call: + +```python +from ragas.llms import llm_factory + +def _init_clients(): + """Initialize OpenAI client and RAG system.""" + openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY")) + rag_client = default_rag_client(llm_client=openai_client) + + # Use Anthropic Claude instead + llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic") + + # Or use Google Gemini + # llm = llm_factory("gemini-1.5-pro", provider="google") + + # Or use local Ollama + # llm = llm_factory("mistral", provider="ollama", base_url="http://localhost:11434") + + return openai_client, rag_client, llm +``` + +### Customize Dataset and RAG System + +The template includes: +- `load_dataset()` - Define your test cases with `SingleTurnSample` +- `query_rag_system()` - Connect to your RAG system +- `evaluate_dataset()` - Implement your evaluation logic +- `display_results()` - Show results in the console +- `save_results_to_csv()` - Export results to CSV + +Edit these functions to customize your evaluation workflow. + +## What's Next? + +- **Learn the concepts**: Read the [Evaluate a Simple LLM Application](evals.md) guide for deeper understanding +- **Custom metrics**: [Write your own metrics](../howtos/customizations/metrics/_write_your_own_metric.md) tailored to your use case +- **Production integration**: [Integrate evaluations into your CI/CD pipeline](../howtos/index.md) +- **RAG evaluation**: Evaluate [RAG systems](rag_eval.md) with specialized metrics +- **Agent evaluation**: Explore [AI agent evaluation](../howtos/applications/text2sql.md) +- **Test data generation**: [Generate synthetic test datasets](rag_testset_generation.md) for your evaluations + +## Getting Help + +- 📚 [Full Documentation](https://docs.ragas.io/) +- 💬 [Join our Discord Community](https://discord.gg/5djav8GGNZ) +- 🐛 [Report Issues](https://github.com/explodinggradients/ragas/issues) diff --git a/docs/howtos/applications/align-llm-as-judge.md b/docs/howtos/applications/align-llm-as-judge.md index 6ef8edd46..dc3865a70 100644 --- a/docs/howtos/applications/align-llm-as-judge.md +++ b/docs/howtos/applications/align-llm-as-judge.md @@ -185,7 +185,7 @@ async def judge_experiment( ```python import os from openai import AsyncOpenAI -from ragas.llms import instructor_llm_factory +from ragas.llms import llm_factory from ragas_examples.judge_alignment import load_dataset # Load dataset @@ -194,7 +194,7 @@ print(f"Dataset loaded with {len(dataset)} samples") # Initialize LLM client openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY")) -llm = instructor_llm_factory("openai", model="gpt-4o-mini", client=openai_client) +llm = llm_factory("gpt-4o-mini", client=openai_client) # Run the experiment results = await judge_experiment.arun( diff --git a/docs/howtos/integrations/_haystack.md b/docs/howtos/integrations/_haystack.md index ba99746bb..be4435f45 100644 --- a/docs/howtos/integrations/_haystack.md +++ b/docs/howtos/integrations/_haystack.md @@ -50,7 +50,7 @@ docs = [Document(content=doc) for doc in dataset] ```python -from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder +from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small") text_embedder = OpenAITextEmbedder(model="text-embedding-3-small") @@ -133,8 +133,8 @@ Make sure to include all relevant data for each metric to ensure accurate evalua ```python from 
haystack_integrations.components.evaluators.ragas import RagasEvaluator - from langchain_openai import ChatOpenAI + from ragas.llms import LangchainLLMWrapper from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness @@ -252,7 +252,7 @@ In the example below, we will define two custom Ragas metrics: ```python -from ragas.metrics import RubricsScore, AspectCritic +from ragas.metrics import AspectCritic, RubricsScore SportsRelevanceMetric = AspectCritic( name="sports_relevance_metric", diff --git a/docs/howtos/integrations/_helicone.md b/docs/howtos/integrations/_helicone.md index 318f2d80b..e4e781577 100644 --- a/docs/howtos/integrations/_helicone.md +++ b/docs/howtos/integrations/_helicone.md @@ -25,16 +25,18 @@ First, let's install the required packages and set up our environment. ```python import os + from datasets import Dataset + from ragas import evaluate -from ragas.metrics import faithfulness, answer_relevancy, context_precision from ragas.integrations.helicone import helicone_config # import helicone_config - +from ragas.metrics import answer_relevancy, context_precision, faithfulness # Set up Helicone -helicone_config.api_key = ( +HELICONE_API_KEY = ( "your_helicone_api_key_here" # Replace with your actual Helicone API key ) +helicone_config.api_key = HELICONE_API_KEY os.environ["OPENAI_API_KEY"] = ( "your_openai_api_key_here" # Replace with your actual OpenAI API key ) diff --git a/docs/howtos/integrations/_langchain.md b/docs/howtos/integrations/_langchain.md index 0a31b98cf..475565c0b 100644 --- a/docs/howtos/integrations/_langchain.md +++ b/docs/howtos/integrations/_langchain.md @@ -13,8 +13,9 @@ With this integration you can easily evaluate your QA chains with the metrics of ```python # attach to the existing event loop when using jupyter notebooks -import nest_asyncio import os + +import nest_asyncio import openai from dotenv import load_dotenv @@ -35,9 +36,9 @@ First lets load the dataset. 
We are going to build a generic QA system over the ```python -from langchain_community.document_loaders import TextLoader -from langchain.indexes import VectorstoreIndexCreator from langchain.chains import RetrievalQA +from langchain.indexes import VectorstoreIndexCreator +from langchain_community.document_loaders import TextLoader from langchain_openai import ChatOpenAI loader = TextLoader("./nyc_wikipedia/nyc_text.txt") @@ -155,10 +156,10 @@ result["result"] ```python from ragas.langchain.evalchain import RagasEvaluatorChain from ragas.metrics import ( - faithfulness, answer_relevancy, context_precision, context_recall, + faithfulness, ) # create evaluation chains diff --git a/docs/howtos/integrations/_langsmith.md b/docs/howtos/integrations/_langsmith.md index d936c1f43..cedbe71e3 100644 --- a/docs/howtos/integrations/_langsmith.md +++ b/docs/howtos/integrations/_langsmith.md @@ -26,9 +26,9 @@ Once langsmith is setup, just run the evaluations as your normally would ```python from datasets import load_dataset -from ragas.metrics import context_precision, answer_relevancy, faithfulness -from ragas import evaluate +from ragas import evaluate +from ragas.metrics import answer_relevancy, context_precision, faithfulness fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval") diff --git a/docs/howtos/integrations/_zeno.md b/docs/howtos/integrations/_zeno.md index ebf2a6ca0..1386313f9 100644 --- a/docs/howtos/integrations/_zeno.md +++ b/docs/howtos/integrations/_zeno.md @@ -13,7 +13,7 @@ pip install zeno-client Next, create an account at [hub.zenoml.com](https://hub.zenoml.com) and generate an API key on your [account page](https://hub.zenoml.com/account). -We can now pick up the evaluation where we left off at the [Getting Started](./../../getstarted/index.md) guide: +We can now pick up the evaluation where we left off at the [Getting Started](../../getstarted/evaluation.md) guide: ```python @@ -21,6 +21,8 @@ import os import pandas as pd from datasets import load_dataset +from zeno_client import ZenoClient, ZenoMetric + from ragas import evaluate from ragas.metrics import ( answer_relevancy, @@ -28,7 +30,6 @@ from ragas.metrics import ( context_recall, faithfulness, ) -from zeno_client import ZenoClient, ZenoMetric ``` diff --git a/examples/ragas_examples/rag_eval/evals.py b/examples/ragas_examples/rag_eval/evals.py index c88fcaad3..de33483a3 100644 --- a/examples/ragas_examples/rag_eval/evals.py +++ b/examples/ragas_examples/rag_eval/evals.py @@ -1,4 +1,6 @@ import os +import sys +from pathlib import Path from openai import OpenAI @@ -6,7 +8,9 @@ from ragas.llms import llm_factory from ragas.metrics import DiscreteMetric -from .rag import default_rag_client +# Add the current directory to the path so we can import rag module when run as a script +sys.path.insert(0, str(Path(__file__).parent)) +from rag import default_rag_client openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY")) rag_client = default_rag_client(llm_client=openai_client) @@ -77,6 +81,11 @@ async def main(): print("Experiment completed successfully!") print("Experiment results:", experiment_results) + # Save experiment results to CSV + experiment_results.save() + csv_path = Path(".") / "experiments" / f"{experiment_results.name}.csv" + print(f"\nExperiment results saved to: {csv_path.resolve()}") + if __name__ == "__main__": import asyncio diff --git a/examples/ragas_examples/rag_eval/pyproject.toml b/examples/ragas_examples/rag_eval/pyproject.toml new file mode 100644 index 000000000..f7d1d146d --- 
/dev/null +++ b/examples/ragas_examples/rag_eval/pyproject.toml @@ -0,0 +1,26 @@ +[build-system] +requires = ["setuptools>=45", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "rag-eval" +version = "0.1.0" +description = "RAG evaluation example using Ragas" +requires-python = ">=3.9" +dependencies = [ + "ragas[all]>=0.3.0", + "openai>=1.0.0", +] + +[project.optional-dependencies] +dev = [ + "pytest>=7.0", +] + +[tool.setuptools] +py-modules = [] + +[tool.uv] +managed = true +# Note: When developing locally, use: +# uv sync --override ragas@path/to/ragas diff --git a/mkdocs.yml b/mkdocs.yml index 61883d0ce..673f45b0c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,7 @@ nav: - "": index.md - 🚀 Get Started: - getstarted/index.md + - Quick Start: getstarted/quickstart.md - Installation: getstarted/install.md - Evaluate your first LLM App: getstarted/evals.md - Evaluate a simple RAG: getstarted/rag_eval.md @@ -254,7 +255,8 @@ extra: provider: google property: !ENV GOOGLE_ANALYTICS_KEY plugins: - - social + - social: + enabled: !ENV [MKDOCS_CI, true] - search - git-revision-date-localized: enabled: !ENV [MKDOCS_CI, false] diff --git a/src/ragas/cli.py b/src/ragas/cli.py index 6b901f685..09a0538b1 100644 --- a/src/ragas/cli.py +++ b/src/ragas/cli.py @@ -483,32 +483,34 @@ def quickstart( from pathlib import Path # Define available templates with descriptions + # Currently only rag_eval is available and fully tested templates = { "rag_eval": { "name": "RAG Evaluation", "description": "Evaluate a RAG (Retrieval Augmented Generation) system with custom metrics", "source_path": "ragas_examples/rag_eval", }, - "agent_evals": { - "name": "Agent Evaluation", - "description": "Evaluate AI agents with structured metrics and workflows", - "source_path": "ragas_examples/agent_evals", - }, - "benchmark_llm": { - "name": "LLM Benchmarking", - "description": "Benchmark and compare different LLM models with datasets", - "source_path": "ragas_examples/benchmark_llm", - }, - "prompt_evals": { - "name": "Prompt Evaluation", - "description": "Evaluate and compare different prompt variations", - "source_path": "ragas_examples/prompt_evals", - }, - "workflow_eval": { - "name": "Workflow Evaluation", - "description": "Evaluate complex LLM workflows and pipelines", - "source_path": "ragas_examples/workflow_eval", - }, + # Coming soon - not yet fully implemented: + # "agent_evals": { + # "name": "Agent Evaluation", + # "description": "Evaluate AI agents with structured metrics and workflows", + # "source_path": "ragas_examples/agent_evals", + # }, + # "benchmark_llm": { + # "name": "LLM Benchmarking", + # "description": "Benchmark and compare different LLM models with datasets", + # "source_path": "ragas_examples/benchmark_llm", + # }, + # "prompt_evals": { + # "name": "Prompt Evaluation", + # "description": "Evaluate and compare different prompt variations", + # "source_path": "ragas_examples/prompt_evals", + # }, + # "workflow_eval": { + # "name": "Workflow Evaluation", + # "description": "Evaluate complex LLM workflows and pipelines", + # "source_path": "ragas_examples/workflow_eval", + # }, } # If no template specified, list available templates @@ -533,7 +535,7 @@ def quickstart( console.print(" ragas quickstart [template_name]") console.print("\n[bold]Example:[/bold]") console.print(" ragas quickstart rag_eval") - console.print(" ragas quickstart agent_evals --output-dir ./my-project\n") + console.print(" ragas quickstart rag_eval --output-dir ./my-project\n") return # Validate template 
name @@ -648,7 +650,14 @@ def quickstart( console=console, ) as live: live.update(Spinner("dots", text="Copying template files...", style="green")) - shutil.copytree(source_path, output_path) + + # Copy template but exclude .venv and __pycache__ + def ignore_patterns(directory, files): + return { + f for f in files if f in {".venv", "__pycache__", "*.pyc", "uv.lock"} + } + + shutil.copytree(source_path, output_path, ignore=ignore_patterns) time.sleep(0.3) live.update( @@ -668,55 +677,107 @@ def quickstart( {template_info["description"]} -## Setup +## Quick Start + +### 1. Set Your API Key + +Choose your LLM provider: + +```bash +# OpenAI (default) +export OPENAI_API_KEY="your-openai-key" + +# Or use Anthropic Claude +export ANTHROPIC_API_KEY="your-anthropic-key" + +# Or use Google Gemini +export GOOGLE_API_KEY="your-google-key" +``` + +### 2. Install Dependencies + +Using `uv` (recommended): + +```bash +uv sync +``` + +Or using `pip`: + +```bash +pip install -e . +``` + +### 3. Run the Evaluation + +Using `uv`: + +```bash +uv run python evals.py +``` + +Or using `pip`: -1. Set your OpenAI API key (or other LLM provider): - ```bash - export OPENAI_API_KEY="your-api-key" - ``` +```bash +python evals.py +``` -2. Install dependencies: - ```bash - pip install ragas openai - ``` +### 4. Export Results to CSV -## Running the Example +Using `uv`: -Run the evaluation: ```bash -python app.py +uv run python export_csv.py ``` -Or run via the CLI: +Or using `pip`: + ```bash -ragas evals evals/evals.py --dataset test_data --metrics [metric_names] +python export_csv.py ``` ## Project Structure ``` {template}/ -├── app.py # Your application code (RAG system, agent, etc.) -├── evals/ # Evaluation-related code and data -│ ├── evals.py # Evaluation metrics and experiment definitions -│ ├── datasets/ # Test datasets -│ ├── experiments/ # Experiment results -│ └── logs/ # Evaluation logs and traces -└── README.md +├── README.md # This file +├── pyproject.toml # Project configuration +├── rag.py # Your RAG application code +├── evals.py # Evaluation workflow +├── export_csv.py # CSV export utility +├── __init__.py # Makes this a Python package +└── evals/ # Evaluation-related data + ├── datasets/ # Test datasets + ├── experiments/ # Experiment results (CSVs saved here) + └── logs/ # Evaluation logs and traces +``` + +## Customization + +### Modify the LLM Provider + +In `evals.py`, update the LLM configuration: + +```python +from ragas.llms import llm_factory + +# Use Anthropic Claude +llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic") + +# Use Google Gemini +llm = llm_factory("gemini-1.5-pro", provider="google") + +# Use local Ollama +llm = llm_factory("mistral", provider="ollama", base_url="http://localhost:11434") ``` -This structure separates your application code from evaluation code, making it easy to: -- Develop and test your application independently -- Run evaluations without mixing concerns -- Track evaluation results separately from application logic +### Customize Test Cases + +Edit the `load_dataset()` function in `evals.py` to add or modify test cases. -## Next Steps +### Change Evaluation Metrics -1. Implement your application logic in `app.py` -2. Review and modify the metrics in `evals/evals.py` -3. Customize the dataset in `evals/datasets/` -4. Run experiments and analyze results -5. Iterate on your prompts and system design +Update the `my_metric` definition in `evals.py` to use different grading criteria. 
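+For example, you might swap in a simple pass/fail grader like the sketch below. The `prompt` and `allowed_values` fields shown here are illustrative assumptions; mirror the exact `DiscreteMetric` signature already used in your generated `evals.py`.
+
+```python
+from ragas.metrics import DiscreteMetric
+
+# Illustrative only: adjust the wording and fields to match the
+# definition that ships in the generated evals.py.
+my_metric = DiscreteMetric(
+    name="response_correctness",
+    prompt="Grade whether the response matches the expected answer. Return pass or fail.",
+    allowed_values=["pass", "fail"],
+)
+```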
## Documentation

@@ -741,15 +802,12 @@ def quickstart(
     # Success message with next steps
     success(f"\n✓ Created {template_info['name']} project at: {output_path}")
     console.print("\n[bold cyan]Next Steps:[/bold cyan]")
-    console.print(f" 1. cd {output_path}")
-    console.print(" 2. export OPENAI_API_KEY='your-api-key'")
-    console.print(" 3. pip install ragas openai")
-    console.print(" 4. python app.py")
-    console.print("\n[bold]Project Structure:[/bold]")
-    console.print(" app.py - Your application code")
-    console.print(" evals/ - All evaluation-related code and data")
-    console.print("\n[bold]Quick Start:[/bold]")
-    console.print(f" cd {output_path} && python app.py\n")
+    console.print(f" cd {output_path}")
+    console.print(" uv sync")
+    console.print(" export OPENAI_API_KEY='your-api-key'")
+    console.print(" uv run python evals.py")
+    console.print("\n📚 For detailed instructions, see:")
+    console.print(" https://docs.ragas.io/en/latest/getstarted/quickstart/\n")


 @app.command()
diff --git a/tests/unit/test_cli.py b/tests/unit/test_cli.py
index 2ad5f7ffc..5ca58f83e 100644
--- a/tests/unit/test_cli.py
+++ b/tests/unit/test_cli.py
@@ -44,8 +44,8 @@ def test_quickstart_list_templates():
     assert result.exit_code == 0
     assert "Available Ragas Quickstart Templates" in result.stdout
     assert "rag_eval" in result.stdout
-    assert "agent_evals" in result.stdout
-    assert "benchmark_llm" in result.stdout
+    # Note: Other templates (agent_evals, benchmark_llm, etc.) are currently hidden
+    # as they are not yet fully implemented. Only rag_eval is available.


 def test_quickstart_invalid_template():