Commit da87e80 ("main sync")

2 parents: cd28d56 + e86105b

File tree: 120 files changed (+8762 / -1417 lines)


Makefile

Lines changed: 2 additions & 2 deletions
@@ -161,7 +161,7 @@ build-docs: ## Build all documentation
 	@echo "Converting ipynb notebooks to md files..."
 	$(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py
 	@echo "Building ragas documentation..."
-	$(Q)uv run --group docs mkdocs build
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs build

 serve-docs: ## Build and serve documentation locally
-	$(Q)uv run --group docs mkdocs serve --dirtyreload
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload
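For context, `MKDOCS_CI` is an environment variable the Makefile now sets explicitly for both build and serve targets. The sketch below shows the usual pattern for reading such a flag inside a docs script such as `docs/ipynb_to_md.py`; the variable name comes from the diff, but the reading logic and the described behaviour are assumptions for illustration, not the actual script.

```python
import os

# Assumed pattern only: read the MKDOCS_CI flag set by the Makefile targets above.
# "true" would enable CI-only behaviour (e.g. lighter notebook conversion);
# "false" would keep the full local behaviour.
MKDOCS_CI = os.environ.get("MKDOCS_CI", "false").strip().lower() == "true"

if MKDOCS_CI:
    print("Running in CI mode.")
else:
    print("Running locally.")
```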

README.md

Lines changed: 47 additions & 7 deletions
@@ -48,7 +48,7 @@ Ragas is your ultimate toolkit for evaluating and optimizing Large Language Mode
 Don't have a test dataset ready? We also do production-aligned test set generation.

 > [!NOTE]
-> Need help setting up Evals for your AI application? We'd love to help! We are conducting Office Hours every week. You can sign up [here](https://cal.com/team/ragas/office-hours).
+> Need help setting up Evals for your AI application? We'd love to help! We are conducting Office Hours every week. You can sign up [here](https://cal.com/team/vibrantlabs/office-hours).

 ## Key Features

@@ -73,23 +73,63 @@ pip install git+https://github.com/explodinggradients/ragas

 ## :fire: Quickstart

+### Clone a Complete Example Project
+
+The fastest way to get started is to use the `ragas quickstart` command:
+
+```bash
+# List available templates
+ragas quickstart
+
+# Create a RAG evaluation project
+ragas quickstart rag_eval
+
+# Create an agent evaluation project
+ragas quickstart agent_evals -o ./my-project
+```
+
+Available templates:
+- `rag_eval` - Evaluate RAG systems
+- `agent_evals` - Evaluate AI agents
+- `benchmark_llm` - Benchmark and compare LLMs
+- `prompt_evals` - Evaluate prompt variations
+- `workflow_eval` - Evaluate complex workflows
+
 ### Evaluate your LLM App

-This is 5 main lines:
+This is a simple example evaluating a summary for accuracy:

 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import AspectCritic
+import asyncio
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
+
+# Setup your LLM
+llm = llm_factory("gpt-4o")

+# Create a metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=llm
+)
+
+# Evaluate
 test_data = {
     "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
     "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
 }
-evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
-metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.")
-await metric.single_turn_ascore(SingleTurnSample(**test_data))
+
+score = await metric.ascore(
+    user_input=test_data["user_input"],
+    response=test_data["response"]
+)
+print(f"Score: {score.value}")
+print(f"Reason: {score.reason}")
 ```

+> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set.
+
 Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals)

 ## Want help in improving your AI application using evals?
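A note on the new README snippet: the top-level `await metric.ascore(...)` works in notebooks and other async contexts, but a plain Python script needs an event loop. A minimal sketch of running the same example from a script, reusing the `llm_factory` and `AspectCritic` calls shown in the diff above (the placeholder strings are illustrative):

```python
import asyncio

from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic


async def main() -> None:
    # Same setup as the README example above.
    llm = llm_factory("gpt-4o")
    metric = AspectCritic(
        name="summary_accuracy",
        definition="Verify if the summary is accurate and captures key information.",
        llm=llm,
    )
    score = await metric.ascore(
        user_input="summarise given text\n<your text here>",
        response="<your summary here>",
    )
    print(score.value, score.reason)


if __name__ == "__main__":
    # asyncio.run drives the coroutine when no event loop is already running.
    asyncio.run(main())
```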

docs/concepts/components/eval_dataset.md

Lines changed: 2 additions & 1 deletion
@@ -68,6 +68,7 @@ sample3 = SingleTurnSample(
 ```

 **Step 3:** Create the EvaluationDataset
+
 Create an EvaluationDataset by passing a list of SingleTurnSample instances.

 ```python
@@ -91,4 +92,4 @@ Load the dataset into a Ragas EvaluationDataset object.
 from ragas import EvaluationDataset

 eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
-```
+```
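For readers following this doc change, here is a short sketch of the two construction paths the page describes: building an `EvaluationDataset` from `SingleTurnSample` objects and loading one from a Hugging Face split. The `EvaluationDataset(samples=...)` constructor is the commonly documented form, but treat the exact signature as an assumption; the sample contents are illustrative.

```python
from ragas import EvaluationDataset, SingleTurnSample

# Path 1 (the doc's Step 3): build the dataset from individual samples.
samples = [
    SingleTurnSample(
        user_input="What is the capital of France?",
        response="Paris is the capital of France.",
    ),
    SingleTurnSample(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
    ),
]
eval_dataset = EvaluationDataset(samples=samples)

# Path 2 (the doc's later step): load an existing Hugging Face split.
# eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
```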

docs/concepts/experimentation.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ graph LR

 ## Creating Experiments with Ragas

-Ragas provides an `@experiment` decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see [Run your first experiment](../getstarted/experiments_quickstart.md).
+Ragas provides an `@experiment` decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see the [Quick Start guide](../getstarted/quickstart.md).

 ### Basic Experiment Structure

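Since the changed paragraph only swaps a link, here is a rough sketch of what the `@experiment` decorator it refers to typically looks like. The `from ragas import experiment` import, the row shape, and the `arun` call are assumptions based on recent Ragas releases, not part of this diff; `my_app` is a stand-in for the application under test.

```python
from ragas import experiment  # assumed import path for the @experiment decorator
from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic

llm = llm_factory("gpt-4o")
metric = AspectCritic(
    name="summary_accuracy",
    definition="Verify if the summary is accurate.",
    llm=llm,
)


def my_app(user_input: str) -> str:
    # Stand-in for the application under test.
    return "The company reported an 8% rise in Q3 2024."


@experiment()
async def summary_experiment(row: dict):
    # Assumed dict-like row; adapt the access pattern to your dataset schema.
    response = my_app(row["user_input"])
    score = await metric.ascore(user_input=row["user_input"], response=response)
    # The returned dict becomes one result row of the experiment.
    return {**row, "response": response, "score": score.value}

# Typical usage (assumed): results = await summary_experiment.arun(dataset)
```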

docs/concepts/index.md

Lines changed: 12 additions & 17 deletions
@@ -3,41 +3,36 @@

 <div class="grid cards" markdown>

-- :material-widgets:{ .lg .middle } [__Components Guides__](components/index.md)
+- :material-flask-outline:{ .lg .middle } [__Experimentation__](experimentation.md)

     ---

-    Discover the various components used within Ragas.
-
-    Components like [Prompt Object](components/prompt.md), [Evaluation Dataset](components/eval_dataset.md) and [more..](components/index.md)
+    Learn how to systematically evaluate your AI applications using experiments.

+    Track changes, measure improvements, and compare results across different versions of your application.

-- ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md)
+- :material-database-export:{ .lg .middle } [__Datasets__](datasets.md)

     ---

-    Explore available metrics and understand how they work.
+    Understand how to create, manage, and use evaluation datasets.

-    Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics).
+    Learn about dataset structure, storage backends, and best practices for maintaining your test data.

-- :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md)
+- ::material-ruler-square:{ .lg .middle } [__Ragas Metrics__](metrics/index.md)

     ---

-    Generate high-quality datasets for comprehensive testing.
-
-    Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md)
+    Use our library of [available metrics](metrics/available_metrics/index.md) or create [custom metrics](metrics/overview/index.md) tailored to your use case.

+    Metrics for evaluating [RAG](metrics/available_metrics/index.md#retrieval-augmented-generation), [Agentic workflows](metrics/available_metrics/index.md#agents-or-tool-use-cases) and [more..](metrics/available_metrics/index.md#list-of-available-metrics).

-- :material-chart-box-outline:{ .lg .middle } [__Feedback Intelligence__](feedback/index.md)
+- :material-database-plus:{ .lg .middle } [__Test Data Generation__](test_data_generation/index.md)

     ---

-    Leverage signals from production data to gain actionable insights.
-
-    Learn about to leveraging implicit and explicit signals from production data.
-
-
+    Generate high-quality datasets for comprehensive testing.

+    Algorithms for synthesizing data to test [RAG](test_data_generation/rag.md), [Agentic workflows](test_data_generation/agents.md)

 </div>

docs/concepts/metrics/available_metrics/answer_correctness.md

Lines changed: 58 additions & 11 deletions
@@ -16,20 +16,44 @@ Answer correctness encompasses two critical aspects: semantic similarity between
 ### Example

 ```python
-from datasets import Dataset
-from ragas.metrics import answer_correctness
-from ragas import evaluate
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerCorrectness
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client)
+
+# Create metric
+scorer = AnswerCorrectness(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967",
+    reference="The first superbowl was held on January 15, 1967"
+)
+print(f"Answer Correctness Score: {result.value}")
+```

-data_samples = {
-    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
-    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
-    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
-}
-dataset = Dataset.from_dict(data_samples)
-score = evaluate(dataset,metrics=[answer_correctness])
-score.to_pandas()
+Output:

 ```
+Answer Correctness Score: 0.95
+```
+
+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967",
+        reference="The first superbowl was held on January 15, 1967"
+    )
+    ```

 ### Calculation

@@ -57,3 +81,26 @@ Next, we calculate the semantic similarity between the generated answer and the

 Once we have the semantic similarity, we take a weighted average of the semantic similarity and the factual similarity calculated above to arrive at the final score. You can adjust this weightage by modifying the `weights` parameter.

+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with Dataset
+
+```python
+from datasets import Dataset
+from ragas.metrics import answer_correctness
+from ragas import evaluate
+
+data_samples = {
+    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
+    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
+    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
+}
+dataset = Dataset.from_dict(data_samples)
+score = evaluate(dataset,metrics=[answer_correctness])
+score.to_pandas()
+```
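One detail worth making concrete from the context line kept in this hunk: the final answer correctness score is a weighted average of factual similarity and semantic similarity, controlled by the `weights` parameter. A tiny sketch of that combination; the 0.75/0.25 split mirrors the legacy metric's commonly documented default, but treat it as an assumption and the input scores as made-up numbers.

```python
# Minimal sketch of the weighted combination described above.
factuality_score = 0.80       # illustrative factual-similarity score
semantic_similarity = 1.00    # illustrative embedding-similarity score

weights = [0.75, 0.25]        # assumed default; adjustable via the `weights` parameter
answer_correctness = weights[0] * factuality_score + weights[1] * semantic_similarity
print(round(answer_correctness, 2))  # 0.85 with the numbers above
```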

docs/concepts/metrics/available_metrics/answer_relevance.md

Lines changed: 69 additions & 19 deletions
@@ -1,6 +1,8 @@
-## Response Relevancy
+## Answer Relevancy

-The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.
+The **Answer Relevancy** metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input.
+
+An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.

 This metric is calculated using the `user_input` and the `response` as follows:

@@ -19,34 +21,50 @@
 Where:
 - $E_{g_i}$: Embedding of the $i^{th}$ generated question.
 - $E_o$: Embedding of the user input.
-- $N$: Number of generated questions (default is 3).
+- $N$: Number of generated questions (default is 3, configurable via `strictness` parameter).

 **Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

-An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
-
 ### Example

 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import ResponseRelevancy
+from openai import AsyncOpenAI
+from ragas.llms import llm_factory
+from ragas.embeddings.base import embedding_factory
+from ragas.metrics.collections import AnswerRelevancy
+
+# Setup LLM and embeddings
+client = AsyncOpenAI()
+llm = llm_factory("gpt-4o-mini", client=client)
+embeddings = embedding_factory("openai", model="text-embedding-3-small", client=client, interface="modern")
+
+# Create metric
+scorer = AnswerRelevancy(llm=llm, embeddings=embeddings)
+
+# Evaluate
+result = await scorer.ascore(
+    user_input="When was the first super bowl?",
+    response="The first superbowl was held on Jan 15, 1967"
+)
+print(f"Answer Relevancy Score: {result.value}")
+```

-sample = SingleTurnSample(
-        user_input="When was the first super bowl?",
-        response="The first superbowl was held on Jan 15, 1967",
-        retrieved_contexts=[
-            "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
-        ]
-    )
+Output:

-scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
-await scorer.single_turn_ascore(sample)
-```
-Output
 ```
-0.9165088378587264
+Answer Relevancy Score: 0.9165088378587264
 ```

+!!! note "Synchronous Usage"
+    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:
+
+    ```python
+    result = scorer.score(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967"
+    )
+    ```
+
 ### How It’s Calculated

 !!! example
@@ -67,3 +85,35 @@ To calculate the relevance of the answer to the given question, we follow two st
 - **Step 2:** Calculate the mean cosine similarity between the generated questions and the actual question.

 The underlying concept is that if the answer correctly addresses the question, it is highly probable that the original question can be reconstructed solely from the answer.
+
+
+## Legacy Metrics API
+
+The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.
+
+!!! warning "Deprecation Timeline"
+    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.
+
+### Example with SingleTurnSample
+
+```python
+from ragas import SingleTurnSample
+from ragas.metrics import ResponseRelevancy
+
+sample = SingleTurnSample(
+        user_input="When was the first super bowl?",
+        response="The first superbowl was held on Jan 15, 1967",
+        retrieved_contexts=[
+            "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
+        ]
+    )
+
+scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
+await scorer.single_turn_ascore(sample)
+```
+
+Output:
+
+```
+0.9165088378587264
+```
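To make the formula in this page's "Where:" block concrete: the metric embeds the N questions generated from the response and averages their cosine similarity against the embedding of the original user input. A small self-contained sketch of that averaging step; the embeddings here are toy vectors, not real model output.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# E_o: embedding of the original user input (toy 3-d vector for illustration).
user_input_emb = np.array([0.9, 0.1, 0.2])

# E_{g_i}: embeddings of N questions generated from the response (N defaults to 3).
generated_question_embs = [
    np.array([0.88, 0.12, 0.25]),
    np.array([0.85, 0.15, 0.18]),
    np.array([0.80, 0.20, 0.30]),
]

# Answer relevancy = mean cosine similarity, as in the formula above.
answer_relevancy = np.mean(
    [cosine_similarity(e, user_input_emb) for e in generated_question_embs]
)
print(f"Answer Relevancy (toy example): {answer_relevancy:.3f}")
```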
