
Commit 6105b31

Merge branch 'main' into rbhatnagar/migrate_metrics_7
2 parents: 4b094d9 + 09d22fc

72 files changed: +1860 / -703 lines changed

Makefile

Lines changed: 2 additions & 2 deletions
@@ -161,7 +161,7 @@ build-docs: ## Build all documentation
 	@echo "Converting ipynb notebooks to md files..."
 	$(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py
 	@echo "Building ragas documentation..."
-	$(Q)uv run --group docs mkdocs build
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs build
 
 serve-docs: ## Build and serve documentation locally
-	$(Q)uv run --group docs mkdocs serve --dirtyreload
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload

README.md

Lines changed: 24 additions & 6 deletions
@@ -97,21 +97,39 @@ Available templates:
 
 ### Evaluate your LLM App
 
-This is 5 main lines:
+This is a simple example evaluating a summary for accuracy:
 
 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import AspectCritic
+import asyncio
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
 
+# Setup your LLM
+llm = llm_factory("gpt-4o")
+
+# Create a metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=llm
+)
+
+# Evaluate
 test_data = {
     "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
     "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
 }
-evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
-metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.")
-await metric.single_turn_ascore(SingleTurnSample(**test_data))
+
+score = await metric.ascore(
+    user_input=test_data["user_input"],
+    response=test_data["response"]
+)
+print(f"Score: {score.value}")
+print(f"Reason: {score.reason}")
 ```
 
+> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set.
+
 Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals)
 
 ## Want help in improving your AI application using evals?
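The committed snippet uses top-level `await`, which runs as-is in a notebook or async REPL. As a minimal sketch (not part of this commit, and built only from the imports and calls shown in the diff above), the same evaluation could be wrapped with `asyncio.run` for a plain Python script:

```python
import asyncio

from ragas.llms import llm_factory
from ragas.metrics.collections import AspectCritic


async def main() -> None:
    # Same setup as the README example above.
    llm = llm_factory("gpt-4o")
    metric = AspectCritic(
        name="summary_accuracy",
        definition="Verify if the summary is accurate and captures key information.",
        llm=llm,
    )

    # Shortened inputs for illustration; the README passes the full Q3 2024 summary text.
    score = await metric.ascore(
        user_input="summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market.",
        response="The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies.",
    )
    print(f"Score: {score.value}")
    print(f"Reason: {score.reason}")


if __name__ == "__main__":
    # Requires OPENAI_API_KEY to be set, as the README note points out.
    asyncio.run(main())
```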

docs/concepts/components/eval_dataset.md

Lines changed: 2 additions & 1 deletion
@@ -68,6 +68,7 @@ sample3 = SingleTurnSample(
 ```
 
 **Step 3:** Create the EvaluationDataset
+
 Create an EvaluationDataset by passing a list of SingleTurnSample instances.
 
 ```python
@@ -91,4 +92,4 @@ Load the dataset into a Ragas EvaluationDataset object.
 from ragas import EvaluationDataset
 
 eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
-```
+```
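The "Step 3" paragraph touched by this hunk describes building an `EvaluationDataset` by passing a list of `SingleTurnSample` instances. A minimal sketch of that pattern (illustrative only, not part of the commit; the sample contents here are made up):

```python
from ragas import EvaluationDataset
from ragas.dataset_schema import SingleTurnSample

# Hypothetical samples standing in for sample1..sample3 from the doc.
sample1 = SingleTurnSample(
    user_input="What caused the Q3 2024 revenue rise?",
    response="Strong performance in the Asian market.",
)
sample2 = SingleTurnSample(
    user_input="Is the growth expected to continue?",
    response="Yes, analysts expect the trend to continue next quarter.",
)

# Step 3: create the EvaluationDataset from a list of samples.
eval_dataset = EvaluationDataset(samples=[sample1, sample2])
print(len(eval_dataset.samples))  # 2
```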

docs/concepts/metrics/available_metrics/index.md

Lines changed: 3 additions & 1 deletion
@@ -21,8 +21,9 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 
 ## Agents or Tool use cases
 
-- [Topic adherence](agents.md#topic_adherence)
+- [Topic adherence](agents.md#topic-adherence)
 - [Tool call Accuracy](agents.md#tool-call-accuracy)
+- [Tool Call F1](agents.md#tool-call-f1)
 - [Agent Goal Accuracy](agents.md#agent-goal-accuracy)
 
 ## Natural Language Comparison
@@ -31,6 +32,7 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 - [Semantic Similarity](semantic_similarity.md)
 - [Non LLM String Similarity](traditional.md#non-llm-string-similarity)
 - [BLEU Score](traditional.md#bleu-score)
+- [CHRF Score](traditional.md#chrf-score)
 - [ROUGE Score](traditional.md#rouge-score)
 - [String Presence](traditional.md#string-presence)
 - [Exact Match](traditional.md#exact-match)

docs/concepts/metrics/available_metrics/nvidia_metrics.md

Lines changed: 6 additions & 3 deletions
@@ -78,10 +78,13 @@ Thus, the final **Answer Accuracy** score is **1**.
 
 **Context Relevance** evaluates whether the **retrieved_contexts** (chunks or passages) are pertinent to the **user_input**. This is done via two independent "LLM-as-a-Judge" prompt calls that each rate the relevance on a scale of **0, 1, or 2**. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.
 
-- **0** → The retrieved contexts are not relevant to the user’s query at all.
+- **0** → The retrieved contexts are not relevant to the user's query at all.
 - **1** → The contexts are partially relevant.
 - **2** → The contexts are completely relevant.
 
+### Implementation Note
+
+**Difference from Original Paper:** The original Ragas paper defines Context Relevance using sentence-level extraction (CR = number of relevant sentences / total sentences), but the current implementation uses a more robust discrete judgment approach. Each LLM is asked to rate overall context relevance on a 0-2 scale, which is more efficient and less prone to sentence boundary errors. This was an intentional design decision to improve reliability and reduce computational overhead while maintaining the core evaluation objective.
 
 ```python
 from ragas.dataset_schema import SingleTurnSample
@@ -104,9 +107,9 @@ Output
 1.0
 ```
 
-### How It’s Calculated
+### How It's Calculated
 
-**Step 1:** The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts concerning the user's query. Each prompt returns a relevance rating of **0**, **1**, or **2**.
+**Step 1:** The LLM is prompted with two distinct templates (template_relevance1 and template_relevance2) to evaluate the relevance of the retrieved contexts concerning the user's query. Each prompt returns a relevance rating of **0**, **1**, or **2**. Using two independent evaluations provides robustness and helps mitigate individual LLM biases.
 
 **Step 2:** Each rating is normalized to a [0,1] scale by dividing by 2. If both ratings are valid, the final score is the average of these normalized values; if only one is valid, that score is used.
 