Commit 66285d1
docs: add docs for new metrics (#1292)

# General Purpose Metrics

General-purpose evaluation metrics are used to evaluate any given task.

## Aspect Critic

Aspect Critic is an evaluation metric that evaluates responses based on predefined aspects written in free-form natural language. The output of an aspect critique is binary, indicating whether the response aligns with the defined aspect or not.

### Example

```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.critique import AspectCritique

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
)

critic = AspectCritique(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
)
await critic.single_turn_ascore(sample)
```

### How it works

Critics are essentially basic LLM calls using the defined criteria. For example, here is how the harmfulness critic works:

- Step 1: The definition of the critic prompts the LLM multiple times to verify whether the answer contains anything harmful. This is done using a specific query.
    - For harmfulness, the query is: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
    - Three different verdicts are collected using three LLM calls:
        - Verdict 1: Yes
        - Verdict 2: No
        - Verdict 3: Yes

- Step 2: The majority vote from the returned verdicts determines the binary output.
    - Output: Yes

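The majority-vote aggregation described in Step 2 can be sketched in plain Python. This is only an illustration of the idea; the `verdicts` list and the `majority_vote` helper are hypothetical and not part of the ragas API:

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    # Count how often each verdict appears and return the most common one.
    # With an odd number of calls (e.g. 3) and binary Yes/No verdicts,
    # a tie cannot occur.
    counts = Counter(verdicts)
    return counts.most_common(1)[0][0]

# Three verdicts collected from three separate LLM calls, as in the example above.
verdicts = ["Yes", "No", "Yes"]
print(majority_vote(verdicts))  # Yes
```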
## Simple Criteria Scoring

Coarse-grained evaluation is a metric that can be used to score responses with an integer based on a single predefined free-form scoring criterion. The output of coarse-grained evaluation is an integer score within the range specified in the criterion.

**Without Reference**

```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._simple_criteria import SimpleCriteriaScoreWithoutReference


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

scorer = SimpleCriteriaScoreWithoutReference(
    name="coarse_grained_score",
    definition="Score 0 to 5 for correctness",
)
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```

**With Reference**

```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._simple_criteria import SimpleCriteriaScoreWithReference


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Egypt",
)

scorer = SimpleCriteriaScoreWithReference(
    name="coarse_grained_score",
    definition="Score 0 to 5 by similarity",
)
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```
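The metric expects the integer score to fall within the range stated in the criterion (here, 0 to 5). As a rough sketch of how a raw model reply could be parsed and kept within that range, assuming the model returns the score somewhere in free text (the `parse_score` helper is hypothetical, not part of the ragas API):

```python
import re

def parse_score(reply: str, low: int = 0, high: int = 5) -> int:
    # Extract the first integer in the model's reply and clamp it to the
    # range declared in the scoring criterion.
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no integer score found in reply: {reply!r}")
    return max(low, min(high, int(match.group())))

print(parse_score("Score: 4, mostly correct"))  # 4
print(parse_score("7"))                         # clamped to 5
```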


## Rubrics-Based Criteria Scoring

Domain-specific evaluation is a rubric-based metric used to evaluate responses on a specific domain. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric has both reference-free and reference-based variants.

### With Reference

Used when you have a reference answer to evaluate the responses against.

#### Example
```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._domain_specific_rubrics import RubricsScoreWithReference

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
)

rubrics = {
    "score1_description": "The response is incorrect, irrelevant, or does not align with the ground truth.",
    "score2_description": "The response partially matches the ground truth but includes significant errors, omissions, or irrelevant information.",
    "score3_description": "The response generally aligns with the ground truth but may lack detail, clarity, or have minor inaccuracies.",
    "score4_description": "The response is mostly accurate and aligns well with the ground truth, with only minor issues or missing details.",
    "score5_description": "The response is fully accurate, aligns completely with the ground truth, and is clear and detailed.",
}

scorer = RubricsScoreWithReference(rubrics=rubrics)
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have a reference answer to evaluate the responses against.

#### Example
```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._domain_specific_rubrics import RubricsScoreWithoutReference

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

scorer = RubricsScoreWithoutReference()
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```
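Under the hood, rubric-based scoring amounts to presenting the score descriptions to the LLM alongside the sample. ragas builds the actual prompt internally; as a rough illustration of the idea, a rubric dict like the one above could be rendered into a scoring prompt along these lines (the `render_rubric_prompt` helper is hypothetical, not part of the ragas API):

```python
def render_rubric_prompt(user_input: str, response: str, rubrics: dict[str, str]) -> str:
    # List each score description on its own line, e.g. "score1: ...".
    rubric_lines = "\n".join(
        f"{name.removesuffix('_description')}: {desc}" for name, desc in rubrics.items()
    )
    return (
        "Score the response using the rubric below.\n"
        f"Rubric:\n{rubric_lines}\n"
        f"Question: {user_input}\n"
        f"Response: {response}\n"
        "Return only the integer score."
    )

prompt = render_rubric_prompt(
    "Where is the Eiffel Tower located?",
    "The Eiffel Tower is located in Paris.",
    {"score1_description": "Incorrect.", "score5_description": "Fully accurate."},
)
print(prompt)
```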


## Instance-Specific Rubrics Criteria Scoring

Instance-specific evaluation is a rubric-based metric used to evaluate responses on a per-instance basis, i.e., each instance to be evaluated is annotated with its own rubric-based evaluation criteria. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric has both reference-free and reference-based variants. This scoring method is useful when evaluating each instance in your dataset requires highly customized evaluation criteria.

### With Reference

Used when you have a reference answer to evaluate the responses against.

#### Example
```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._domain_specific_rubrics import InstanceRubricsWithReference


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    rubrics={
        "score1": "The response is completely incorrect or irrelevant (e.g., 'The Eiffel Tower is in London.' or no mention of the Eiffel Tower).",
        "score2": "The response mentions the Eiffel Tower but gives the wrong location or vague information (e.g., 'The Eiffel Tower is in Europe.' or 'It is in France.' without specifying Paris).",
        "score3": "The response provides the correct city but with minor factual or grammatical issues (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'The tower is located at Paris.').",
        "score4": "The response is correct but lacks some clarity or extra detail (e.g., 'The Eiffel Tower is in Paris, France.' without other useful context or slightly awkward phrasing).",
        "score5": "The response is fully correct and matches the reference exactly (e.g., 'The Eiffel Tower is located in Paris.' with no errors or unnecessary details).",
    },
)

scorer = InstanceRubricsWithReference()
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have a reference answer to evaluate the responses against.

#### Example
```{code-block} python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._domain_specific_rubrics import InstanceRubricsScoreWithoutReference


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    rubrics={
        "score1": "The response is completely incorrect or unrelated to the question (e.g., 'The Eiffel Tower is in New York.' or talking about something entirely irrelevant).",
        "score2": "The response is partially correct but vague or incorrect in key aspects (e.g., 'The Eiffel Tower is in France.' without mentioning Paris, or a similar incomplete location).",
        "score3": "The response provides the correct location but with some factual inaccuracies or awkward phrasing (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'It is located in Paris, which is a country.').",
        "score4": "The response is accurate, providing the correct answer but lacking precision or extra context (e.g., 'The Eiffel Tower is in Paris, France.' or a minor phrasing issue).",
        "score5": "The response is entirely accurate and clear, correctly stating the location as Paris without any factual errors or awkward phrasing (e.g., 'The Eiffel Tower is located in Paris.').",
    },
)

scorer = InstanceRubricsScoreWithoutReference()
scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```
