
Commit f14cd85

feat: make general purpose metrics more general (#1666)
## Metrics Converted

- [x] Aspect Critic
- [x] Simple Criteria
- [x] Rubric Based - both Instance and Domain specific

a few different examples

### Aspect Critic

```py
from ragas.metrics import AspectCritic
from ragas.dataset_schema import SingleTurnSample

only_response = SingleTurnSample(
    response="The Eiffel Tower is located in Paris."
)

grammar_critic = AspectCritic(
    name="grammar",
    definition="Is the response grammatically correct?",
    llm=evaluator_llm
)

await grammar_critic.single_turn_ascore(only_response)
```

with reference

```py
answer_correctness_critic = AspectCritic(
    name="answer_correctness",
    definition="Is the response and reference answer are the same?",
    llm=evaluator_llm
)

# data row
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="London"
)

await answer_correctness_critic.single_turn_ascore(sample)
```

**Note:** this only works for multi-turn metrics for now
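The other converted metrics follow the same unified pattern. A minimal sketch, assuming the `evaluator_llm` and `sample` defined in the example above, and an abbreviated, illustrative rubric wording:

```py
from ragas.metrics import SimpleCriteriaScore, RubricsScore

# coarse-grained integer scoring against a single free-form criterion
similarity_scorer = SimpleCriteriaScore(
    name="course_grained_score",
    definition="Score 0 to 5 by similarity",
    llm=evaluator_llm,
)

# rubric-based scoring: one description per score level (wording here is illustrative)
rubrics = {
    "score1_description": "The response is completely incorrect or irrelevant.",
    "score2_description": "The response is mostly incorrect.",
    "score3_description": "The response is partially correct.",
    "score4_description": "The response is mostly correct with minor issues.",
    "score5_description": "The response is fully correct and aligns with the reference.",
}
rubric_scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)

await similarity_scorer.single_turn_ascore(sample)
await rubric_scorer.single_turn_ascore(sample)
```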
1 parent 29f70cf commit f14cd85

27 files changed · +1173 −1582 lines changed

docs/concepts/metrics/available_metrics/general_purpose.md

Lines changed: 11 additions & 110 deletions
````diff
@@ -6,7 +6,6 @@ General purpose evaluation metrics are used to evaluate any given task.
 
 `AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects in free form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
 
-**Without reference**
 
 ### Example
 
@@ -28,32 +27,6 @@ scorer = AspectCritic(
 await scorer.single_turn_ascore(sample)
 ```
 
-**With reference**
-
-### Example
-
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import AspectCriticWithReference
-
-
-sample = SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
-    response="The Eiffel Tower is located in Paris.",
-    reference="The Eiffel Tower is located in Paris.",
-)
-
-scorer = AspectCritic(
-    name="correctness",
-    definition="Is the response factually similar to the reference?",
-    llm=evaluator_llm
-
-)
-
-await scorer.single_turn_ascore(sample)
-
-```
-
 ### How it works
 
 Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
@@ -74,41 +47,22 @@ Critics are essentially basic LLM calls using the defined criteria. For example,
 
 Course graned evaluation method is an evaluation metric that can be used to score (integer) responses based on predefined single free form scoring criteria. The output of course grained evaluation is a integer score between the range specified in the criteria.
 
-**Without Reference**
-
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import SimpleCriteriaScoreWithoutReference
-
-
-sample = SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
-    response="The Eiffel Tower is located in Paris.",
-)
-
-scorer = SimpleCriteriaScoreWithoutReference(name="course_grained_score",
-    definition="Score 0 to 5 for correctness",
-    llm=evaluator_llm
-)
-await scorer.single_turn_ascore(sample)
-```
-
-**With Reference**
-
 ```python
 from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import SimpleCriteriaScoreWithReference
+from ragas.metrics import SimpleCriteriaScore
 
 
 sample = SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
+    user_input="Where is the Eiffel Tower located?",
     response="The Eiffel Tower is located in Paris.",
     reference="The Eiffel Tower is located in Egypt"
 )
 
-scorer = SimpleCriteriaScoreWithReference(name="course_grained_score",
-    definition="Score 0 to 5 by similarity",
-    llm=evaluator_llm)
+scorer = SimpleCriteriaScore(
+    name="course_grained_score",
+    definition="Score 0 to 5 by similarity",
+    llm=evaluator_llm
+)
 
 await scorer.single_turn_ascore(sample)
 ```
@@ -117,14 +71,10 @@ await scorer.single_turn_ascore(sample)
 
 Domain specific evaluation metric is a rubric-based evaluation metric that is used to evaluate responses on a specific domain. The rubric consists of descriptions for each score, typically ranging from 1 to 5. The response here is evaluation and scored using the LLM using description specified in the rubric. This metric also have reference free and reference based variations.
 
-### With Reference
-
-Used when you have reference answer to evaluate the responses against.
-
 #### Example
 ```python
 from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import RubricsScoreWithReference
+from ragas.metrics import RubricsScore
 sample = SingleTurnSample(
     user_input="Where is the Eiffel Tower located?",
     response="The Eiffel Tower is located in Paris.",
@@ -137,67 +87,18 @@ rubrics = {
     "score4_description": "The response is mostly accurate and aligns well with the ground truth, with only minor issues or missing details.",
     "score5_description": "The response is fully accurate, aligns completely with the ground truth, and is clear and detailed.",
 }
-scorer = RubricsScoreWithReference(rubrics=rubrics, llm=evaluator_llm)
+scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
 await scorer.single_turn_ascore(sample)
 ```
 
-### Without Reference
-
-Used when you don't have reference answer to evaluate the responses against.
-
-#### Example
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import RubricsScoreWithoutReference
-sample = SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
-    response="The Eiffel Tower is located in Paris.",
-)
-
-scorer = RubricsScoreWithoutReference(rubrics=rubrics, llm=evaluator_llm)
-await scorer.single_turn_ascore(sample)
-```
-
-
 ## Instance Specific rubrics criteria scoring
 
 Instance specific evaluation metric is a rubric-based evaluation metric that is used to evaluate responses on a specific instance, ie each instance to be evaluated is annotated with a rubric based evaluation criteria. The rubric consists of descriptions for each score, typically ranging from 1 to 5. The response here is evaluation and scored using the LLM using description specified in the rubric. This metric also have reference free and reference based variations. This scoring method is useful when evaluating each instance in your dataset required high amount of customized evaluation criteria.
 
-### With Reference
-
-Used when you have reference answer to evaluate the responses against.
-
-#### Example
-```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import InstanceRubricsWithReference
-
-
-SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
-    response="The Eiffel Tower is located in Paris.",
-    reference="The Eiffel Tower is located in Paris.",
-    rubrics = {
-        "score1": "The response is completely incorrect or irrelevant (e.g., 'The Eiffel Tower is in London.' or no mention of the Eiffel Tower).",
-        "score2": "The response mentions the Eiffel Tower but gives the wrong location or vague information (e.g., 'The Eiffel Tower is in Europe.' or 'It is in France.' without specifying Paris).",
-        "score3": "The response provides the correct city but with minor factual or grammatical issues (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'The tower is located at Paris.').",
-        "score4": "The response is correct but lacks some clarity or extra detail (e.g., 'The Eiffel Tower is in Paris, France.' without other useful context or slightly awkward phrasing).",
-        "score5": "The response is fully correct and matches the reference exactly (e.g., 'The Eiffel Tower is located in Paris.' with no errors or unnecessary details)."
-    }
-)
-
-scorer = InstanceRubricsWithReference(llm=evaluator_llm)
-await scorer.single_turn_ascore(sample)
-```
-
-### Without Reference
-
-Used when you don't have reference answer to evaluate the responses against.
-
 #### Example
 ```python
 from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import InstanceRubricsScoreWithoutReference
+from ragas.metrics import InstanceRubricsScore
 
 
 SingleTurnSample(
@@ -212,6 +113,6 @@ SingleTurnSample(
     }
 )
 
-scorer = InstanceRubricsScoreWithoutReference(llm=evaluator_llm)
+scorer = InstanceRubricsScore(llm=evaluator_llm)
 await scorer.single_turn_ascore(sample)
 ```
````
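With this change a single `InstanceRubricsScore` class covers both the reference-based and reference-free cases removed above. A minimal reference-free sketch, assuming `evaluator_llm` is configured and using abbreviated, illustrative rubric wording:

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubricsScore

# each sample carries its own rubric (descriptions abbreviated for illustration)
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    rubrics={
        "score1": "The response is completely incorrect or irrelevant.",
        "score2": "The response gives a wrong or vague location.",
        "score3": "The response names the correct city but with minor issues.",
        "score4": "The response is correct but lacks clarity or detail.",
        "score5": "The response is fully correct and clearly stated.",
    },
)

scorer = InstanceRubricsScore(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```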

docs/howtos/customizations/metrics/_cost.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -13,6 +13,7 @@ For an example here is one that will parse OpenAI by using a parser we have defi
 
 ```python
 import os
+
 os.environ["OPENAI_API_KEY"] = "your-api-key"
 ```
 
@@ -61,8 +62,6 @@ metric = AspectCriticWithReference(
     name="answer_correctness",
     definition="is the response correct compared to reference",
 )
-
-
 ```
 
 Repo card metadata block was not found. Setting CardData to empty.
@@ -73,8 +72,12 @@ metric = AspectCriticWithReference(
 from ragas import evaluate
 from ragas.cost import get_token_usage_for_openai
 
-results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,
-    token_usage_parser=get_token_usage_for_openai,)
+results = evaluate(
+    eval_dataset[:5],
+    metrics=[metric],
+    llm=gpt4o,
+    token_usage_parser=get_token_usage_for_openai,
+)
 ```
 
 Evaluating: 100%|██████████| 5/5 [00:01<00:00, 2.81it/s]
````
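Token usage parsed this way can then be turned into a cost estimate. A rough sketch, assuming the `total_tokens()` / `total_cost()` helpers described in the ragas cost docs and placeholder per-token prices:

```python
# read back the parsed usage and estimate cost; prices below are placeholders
print(results.total_tokens())
print(
    results.total_cost(
        cost_per_input_token=5 / 1e6,
        cost_per_output_token=15 / 1e6,
    )
)
```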

docs/howtos/customizations/metrics/_write_your_own_metric.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -90,9 +90,9 @@ Now lets init the metric with the rubric and evaluator llm and evaluate the data
 
 
 ```python
-from ragas.metrics import RubricsScoreWithoutReference
+from ragas.metrics import RubricsScore
 
-hallucinations_rubric = RubricsScoreWithoutReference(
+hallucinations_rubric = RubricsScore(
     name="hallucinations_rubric", llm=evaluator_llm, rubrics=rubric
 )
 
````
docs/howtos/customizations/metrics/cost.ipynb

Lines changed: 8 additions & 4 deletions
```diff
@@ -29,6 +29,7 @@
 "outputs": [],
 "source": [
 "import os\n",
+"\n",
 "os.environ[\"OPENAI_API_KEY\"] = \"your-api-key\""
 ]
 },
@@ -105,8 +106,7 @@
 "metric = AspectCriticWithReference(\n",
 "    name=\"answer_correctness\",\n",
 "    definition=\"is the response correct compared to reference\",\n",
-")\n",
-"\n"
+")"
 ]
 },
 {
@@ -126,8 +126,12 @@
 "from ragas import evaluate\n",
 "from ragas.cost import get_token_usage_for_openai\n",
 "\n",
-"results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,\n",
-"    token_usage_parser=get_token_usage_for_openai,)"
+"results = evaluate(\n",
+"    eval_dataset[:5],\n",
+"    metrics=[metric],\n",
+"    llm=gpt4o,\n",
+"    token_usage_parser=get_token_usage_for_openai,\n",
+")"
 ]
 },
 {
```

docs/howtos/customizations/metrics/write_your_own_metric.ipynb

Lines changed: 2 additions & 2 deletions
```diff
@@ -160,9 +160,9 @@
 }
 ],
 "source": [
-"from ragas.metrics import RubricsScoreWithoutReference\n",
+"from ragas.metrics import RubricsScore\n",
 "\n",
-"hallucinations_rubric = RubricsScoreWithoutReference(\n",
+"hallucinations_rubric = RubricsScore(\n",
 "    name=\"hallucinations_rubric\", llm=evaluator_llm, rubrics=rubric\n",
 ")\n",
 "\n",
```

docs/howtos/customizations/testgenerator/_persona_generator.md

Lines changed: 12 additions & 4 deletions
````diff
@@ -14,9 +14,18 @@ Which we can define as follows:
 ```python
 from ragas.testset.persona import Persona
 
-persona_new_joinee = Persona(name="New Joinee", role_description="Don't know much about the company and is looking for information on how to get started.")
-persona_manager = Persona(name="Manager", role_description="Wants to know about the different teams and how they collaborate with each other.")
-persona_senior_manager = Persona(name="Senior Manager", role_description="Wants to know about the company vision and how it is executed.")
+persona_new_joinee = Persona(
+    name="New Joinee",
+    role_description="Don't know much about the company and is looking for information on how to get started.",
+)
+persona_manager = Persona(
+    name="Manager",
+    role_description="Wants to know about the different teams and how they collaborate with each other.",
+)
+persona_senior_manager = Persona(
+    name="Senior Manager",
+    role_description="Wants to know about the company vision and how it is executed.",
+)
 
 personas = [persona_new_joinee, persona_manager, persona_senior_manager]
 personas
@@ -49,7 +58,6 @@ testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas,
 # Generate the Testset
 testset = testset_generator.generate(testset_size=10)
 testset
-
 ```
 
 
````
docs/howtos/customizations/testgenerator/persona_generator.ipynb

Lines changed: 13 additions & 4 deletions
```diff
@@ -38,9 +38,18 @@
 "source": [
 "from ragas.testset.persona import Persona\n",
 "\n",
-"persona_new_joinee = Persona(name=\"New Joinee\", role_description=\"Don't know much about the company and is looking for information on how to get started.\")\n",
-"persona_manager = Persona(name=\"Manager\", role_description=\"Wants to know about the different teams and how they collaborate with each other.\")\n",
-"persona_senior_manager = Persona(name=\"Senior Manager\", role_description=\"Wants to know about the company vision and how it is executed.\")\n",
+"persona_new_joinee = Persona(\n",
+"    name=\"New Joinee\",\n",
+"    role_description=\"Don't know much about the company and is looking for information on how to get started.\",\n",
+")\n",
+"persona_manager = Persona(\n",
+"    name=\"Manager\",\n",
+"    role_description=\"Wants to know about the different teams and how they collaborate with each other.\",\n",
+")\n",
+"persona_senior_manager = Persona(\n",
+"    name=\"Senior Manager\",\n",
+"    role_description=\"Wants to know about the company vision and how it is executed.\",\n",
+")\n",
 "\n",
 "personas = [persona_new_joinee, persona_manager, persona_senior_manager]\n",
 "personas"
@@ -72,7 +81,7 @@
 "testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas, llm=llm)\n",
 "# Generate the Testset\n",
 "testset = testset_generator.generate(testset_size=10)\n",
-"testset\n"
+"testset"
 ]
 },
 {
```

docs/howtos/integrations/_langgraph_agent_evaluation.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -289,7 +289,7 @@ ragas_trace = convert_to_ragas_messages(result["messages"])
 
 
 ```python
-ragas_trace # List of Ragas messages
+ragas_trace  # List of Ragas messages
 ```
 
 
````
docs/howtos/integrations/langchain.ipynb

Lines changed: 25 additions & 4 deletions
```diff
@@ -25,7 +25,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 1,
 "id": "fb5deb25",
 "metadata": {},
 "outputs": [],
@@ -59,10 +59,31 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
 "id": "4aa9a986",
 "metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/home/jjmachan/.pyenv/versions/ragas/lib/python3.10/site-packages/langchain/indexes/vectorstore.py:128: UserWarning: Using InMemoryVectorStore as the default vectorstore.This memory store won't persist data. You should explicitlyspecify a vectorstore when using VectorstoreIndexCreator\n",
+" warnings.warn(\n"
+]
+},
+{
+"ename": "ValidationError",
+"evalue": "1 validation error for VectorstoreIndexCreator\nembedding\n Field required [type=missing, input_value={}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.9/v/missing",
+"output_type": "error",
+"traceback": [
+"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+"\u001b[0;31mValidationError\u001b[0m Traceback (most recent call last)",
+"Cell \u001b[0;32mIn[2], line 7\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mlangchain_openai\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ChatOpenAI\n\u001b[1;32m 6\u001b[0m loader \u001b[38;5;241m=\u001b[39m TextLoader(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./nyc_wikipedia/nyc_text.txt\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m----> 7\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43mVectorstoreIndexCreator\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39mfrom_loaders([loader])\n\u001b[1;32m 10\u001b[0m llm \u001b[38;5;241m=\u001b[39m ChatOpenAI(temperature\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 11\u001b[0m qa_chain \u001b[38;5;241m=\u001b[39m RetrievalQA\u001b[38;5;241m.\u001b[39mfrom_chain_type(\n\u001b[1;32m 12\u001b[0m llm,\n\u001b[1;32m 13\u001b[0m retriever\u001b[38;5;241m=\u001b[39mindex\u001b[38;5;241m.\u001b[39mvectorstore\u001b[38;5;241m.\u001b[39mas_retriever(),\n\u001b[1;32m 14\u001b[0m return_source_documents\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m,\n\u001b[1;32m 15\u001b[0m )\n",
+"File \u001b[0;32m~/.pyenv/versions/ragas/lib/python3.10/site-packages/pydantic/main.py:212\u001b[0m, in \u001b[0;36mBaseModel.__init__\u001b[0;34m(self, **data)\u001b[0m\n\u001b[1;32m 210\u001b[0m \u001b[38;5;66;03m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001b[39;00m\n\u001b[1;32m 211\u001b[0m __tracebackhide__ \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[0;32m--> 212\u001b[0m validated_self \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__pydantic_validator__\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_python\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mself_instance\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 213\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m validated_self:\n\u001b[1;32m 214\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[1;32m 215\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mA custom validator is returning a value other than `self`.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m 216\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReturning anything other than `self` from a top level model validator isn\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt supported when validating via `__init__`.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 217\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mSee the `model_validator` docs (https://docs.pydantic.dev/latest/concepts/validators/#model-validators) for more details.\u001b[39m\u001b[38;5;124m'\u001b[39m,\n\u001b[1;32m 218\u001b[0m category\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 219\u001b[0m )\n",
+"\u001b[0;31mValidationError\u001b[0m: 1 validation error for VectorstoreIndexCreator\nembedding\n Field required [type=missing, input_value={}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.9/v/missing"
+]
+}
+],
 "source": [
 "from langchain_community.document_loaders import TextLoader\n",
 "from langchain.indexes import VectorstoreIndexCreator\n",
@@ -495,7 +516,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.5"
+"version": "3.10.12"
 }
 },
 "nbformat": 4,
```
