
Commit f6ebf99

Feat: Fix for Instance Based Metrics and Updated docs (#1827)
Fixes #1766
1 parent ad1c3cd commit f6ebf99

File tree

4 files changed (+58, -30 lines)

docs/concepts/metrics/available_metrics/general_purpose.md

Lines changed: 50 additions & 17 deletions
````diff
@@ -103,26 +103,59 @@ Output
 
 ## Instance Specific rubrics criteria scoring
 
-Instance specific evaluation metric is a rubric-based evaluation metric that is used to evaluate responses on a specific instance, ie each instance to be evaluated is annotated with a rubric based evaluation criteria. The rubric consists of descriptions for each score, typically ranging from 1 to 5. The response here is evaluation and scored using the LLM using description specified in the rubric. This metric also have reference free and reference based variations. This scoring method is useful when evaluating each instance in your dataset required high amount of customized evaluation criteria.
+Instance Specific Evaluation Metric is a rubric-based method used to evaluate each item in a dataset individually. To use this metric, you need to provide a rubric along with the items you want to evaluate.
+
+!!! note
+    This differs from the `Rubric Based Criteria Scoring Metric`, where a single rubric is applied to uniformly evaluate all items in the dataset. In the `Instance-Specific Evaluation Metric`, you decide which rubric to use for each item. It's like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific).
 
 #### Example
 ```python
-from ragas.dataset_schema import SingleTurnSample
-from ragas.metrics import InstanceRubrics
-
-
-sample = SingleTurnSample(
-    user_input="Where is the Eiffel Tower located?",
-    response="The Eiffel Tower is located in Paris.",
-    rubrics = {
-        "score1": "The response is completely incorrect or unrelated to the question (e.g., 'The Eiffel Tower is in New York.' or talking about something entirely irrelevant).",
-        "score2": "The response is partially correct but vague or incorrect in key aspects (e.g., 'The Eiffel Tower is in France.' without mentioning Paris, or a similar incomplete location).",
-        "score3": "The response provides the correct location but with some factual inaccuracies or awkward phrasing (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'It is located in Paris, which is a country.').",
-        "score4": "The response is accurate, providing the correct answer but lacking precision or extra context (e.g., 'The Eiffel Tower is in Paris, France.' or a minor phrasing issue).",
-        "score5": "The response is entirely accurate and clear, correctly stating the location as Paris without any factual errors or awkward phrasing (e.g., 'The Eiffel Tower is located in Paris.')."
-    }
+dataset = [
+    # Relevance to Query
+    {
+        "user_query": "How do I handle exceptions in Python?",
+        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
+        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
+        "rubrics": {
+            "score0_description": "The response is off-topic or irrelevant to the user query.",
+            "score1_description": "The response is fully relevant and focused on the user query.",
+        },
+    },
+    # Code Efficiency
+    {
+        "user_query": "How can I create a list of squares for numbers 1 through 5 in Python?",
+        "response": """
+# Using a for loop
+squares = []
+for i in range(1, 6):
+    squares.append(i ** 2)
+print(squares)
+""",
+        "reference": """
+# Using a list comprehension
+squares = [i ** 2 for i in range(1, 6)]
+print(squares)
+""",
+        "rubrics": {
+            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
+            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
+        },
+    },
+]
+
+
+evaluation_dataset = EvaluationDataset.from_list(dataset)
+
+result = evaluate(
+    dataset=evaluation_dataset,
+    metrics=[InstanceRubrics(llm=evaluator_llm)],
+    llm=evaluator_llm,
 )
 
-scorer = InstanceRubrics(llm=evaluator_llm)
-await scorer.single_turn_ascore(sample)
+result
+```
+Output
+
+```
+{'instance_rubrics': 0.5000}
 ```
````
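The updated docs example assumes that `evaluate`, `EvaluationDataset`, `InstanceRubrics`, and an `evaluator_llm` are already in scope. A minimal setup sketch for readers trying the snippet locally — the wrapper and model choice here are illustrative assumptions, not part of this commit:

```python
from langchain_openai import ChatOpenAI

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import InstanceRubrics

# Wrap any LangChain-compatible chat model as the evaluator LLM used by the
# docs example above. The model name is an example; substitute your own judge model.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
```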

docs/concepts/metrics/available_metrics/semantic_similarity.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ from ragas.metrics import SemanticSimilarity
 
 sample = SingleTurnSample(
     response="The Eiffel Tower is located in Paris.",
-    reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
+    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
 )
 
 scorer = SemanticSimilarity()
```

src/ragas/metrics/_instance_specific_rubrics.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -38,13 +38,13 @@ class MultiTurnInputWithRubric(MultiTurnInputWithoutRubric):
 
 
 class SingleTurnPrompt(PydanticPrompt[SingleTurnInputWithRubric, ScoreFeedback]):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Your task is to assign an appropriate score and provide feedback to the inputs based solely on the scoring criteria passed in the input."
     input_model = SingleTurnInputWithRubric
     output_model = ScoreFeedback
 
 
 class MultiTurnPrompt(PydanticPrompt[MultiTurnInputWithRubric, ScoreFeedback]):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Your task is to assign an appropriate score and provide feedback to the inputs based solely on the scoring criteria passed in the input."
     input_model = MultiTurnInputWithRubric
     output_model = ScoreFeedback
 
```
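Because the prompt classes now ship with a non-empty default instruction, `InstanceRubrics` still supports single-sample scoring, as in the docs example this commit replaces. A short sketch under that assumption (rubric wording abbreviated; `evaluator_llm` as configured in the setup sketch above):

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubrics

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    rubrics={
        "score0_description": "The response is incorrect or unrelated to the question.",
        "score1_description": "The response correctly states that the Eiffel Tower is in Paris.",
    },
)

scorer = InstanceRubrics(llm=evaluator_llm)

# single_turn_ascore is a coroutine; inside a notebook you can simply `await` it.
print(asyncio.run(scorer.single_turn_ascore(sample)))
```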
src/ragas/metrics/_simple_criteria.py

Lines changed: 5 additions & 10 deletions
```diff
@@ -58,15 +58,15 @@ class MultiTurnSimpleCriteriaInput(BaseModel):
 class SingleTurnSimpleCriteriaPrompt(
     PydanticPrompt[SingleTurnSimpleCriteriaInput, SimpleCriteriaOutput]
 ):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Evaluate the input based on the criteria defined."
     input_model = SingleTurnSimpleCriteriaInput
     output_model = SimpleCriteriaOutput
 
 
 class MultiTurnSimpleCriteriaPrompt(
     PydanticPrompt[MultiTurnSimpleCriteriaInput, SimpleCriteriaOutput]
 ):
-    instruction = ""  # this will be set in the constructor
+    instruction = "Evaluate the input based on the criteria defined."
     input_model = MultiTurnSimpleCriteriaInput
     output_model = SimpleCriteriaOutput
 
@@ -123,11 +123,6 @@ def __init__(
         self.single_turn_prompt = single_turn_prompt or SingleTurnSimpleCriteriaPrompt()
         self.multi_turn_prompt = multi_turn_prompt or MultiTurnSimpleCriteriaPrompt()
 
-        # update the instruction for the prompts with the definition
-        instruction = f"Evaluate the Input based on the criterial defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
-        self.single_turn_prompt.instruction = instruction
-        self.multi_turn_prompt.instruction = instruction
-
         # ensure odd number of checks to avoid tie in majority vote.
         self.strictness = strictness
         self.strictness = (
@@ -145,9 +140,9 @@ def definition(self) -> str:
     def definition(self, value: str) -> None:
         self._definition = value
         # Update the instruction for both prompts with the new definition
-        instruction = f"Evaluate the Input based on the criterial defined. Give a score between 0 and 5.\nCriteria Definition: {self._definition}"
-        self.single_turn_prompt.instruction = instruction
-        self.multi_turn_prompt.instruction = instruction
+        instruction = f"\nCriteria Definition: {self._definition}"
+        self.single_turn_prompt.instruction += instruction
+        self.multi_turn_prompt.instruction += instruction
 
     def _compute_score(
         self, safe_loaded_responses: t.List[SimpleCriteriaOutput]
```
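The net effect in `_simple_criteria.py` is that the criteria definition is appended to the prompt's default instruction by the `definition` setter rather than being rebuilt in `__init__`. A hedged sketch of how this surfaces to users; the `SimpleCriteriaScore` class name and constructor arguments are assumptions about the ragas metrics API, not shown in this diff:

```python
from ragas.metrics import SimpleCriteriaScore  # assumed export; adjust to your ragas version

# `evaluator_llm` is assumed to be the wrapped evaluator LLM from the setup sketch above.
metric = SimpleCriteriaScore(
    name="coherence_score",
    definition="Score 0 to 5 on how coherent and well-structured the response is.",
    llm=evaluator_llm,
)

# After this change, the prompt keeps its default instruction
# ("Evaluate the input based on the criteria defined.") and the
# `definition` setter appends "\nCriteria Definition: <definition>" to it.
print(metric.single_turn_prompt.instruction)
```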
