
Commit 0f412de

Yunnglin and jjmachan authored
Feat: add multimodal eval support (#1559)
I am a developer from [ModelScope](https://github.com/modelscope/modelscope). This framework is great and I would like to add some new features. Multi-modal RAG evaluation is important, as mentioned in #1030. This PR adds support for image-text context RAG evaluation. Currently, it preliminarily supports MultiModalFaithfulness and MultiModalRelevance by referring to LlamaIndex (reference: [faithfulness](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/faithfulness.py) and [relevancy](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/relevancy.py)). *The current evaluation metrics are still quite preliminary and can be further improved in the future.*

The usage is as follows:

```python
from ragas.metrics import MultiModalFaithfulness, MultiModalRelevance
from datasets import Dataset
from ragas import evaluate

# load dataset
dataset = Dataset.from_json("outputs/testset_multi_modal.json")

# load metrics
metrics = [MultiModalFaithfulness(), MultiModalRelevance()]

# evaluate
score = evaluate(
    dataset,
    metrics=metrics,
    llm=llm  # models with interleaved image-text input, such as gpt-4o
)
score_df = score.to_pandas()
score_df
```

Input example:

```json
[
  {
    "user_input": "What brand is the car in the picture?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg",
      "The picture is related to an electric vehicle brand."
    ],
    "response": "Tesla is a car brand.",
    "reference": "The car brand in the picture is Tesla."
  },
  {
    "user_input": "What about the Tesla Model X?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg"
    ],
    "response": "Cats are cute.",
    "reference": "The Tesla Model X is an electric SUV manufactured by Tesla."
  }
]
```

Output example:

```json
[
  {
    "user_input": "What brand is the car in the picture?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg",
      "The picture is related to an electric vehicle brand."
    ],
    "response": "Tesla is a car brand.",
    "reference": "The car brand in the picture is Tesla.",
    "faithful_rate": true,
    "relevance_rate": true
  },
  {
    "user_input": "What about the Tesla Model X?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg"
    ],
    "response": "Cats are cute.",
    "reference": "The Tesla Model X is an electric SUV manufactured by Tesla.",
    "faithful_rate": false,
    "relevance_rate": false
  }
]
```

---------

Co-authored-by: jjmachan <[email protected]>
1 parent 61406d8 commit 0f412de

File tree: 12 files changed (+554 −21 lines)


docs/concepts/metrics/available_metrics/index.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,6 +11,8 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 - [Noise Sensitivity](noise_sensitivity.md)
 - [Response Relevancy](answer_relevance.md)
 - [Faithfulness](faithfulness.md)
+- [Multimodal Faithfulness](multi_modal_faithfulness.md)
+- [Multimodal Relevance](multi_modal_relevance.md)

 ## Agents or Tool use cases
```

docs/concepts/metrics/available_metrics/multi_modal_faithfulness.md

Lines changed: 50 additions & 0 deletions
## MultiModalFaithfulness

The `MultiModalFaithfulness` metric measures the factual consistency of the generated answer against both the visual and the textual context. It is computed from the answer, the retrieved textual context, and the retrieved visual context. The score is binary, with 1 indicating a faithful answer and 0 an unfaithful one.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from either the visual or the textual context provided. To determine this, the response is evaluated directly against the provided contexts, and the faithfulness score is either 0 or 1.

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import MultiModalFaithfulness

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="Cats are cute.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg"
    ]
)
scorer = MultiModalFaithfulness()
await scorer.single_turn_ascore(sample)
```

### How It’s Calculated

!!! example
    **Question**: What about the Tesla Model X?

    **Context (visual)**:

    - An image of the Tesla Model X (custom_eval/multimodal/images/tesla.jpg)

    **High faithfulness answer**: The Tesla Model X is an electric SUV manufactured by Tesla.

    **Low faithfulness answer**: Cats are cute.

Let's examine how faithfulness was calculated for the low faithfulness answer:

- **Step 1:** Evaluate the generated response against the given contexts.
    - Response: "Cats are cute."

- **Step 2:** Verify whether the response can be inferred from the given context.
    - Verdict: No

- **Step 3:** Use the result to determine the faithfulness score.

$$
\text{Faithfulness} = 0
$$

In this example, the response "Cats are cute" cannot be inferred from the image of the Tesla Model X, so the faithfulness score is 0.
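To complement the low-faithfulness walkthrough above, here is a minimal sketch (not part of this commit) of the high-faithfulness case end to end. It assumes a multimodal evaluator such as `gpt-4o` wrapped with ragas' `LangchainLLMWrapper`, and that the image path exists locally; since the judgment comes from an LLM, the 1.0 result is an expectation rather than a guarantee.

```python
# Minimal sketch, assuming a multimodal evaluator model and a local image path.
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import MultiModalFaithfulness

# Assumed evaluator: any model with interleaved image-text input should work.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="The Tesla Model X is an electric SUV manufactured by Tesla.",
    retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
)

scorer = MultiModalFaithfulness(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)  # expected: 1.0 for this high-faithfulness answer
```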
docs/concepts/metrics/available_metrics/multi_modal_relevance.md

Lines changed: 50 additions & 0 deletions
## MultiModalRelevance

The `MultiModalRelevance` metric measures the relevance of the generated answer against both the visual and the textual context. It is computed from the user input, the response, and the retrieved contexts (both visual and textual). The score is binary, with 1 indicating a relevant answer and 0 an irrelevant one.

The generated answer is regarded as relevant if it aligns with the visual or textual context provided. To determine this, the response is evaluated directly against the provided contexts, and the relevance score is either 0 or 1.

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import MultiModalRelevance

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="Cats are cute.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg"
    ]
)
scorer = MultiModalRelevance()
await scorer.single_turn_ascore(sample)
```

### How It’s Calculated

!!! example
    **Question**: What about the Tesla Model X?

    **Context (visual)**:

    - An image of the Tesla Model X (custom_eval/multimodal/images/tesla.jpg)

    **High relevance answer**: The Tesla Model X is an electric SUV manufactured by Tesla.

    **Low relevance answer**: Cats are cute.

Let's examine how relevance was calculated for the low relevance answer:

- **Step 1:** Evaluate the generated response against the given contexts.
    - Response: "Cats are cute."

- **Step 2:** Verify whether the response aligns with the given context.
    - Verdict: No

- **Step 3:** Use the result to determine the relevance score.

$$
\text{Relevance} = 0
$$

In this example, the response "Cats are cute" does not align with the image of the Tesla Model X, so the relevance score is 0.
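As a follow-up, here is a minimal sketch (not part of this commit) scoring the same sample with both multimodal metrics. Note that `MultiModalRelevance` additionally requires `user_input` among its required columns (see `_required_columns` in the source files added by this commit); the evaluator model and wrapper below are assumptions.

```python
# Minimal sketch, assuming a multimodal evaluator model and a local image path.
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import MultiModalFaithfulness, MultiModalRelevance

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))  # assumed evaluator

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="The Tesla Model X is an electric SUV manufactured by Tesla.",
    retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
)

faithfulness = MultiModalFaithfulness(llm=evaluator_llm)
relevance = MultiModalRelevance(llm=evaluator_llm)  # also uses user_input

print(await faithfulness.single_turn_ascore(sample))  # expected: 1.0
print(await relevance.single_turn_ascore(sample))     # expected: 1.0
```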

src/ragas/exceptions.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -27,9 +27,7 @@ class RagasOutputParserException(RagasException):
     """

     def __init__(self):
-        msg = (
-            "The output parser failed to parse the output including retries."
-        )
+        msg = "The output parser failed to parse the output including retries."
         super().__init__(msg)

```
src/ragas/metrics/__init__.py

Lines changed: 13 additions & 1 deletion
```diff
@@ -21,8 +21,8 @@
 from ragas.metrics._context_precision import (
     ContextPrecision,
     ContextUtilization,
-    LLMContextPrecisionWithReference,
     LLMContextPrecisionWithoutReference,
+    LLMContextPrecisionWithReference,
     NonLLMContextPrecisionWithReference,
     context_precision,
 )
@@ -47,6 +47,14 @@
     InstanceRubricsScoreWithoutReference,
     InstanceRubricsWithReference,
 )
+from ragas.metrics._multi_modal_faithfulness import (
+    MultiModalFaithfulness,
+    multimodal_faithness,
+)
+from ragas.metrics._multi_modal_relevance import (
+    MultiModalRelevance,
+    multimodal_relevance,
+)
 from ragas.metrics._noise_sensitivity import NoiseSensitivity
 from ragas.metrics._rogue_score import RougeScore
 from ragas.metrics._sql_semantic_equivalence import LLMSQLEquivalence
@@ -107,6 +115,10 @@
     "DistanceMeasure",
     "TopicAdherenceScore",
     "LLMSQLEquivalence",
+    "MultiModalFaithfulness",
+    "multimodal_faithness",
+    "MultiModalRelevance",
+    "multimodal_relevance",
 ]

 current_module = sys.modules[__name__]
```
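A brief sketch (not part of this commit) of how the new exports can be consumed. The class constructors and the ready-made instances (`multimodal_faithness`, spelled as in this commit, and `multimodal_relevance`) are interchangeable; whether `ragas.evaluate()` injects its `llm` into instances that have none set is an assumption based on how other ragas metrics are typically used.

```python
from ragas.metrics import (
    MultiModalFaithfulness,
    MultiModalRelevance,
    multimodal_faithness,  # instance name as exported by this commit
    multimodal_relevance,
)

# Either construct fresh metric objects (e.g. to attach a dedicated evaluator LLM)...
metrics = [MultiModalFaithfulness(), MultiModalRelevance()]

# ...or reuse the module-level instances; ragas.evaluate() is assumed to attach
# the evaluator LLM to metrics that have none set.
metrics = [multimodal_faithness, multimodal_relevance]
```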
src/ragas/metrics/_multi_modal_faithfulness.py

Lines changed: 98 additions & 0 deletions
```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass, field

import numpy as np
from pydantic import BaseModel, Field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric
from ragas.prompt import ImageTextPrompt

if t.TYPE_CHECKING:
    from langchain_core.callbacks import Callbacks


class FaithfulnessInput(BaseModel):
    response: str = Field(description="response from AI")
    retrieved_contexts: list[str] = Field(description="contexts retrieved from the LLM")

    def to_string_list(self):
        return [
            "inputs:",
            self.response,
            "retrieved_contexts: ",
        ] + self.retrieved_contexts


class FaithfulnessOutput(BaseModel):
    faithful: bool = Field(description="boolean indicating if request was faithful")


class MultiModalFaithfulnessPrompt(
    ImageTextPrompt[FaithfulnessInput, FaithfulnessOutput]
):
    # refer: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/faithfulness.py
    instruction = "Please tell if a given piece of information is supported by the visual as well as textual context information. You need to answer with either True or False. Answer True if any of the image(s) and textual context supports the information"
    input_model = FaithfulnessInput
    output_model = FaithfulnessOutput
    examples = [
        (
            FaithfulnessInput(
                response="Apple pie is generally double-crusted.",
                retrieved_contexts=[
                    "An apple pie is a fruit pie in which the principal filling ingredient is apples.",
                    "Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.",
                    "It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).",
                ],
            ),
            FaithfulnessOutput(faithful=True),
        ),
        (
            FaithfulnessInput(
                response="Apple pies tastes bad.",
                retrieved_contexts=[
                    "An apple pie is a fruit pie in which the principal filling ingredient is apples.",
                    "Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.",
                    "It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).",
                ],
            ),
            FaithfulnessOutput(faithful=False),
        ),
    ]


@dataclass
class MultiModalFaithfulness(MetricWithLLM, SingleTurnMetric):
    name: str = "faithful_rate"  # type: ignore
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "response",
                "retrieved_contexts",
            }
        }
    )
    faithfulness_prompt: ImageTextPrompt = MultiModalFaithfulnessPrompt()

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        prompt_input = FaithfulnessInput(
            response=row["response"], retrieved_contexts=row["retrieved_contexts"]
        )
        assert self.llm is not None, "LLM is not set"
        prompt_response = await self.faithfulness_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        if prompt_response is None:
            return np.nan
        return float(prompt_response.faithful)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        row = sample.to_dict()
        return await self._ascore(row, callbacks)


multimodal_faithness = MultiModalFaithfulness()
```
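For reference, a small sketch (not part of this commit) of the intermediate representation the prompt works with: `FaithfulnessInput.to_string_list()` flattens the response and the retrieved contexts, including image paths, into the segments that `ImageTextPrompt` consumes. The printed output below follows directly from the method defined above.

```python
# Minimal sketch: inspecting the segments passed to the image-text prompt.
from ragas.metrics._multi_modal_faithfulness import FaithfulnessInput

prompt_input = FaithfulnessInput(
    response="Tesla is a car brand.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg",
        "The picture is related to an electric vehicle brand.",
    ],
)
print(prompt_input.to_string_list())
# ['inputs:', 'Tesla is a car brand.', 'retrieved_contexts: ',
#  'custom_eval/multimodal/images/tesla.jpg',
#  'The picture is related to an electric vehicle brand.']
```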
src/ragas/metrics/_multi_modal_relevance.py

Lines changed: 106 additions & 0 deletions
```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass, field

import numpy as np
from pydantic import BaseModel, Field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric
from ragas.prompt import ImageTextPrompt

if t.TYPE_CHECKING:
    from langchain_core.callbacks import Callbacks


class RelevanceInput(BaseModel):
    user_input: str = Field(description="user input")
    response: str = Field(description="response from AI")
    retrieved_contexts: list[str] = Field(description="contexts retrieved from the LLM")

    def to_string_list(self):
        return [
            f"Question: {self.user_input}",
            f"Response: {self.response}",
            "retrieved_contexts: ",
        ] + self.retrieved_contexts


class RelevanceOutput(BaseModel):
    relevance: bool = Field(description="boolean indicating if request was relevance")


class MultiModalRelevancePrompt(ImageTextPrompt[RelevanceInput, RelevanceOutput]):
    # refer https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/relevancy.py
    instruction = """
Your task is to evaluate if the response for the query is in line with the images and textual context information provided.
You have two options to answer. Either True / False.
Answer - True, if the response for the query is in line with context information otherwise False.
"""
    input_model = RelevanceInput
    output_model = RelevanceOutput
    examples = [
        (
            RelevanceInput(
                user_input="What is the primary ingredient in a traditional Margherita pizza?",
                response="The primary ingredients in a Margherita pizza are tomatoes, mozzarella cheese, and fresh basil.",
                retrieved_contexts=[
                    "A traditional Margherita pizza consists of a thin crust.",
                    "The main toppings include tomatoes, mozzarella cheese, fresh basil, salt, and olive oil.",
                    "It is one of the simplest and most classic types of pizza.",
                ],
            ),
            RelevanceOutput(relevance=True),
        ),
        (
            RelevanceInput(
                user_input="Who won the Best Actor award at the Oscars in 2021?",
                response="The Best Actor award in 2021 was won by Leonardo DiCaprio.",
                retrieved_contexts=[
                    "The 93rd Academy Awards were held in 2021.",
                    "Anthony Hopkins won the Best Actor award for his role in 'The Father'.",
                    "The event was unique due to COVID-19 restrictions.",
                ],
            ),
            RelevanceOutput(relevance=False),
        ),
    ]


@dataclass
class MultiModalRelevance(MetricWithLLM, SingleTurnMetric):
    name: str = "relevance_rate"  # type: ignore
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    relevance_prompt: ImageTextPrompt = MultiModalRelevancePrompt()

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        prompt_input = RelevanceInput(
            user_input=row["user_input"],
            response=row["response"],
            retrieved_contexts=row["retrieved_contexts"],
        )
        assert self.llm is not None, "LLM is not set"
        prompt_response = await self.relevance_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        if prompt_response is None:
            return np.nan
        return float(prompt_response.relevance)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        row = sample.to_dict()
        return await self._ascore(row, callbacks)


multimodal_relevance = MultiModalRelevance()
```
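Analogously, a small sketch (not part of this commit) of `RelevanceInput.to_string_list()`. Note that the `name` fields above ("faithful_rate", "relevance_rate") are the column names that appear in the output example of the PR description.

```python
# Minimal sketch: inspecting the segments passed to the image-text prompt.
from ragas.metrics._multi_modal_relevance import RelevanceInput

prompt_input = RelevanceInput(
    user_input="What brand is the car in the picture?",
    response="Tesla is a car brand.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg",
        "The picture is related to an electric vehicle brand.",
    ],
)
print(prompt_input.to_string_list())
# ['Question: What brand is the car in the picture?',
#  'Response: Tesla is a car brand.',
#  'retrieved_contexts: ',
#  'custom_eval/multimodal/images/tesla.jpg',
#  'The picture is related to an electric vehicle brand.']
```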

src/ragas/prompt/__init__.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -1,5 +1,6 @@
 from .base import BasePrompt, BoolIO, StringIO, StringPrompt
 from .mixin import PromptMixin
+from .multi_modal_prompt import ImageTextPrompt, ImageTextPromptValue
 from .pydantic_prompt import InputModel, OutputModel, PydanticPrompt

 __all__ = [
@@ -11,4 +12,6 @@
     "PromptMixin",
     "InputModel",
     "OutputModel",
+    "ImageTextPrompt",
+    "ImageTextPromptValue",
 ]
```
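Since `ImageTextPrompt` and `ImageTextPromptValue` are now public, here is a hedged sketch (not part of this commit) of a custom image-text prompt modeled on `MultiModalFaithfulnessPrompt` above. All names here (`CaptionInput`, `CaptionOutput`, `CaptionCheckPrompt`) are hypothetical, and the assumption that `ImageTextPrompt` renders the `to_string_list()` segments as interleaved text/image parts follows the pattern of the prompts added in this commit.

```python
# Minimal sketch of a custom ImageTextPrompt; all class names are hypothetical.
from pydantic import BaseModel, Field

from ragas.prompt import ImageTextPrompt


class CaptionInput(BaseModel):
    caption: str = Field(description="caption to verify")
    retrieved_contexts: list[str] = Field(description="image paths and/or text snippets")

    def to_string_list(self):
        # Mixed list of text and image paths, mirroring FaithfulnessInput/RelevanceInput.
        return ["caption:", self.caption, "retrieved_contexts:"] + self.retrieved_contexts


class CaptionOutput(BaseModel):
    supported: bool = Field(description="True if the caption is supported by the contexts")


class CaptionCheckPrompt(ImageTextPrompt[CaptionInput, CaptionOutput]):
    instruction = "Answer True if the caption is supported by the image(s) or text, otherwise False."
    input_model = CaptionInput
    output_model = CaptionOutput
    examples = [
        (
            CaptionInput(
                caption="The image shows a Tesla.",
                retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
            ),
            CaptionOutput(supported=True),
        ),
    ]
```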
