
Commit 0f412de

Yunnglin and jjmachan authored
Feat: add multimodal eval support (#1559)
I am a developer from [ModelScope](https://github.com/modelscope/modelscope). This framework is great and I would like to add some new features. Multi-modal RAG evaluation is important, as mentioned in #1030. This PR adds support for image-text context RAG evaluation. Currently, it preliminarily supports MultiModalFaithfulness and MultiModalRelevance by referring to LlamaIndex (reference: [faithfulness](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/faithfulness.py) and [relevancy](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/relevancy.py)). *The current evaluation metrics are still quite preliminary and can be further improved in the future.*

The usage is as follows:

```python
from ragas.metrics import MultiModalFaithfulness, MultiModalRelevance
from datasets import Dataset
from ragas import evaluate

# load dataset
dataset = Dataset.from_json("outputs/testset_multi_modal.json")

# load metrics
metrics = [MultiModalFaithfulness(), MultiModalRelevance()]

# evaluate
score = evaluate(
    dataset,
    metrics=metrics,
    llm=llm  # models with interleaved image-text input, such as gpt-4o
)
score_df = score.to_pandas()
score_df
```

Input example:

```json
[
  {
    "user_input": "What brand is the car in the picture?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg",
      "The picture is related to an electric vehicle brand."
    ],
    "response": "Tesla is a car brand.",
    "reference": "The car brand in the picture is Tesla."
  },
  {
    "user_input": "What about the Tesla Model X?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg"
    ],
    "response": "Cats are cute.",
    "reference": "The Tesla Model X is an electric SUV manufactured by Tesla."
  }
]
```

Output example:

```json
[
  {
    "user_input": "What brand is the car in the picture?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg",
      "The picture is related to an electric vehicle brand."
    ],
    "response": "Tesla is a car brand.",
    "reference": "The car brand in the picture is Tesla.",
    "faithful_rate": true,
    "relevance_rate": true
  },
  {
    "user_input": "What about the Tesla Model X?",
    "retrieved_contexts": [
      "custom_eval/multimodal/images/tesla.jpg"
    ],
    "response": "Cats are cute.",
    "reference": "The Tesla Model X is an electric SUV manufactured by Tesla.",
    "faithful_rate": false,
    "relevance_rate": false
  }
]
```

---------

Co-authored-by: jjmachan <[email protected]>
1 parent 61406d8 commit 0f412de

File tree: 12 files changed (+554 −21 lines)


docs/concepts/metrics/available_metrics/index.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,6 +11,8 @@ Each metric are essentially paradigms that are designed to evaluate a particular
 - [Noise Sensitivity](noise_sensitivity.md)
 - [Response Relevancy](answer_relevance.md)
 - [Faithfulness](faithfulness.md)
+- [Multimodal Faithfulness](multi_modal_faithfulness.md)
+- [Multimodal Relevance](multi_modal_relevance.md)

 ## Agents or Tool use cases
```

docs/concepts/metrics/available_metrics/multi_modal_faithfulness.md

Lines changed: 50 additions & 0 deletions
## MultiModalFaithfulness

The `MultiModalFaithfulness` metric measures the factual consistency of the generated answer against both the visual and the textual context. It is computed from the answer, the retrieved textual context, and the retrieved visual context. The score is binary, with 1 indicating a faithful answer and 0 an unfaithful one.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from either the visual or the textual context provided. To determine this, the response is evaluated directly against the provided contexts, and the faithfulness score is either 0 or 1.

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import MultiModalFaithfulness

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="Cats are cute.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg"
    ]
)
scorer = MultiModalFaithfulness()
await scorer.single_turn_ascore(sample)
```

### How It’s Calculated

!!! example
    **Question**: What about the Tesla Model X?

    **Context (visual)**:

    - An image of the Tesla Model X (custom_eval/multimodal/images/tesla.jpg)

    **High faithfulness answer**: The Tesla Model X is an electric SUV manufactured by Tesla.

    **Low faithfulness answer**: Cats are cute.

Let's examine how faithfulness was calculated for the low faithfulness answer:

- **Step 1:** Evaluate the generated response against the given contexts.
    - Response: "Cats are cute."

- **Step 2:** Verify whether the response can be inferred from the given context.
    - Verdict: No

- **Step 3:** Use the result to determine the faithfulness score.

$$
\text{Faithfulness} = 0
$$

In this example, the response "Cats are cute" cannot be inferred from the image of the Tesla Model X, so the faithfulness score is 0.
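To complement the low-faithfulness walkthrough above, here is a minimal sketch (not part of this commit) of the high-faithfulness case end to end. It assumes a multimodal evaluator such as `gpt-4o` wrapped with ragas' `LangchainLLMWrapper`, and that the image path exists locally; since the judgment comes from an LLM, the 1.0 result is an expectation rather than a guarantee.

```python
# Minimal sketch, assuming a multimodal evaluator model and a local image path.
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import MultiModalFaithfulness

# Assumed evaluator: any model with interleaved image-text input should work.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="The Tesla Model X is an electric SUV manufactured by Tesla.",
    retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
)

scorer = MultiModalFaithfulness(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)  # expected: 1.0 for this high-faithfulness answer
```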
docs/concepts/metrics/available_metrics/multi_modal_relevance.md

Lines changed: 50 additions & 0 deletions
## MultiModalRelevance

The `MultiModalRelevance` metric measures the relevance of the generated answer against both the visual and the textual context. It is computed from the user input, the response, and the retrieved contexts (both visual and textual). The score is binary, with 1 indicating a relevant answer and 0 an irrelevant one.

The generated answer is regarded as relevant if it aligns with the visual or textual context provided. To determine this, the response is evaluated directly against the provided contexts, and the relevance score is either 0 or 1.

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import MultiModalRelevance

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="Cats are cute.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg"
    ]
)
scorer = MultiModalRelevance()
await scorer.single_turn_ascore(sample)
```

### How It’s Calculated

!!! example
    **Question**: What about the Tesla Model X?

    **Context (visual)**:

    - An image of the Tesla Model X (custom_eval/multimodal/images/tesla.jpg)

    **High relevance answer**: The Tesla Model X is an electric SUV manufactured by Tesla.

    **Low relevance answer**: Cats are cute.

Let's examine how relevance was calculated for the low relevance answer:

- **Step 1:** Evaluate the generated response against the given contexts.
    - Response: "Cats are cute."

- **Step 2:** Verify whether the response aligns with the given context.
    - Verdict: No

- **Step 3:** Use the result to determine the relevance score.

$$
\text{Relevance} = 0
$$

In this example, the response "Cats are cute" does not align with the image of the Tesla Model X, so the relevance score is 0.
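As a follow-up, here is a minimal sketch (not part of this commit) scoring the same sample with both multimodal metrics. Note that `MultiModalRelevance` additionally requires `user_input` among its required columns (see `_required_columns` in the source files added by this commit); the evaluator model and wrapper below are assumptions.

```python
# Minimal sketch, assuming a multimodal evaluator model and a local image path.
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import MultiModalFaithfulness, MultiModalRelevance

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))  # assumed evaluator

sample = SingleTurnSample(
    user_input="What about the Tesla Model X?",
    response="The Tesla Model X is an electric SUV manufactured by Tesla.",
    retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
)

faithfulness = MultiModalFaithfulness(llm=evaluator_llm)
relevance = MultiModalRelevance(llm=evaluator_llm)  # also uses user_input

print(await faithfulness.single_turn_ascore(sample))  # expected: 1.0
print(await relevance.single_turn_ascore(sample))     # expected: 1.0
```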

src/ragas/exceptions.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -27,9 +27,7 @@ class RagasOutputParserException(RagasException):
     """

     def __init__(self):
-        msg = (
-            "The output parser failed to parse the output including retries."
-        )
+        msg = "The output parser failed to parse the output including retries."
         super().__init__(msg)

```
src/ragas/metrics/__init__.py

Lines changed: 13 additions & 1 deletion
```diff
@@ -21,8 +21,8 @@
 from ragas.metrics._context_precision import (
     ContextPrecision,
     ContextUtilization,
-    LLMContextPrecisionWithReference,
     LLMContextPrecisionWithoutReference,
+    LLMContextPrecisionWithReference,
     NonLLMContextPrecisionWithReference,
     context_precision,
 )
@@ -47,6 +47,14 @@
     InstanceRubricsScoreWithoutReference,
     InstanceRubricsWithReference,
 )
+from ragas.metrics._multi_modal_faithfulness import (
+    MultiModalFaithfulness,
+    multimodal_faithness,
+)
+from ragas.metrics._multi_modal_relevance import (
+    MultiModalRelevance,
+    multimodal_relevance,
+)
 from ragas.metrics._noise_sensitivity import NoiseSensitivity
 from ragas.metrics._rogue_score import RougeScore
 from ragas.metrics._sql_semantic_equivalence import LLMSQLEquivalence
@@ -107,6 +115,10 @@
     "DistanceMeasure",
     "TopicAdherenceScore",
     "LLMSQLEquivalence",
+    "MultiModalFaithfulness",
+    "multimodal_faithness",
+    "MultiModalRelevance",
+    "multimodal_relevance",
 ]

 current_module = sys.modules[__name__]
```
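A brief sketch (not part of this commit) of how the new exports can be consumed. The class constructors and the ready-made instances (`multimodal_faithness`, spelled as in this commit, and `multimodal_relevance`) are interchangeable; whether `ragas.evaluate()` injects its `llm` into instances that have none set is an assumption based on how other ragas metrics are typically used.

```python
from ragas.metrics import (
    MultiModalFaithfulness,
    MultiModalRelevance,
    multimodal_faithness,  # instance name as exported by this commit
    multimodal_relevance,
)

# Either construct fresh metric objects (e.g. to attach a dedicated evaluator LLM)...
metrics = [MultiModalFaithfulness(), MultiModalRelevance()]

# ...or reuse the module-level instances; ragas.evaluate() is assumed to attach
# the evaluator LLM to metrics that have none set.
metrics = [multimodal_faithness, multimodal_relevance]
```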
src/ragas/metrics/_multi_modal_faithfulness.py

Lines changed: 98 additions & 0 deletions
```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass, field

import numpy as np
from pydantic import BaseModel, Field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric
from ragas.prompt import ImageTextPrompt

if t.TYPE_CHECKING:
    from langchain_core.callbacks import Callbacks


class FaithfulnessInput(BaseModel):
    response: str = Field(description="response from AI")
    retrieved_contexts: list[str] = Field(description="contexts retrieved from the LLM")

    def to_string_list(self):
        return [
            "inputs:",
            self.response,
            "retrieved_contexts: ",
        ] + self.retrieved_contexts


class FaithfulnessOutput(BaseModel):
    faithful: bool = Field(description="boolean indicating if request was faithful")


class MultiModalFaithfulnessPrompt(
    ImageTextPrompt[FaithfulnessInput, FaithfulnessOutput]
):
    # refer: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/faithfulness.py
    instruction = "Please tell if a given piece of information is supported by the visual as well as textual context information. You need to answer with either True or False. Answer True if any of the image(s) and textual context supports the information"
    input_model = FaithfulnessInput
    output_model = FaithfulnessOutput
    examples = [
        (
            FaithfulnessInput(
                response="Apple pie is generally double-crusted.",
                retrieved_contexts=[
                    "An apple pie is a fruit pie in which the principal filling ingredient is apples.",
                    "Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.",
                    "It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).",
                ],
            ),
            FaithfulnessOutput(faithful=True),
        ),
        (
            FaithfulnessInput(
                response="Apple pies tastes bad.",
                retrieved_contexts=[
                    "An apple pie is a fruit pie in which the principal filling ingredient is apples.",
                    "Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.",
                    "It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).",
                ],
            ),
            FaithfulnessOutput(faithful=False),
        ),
    ]


@dataclass
class MultiModalFaithfulness(MetricWithLLM, SingleTurnMetric):
    name: str = "faithful_rate"  # type: ignore
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "response",
                "retrieved_contexts",
            }
        }
    )
    faithfulness_prompt: ImageTextPrompt = MultiModalFaithfulnessPrompt()

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        prompt_input = FaithfulnessInput(
            response=row["response"], retrieved_contexts=row["retrieved_contexts"]
        )
        assert self.llm is not None, "LLM is not set"
        prompt_response = await self.faithfulness_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        if prompt_response is None:
            return np.nan
        return float(prompt_response.faithful)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        row = sample.to_dict()
        return await self._ascore(row, callbacks)


multimodal_faithness = MultiModalFaithfulness()
```
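For reference, a small sketch (not part of this commit) of the intermediate representation the prompt works with: `FaithfulnessInput.to_string_list()` flattens the response and the retrieved contexts, including image paths, into the segments that `ImageTextPrompt` consumes. The printed output below follows directly from the method defined above.

```python
# Minimal sketch: inspecting the segments passed to the image-text prompt.
from ragas.metrics._multi_modal_faithfulness import FaithfulnessInput

prompt_input = FaithfulnessInput(
    response="Tesla is a car brand.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg",
        "The picture is related to an electric vehicle brand.",
    ],
)
print(prompt_input.to_string_list())
# ['inputs:', 'Tesla is a car brand.', 'retrieved_contexts: ',
#  'custom_eval/multimodal/images/tesla.jpg',
#  'The picture is related to an electric vehicle brand.']
```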
src/ragas/metrics/_multi_modal_relevance.py

Lines changed: 106 additions & 0 deletions
```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass, field

import numpy as np
from pydantic import BaseModel, Field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric
from ragas.prompt import ImageTextPrompt

if t.TYPE_CHECKING:
    from langchain_core.callbacks import Callbacks


class RelevanceInput(BaseModel):
    user_input: str = Field(description="user input")
    response: str = Field(description="response from AI")
    retrieved_contexts: list[str] = Field(description="contexts retrieved from the LLM")

    def to_string_list(self):
        return [
            f"Question: {self.user_input}",
            f"Response: {self.response}",
            "retrieved_contexts: ",
        ] + self.retrieved_contexts


class RelevanceOutput(BaseModel):
    relevance: bool = Field(description="boolean indicating if request was relevance")


class MultiModalRelevancePrompt(ImageTextPrompt[RelevanceInput, RelevanceOutput]):
    # refer https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/multi_modal/relevancy.py
    instruction = """
Your task is to evaluate if the response for the query is in line with the images and textual context information provided.
You have two options to answer. Either True / False.
Answer - True, if the response for the query is in line with context information otherwise False.
"""
    input_model = RelevanceInput
    output_model = RelevanceOutput
    examples = [
        (
            RelevanceInput(
                user_input="What is the primary ingredient in a traditional Margherita pizza?",
                response="The primary ingredients in a Margherita pizza are tomatoes, mozzarella cheese, and fresh basil.",
                retrieved_contexts=[
                    "A traditional Margherita pizza consists of a thin crust.",
                    "The main toppings include tomatoes, mozzarella cheese, fresh basil, salt, and olive oil.",
                    "It is one of the simplest and most classic types of pizza.",
                ],
            ),
            RelevanceOutput(relevance=True),
        ),
        (
            RelevanceInput(
                user_input="Who won the Best Actor award at the Oscars in 2021?",
                response="The Best Actor award in 2021 was won by Leonardo DiCaprio.",
                retrieved_contexts=[
                    "The 93rd Academy Awards were held in 2021.",
                    "Anthony Hopkins won the Best Actor award for his role in 'The Father'.",
                    "The event was unique due to COVID-19 restrictions.",
                ],
            ),
            RelevanceOutput(relevance=False),
        ),
    ]


@dataclass
class MultiModalRelevance(MetricWithLLM, SingleTurnMetric):
    name: str = "relevance_rate"  # type: ignore
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    relevance_prompt: ImageTextPrompt = MultiModalRelevancePrompt()

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        prompt_input = RelevanceInput(
            user_input=row["user_input"],
            response=row["response"],
            retrieved_contexts=row["retrieved_contexts"],
        )
        assert self.llm is not None, "LLM is not set"
        prompt_response = await self.relevance_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        if prompt_response is None:
            return np.nan
        return float(prompt_response.relevance)

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float:
        row = sample.to_dict()
        return await self._ascore(row, callbacks)


multimodal_relevance = MultiModalRelevance()
```
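Analogously, a small sketch (not part of this commit) of `RelevanceInput.to_string_list()`. Note that the `name` fields above ("faithful_rate", "relevance_rate") are the column names that appear in the output example of the PR description.

```python
# Minimal sketch: inspecting the segments passed to the image-text prompt.
from ragas.metrics._multi_modal_relevance import RelevanceInput

prompt_input = RelevanceInput(
    user_input="What brand is the car in the picture?",
    response="Tesla is a car brand.",
    retrieved_contexts=[
        "custom_eval/multimodal/images/tesla.jpg",
        "The picture is related to an electric vehicle brand.",
    ],
)
print(prompt_input.to_string_list())
# ['Question: What brand is the car in the picture?',
#  'Response: Tesla is a car brand.',
#  'retrieved_contexts: ',
#  'custom_eval/multimodal/images/tesla.jpg',
#  'The picture is related to an electric vehicle brand.']
```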

src/ragas/prompt/__init__.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -1,5 +1,6 @@
 from .base import BasePrompt, BoolIO, StringIO, StringPrompt
 from .mixin import PromptMixin
+from .multi_modal_prompt import ImageTextPrompt, ImageTextPromptValue
 from .pydantic_prompt import InputModel, OutputModel, PydanticPrompt

 __all__ = [
@@ -11,4 +12,6 @@
     "PromptMixin",
     "InputModel",
     "OutputModel",
+    "ImageTextPrompt",
+    "ImageTextPromptValue",
 ]
```
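Since `ImageTextPrompt` and `ImageTextPromptValue` are now public, here is a hedged sketch (not part of this commit) of a custom image-text prompt modeled on `MultiModalFaithfulnessPrompt` above. All names here (`CaptionInput`, `CaptionOutput`, `CaptionCheckPrompt`) are hypothetical, and the assumption that `ImageTextPrompt` renders the `to_string_list()` segments as interleaved text/image parts follows the pattern of the prompts added in this commit.

```python
# Minimal sketch of a custom ImageTextPrompt; all class names are hypothetical.
from pydantic import BaseModel, Field

from ragas.prompt import ImageTextPrompt


class CaptionInput(BaseModel):
    caption: str = Field(description="caption to verify")
    retrieved_contexts: list[str] = Field(description="image paths and/or text snippets")

    def to_string_list(self):
        # Mixed list of text and image paths, mirroring FaithfulnessInput/RelevanceInput.
        return ["caption:", self.caption, "retrieved_contexts:"] + self.retrieved_contexts


class CaptionOutput(BaseModel):
    supported: bool = Field(description="True if the caption is supported by the contexts")


class CaptionCheckPrompt(ImageTextPrompt[CaptionInput, CaptionOutput]):
    instruction = "Answer True if the caption is supported by the image(s) or text, otherwise False."
    input_model = CaptionInput
    output_model = CaptionOutput
    examples = [
        (
            CaptionInput(
                caption="The image shows a Tesla.",
                retrieved_contexts=["custom_eval/multimodal/images/tesla.jpg"],
            ),
            CaptionOutput(supported=True),
        ),
    ]
```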
