Commit e38e911

Modify rag evaluation paragraph titles, add tables, change image to reflect chosen palette
1 parent 3cb7c89 commit e38e911

2 files changed: +44 -16 lines changed

md-docs/imgs/rag_evaluation.svg

Lines changed: 1 addition & 0 deletions

md-docs/user_guide/modules/rag_evaluation.md

Lines changed: 43 additions & 16 deletions
@@ -1,6 +1,6 @@
# Rag Evaluation

-## What is RAG Evaluation?
+## Introduction

RAG (Retrieval-Augmented Generation) is a way of building AI models that enhances their ability to generate accurate and contextually relevant responses by combining two main steps: **retrieval** and **generation**.
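
To make the two steps concrete, here is a minimal retrieve-then-generate sketch; the toy keyword-overlap retriever and the placeholder `generate` function are assumptions for illustration only, not ML cube Platform code.

```python
# Minimal retrieve-then-generate sketch (illustrative only, not ML cube Platform code).
from typing import List


def retrieve(user_input: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval step: rank documents by word overlap with the user input."""
    query_words = set(user_input.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate(user_input: str, context: List[str]) -> str:
    """Placeholder generation step: a real system would call an LLM with the context."""
    return f"Answer to {user_input!r} grounded in {len(context)} retrieved chunk(s)."


documents = [
    "RAG combines document retrieval with LLM generation.",
    "The weather is sunny today.",
]
context = retrieve("What is RAG?", documents)  # retrieval step
response = generate("What is RAG?", context)   # generation step
```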

@@ -11,9 +11,11 @@ Evaluating RAG involves assessing how well the model does in both retrieval and

Our RAG evaluation module analyzes the three main components of a RAG framework:

-- **User Input**: The query or question posed by the user.
-- **Context**: The retrieved documents or information that the model uses to generate a response. A context can consist of one or more chunks of text.
-- **Response**: The generated answer or output provided by the model.
+| Component  | Description                                                                         |
+| ---------- | ------------------------------------------------------------------------------------ |
+| User Input | The query or question posed by the user.                                              |
+| Context    | The retrieved documents or information that the model uses to generate a response.    |
+| Response   | The generated answer or output provided by the model.                                 |

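For illustration only, a single sample carrying these three components could be represented as below; the `RagSample` class and its field names are hypothetical, not the platform's data schema.

```python
# Hypothetical representation of one RAG sample (not the ML cube Platform schema).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RagSample:
    user_input: Optional[str]     # the query posed by the user
    context: Optional[List[str]]  # the retrieved chunks of text
    response: Optional[str]       # the generated answer


sample = RagSample(
    user_input="What is the refund policy?",
    context=["Refunds are accepted within 30 days of purchase."],
    response="You can request a refund within 30 days of purchase.",
)
```
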
In particular, the analysis is performed on the relationships between these components:

@@ -22,34 +24,59 @@ In particular, the analysis is performed on the relationships between these comp
- **User Input - Response**: Response Evaluation

<figure markdown>
-![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.png){ width="600"}
+![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.svg){ width="600"}
<figcaption>ML cube Platform RAG Evaluation</figcaption>
</figure>

-The evaluation is performed through an LLM-as-a-Judge approach, where a Language Model (LM) acts as a judge to evaluate the quality of a RAG model.
+The evaluation is performed through an LLM-as-a-Judge approach, where a Large Language Model (LLM) acts as a judge to evaluate the quality of a RAG model.
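
As an intuition of the LLM-as-a-Judge approach, here is a hedged sketch; the prompt wording, the JSON output format and the `call_llm` helper are assumptions made for illustration, not the module's actual prompts or API.

```python
# Sketch of an LLM-as-a-Judge call (prompt and helper are illustrative assumptions).
import json

JUDGE_PROMPT = """You are a judge evaluating a RAG system.
User input: {user_input}
Retrieved context: {context}
Rate how relevant the context is to the user input on a 1 (lowest) to 5 (highest) scale,
and explain your rating. Reply as JSON: {{"score": <int>, "explanation": "<text>"}}."""


def judge_relevance(user_input: str, context: str, call_llm) -> dict:
    """`call_llm` is any text-in/text-out function wrapping the judge LLM."""
    raw = call_llm(JUDGE_PROMPT.format(user_input=user_input, context=context))
    return json.loads(raw)  # e.g. {"score": 4, "explanation": "..."}
```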

-## What are the computed metrics?
+## Computed metrics

-Below are the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above.
+In this section we describe the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above. Every computed metric consists of a **score** and an **explanation** provided by the LLM, describing the reasons behind the assigned score.
+
+Below is a summary table of the metrics:
+
+| Metric       | User Input       | Context          | Response         |
+| ------------ | ---------------- | ---------------- | ---------------- |
+| Relevance    | :material-check: | :material-check: |                  |
+| Usefulness   | :material-check: | :material-check: |                  |
+| Utilization  | :material-check: | :material-check: |                  |
+| Attribution  | :material-check: | :material-check: |                  |
+| Faithfulness |                  | :material-check: | :material-check: |
+| Satisfaction | :material-check: |                  | :material-check: |

### User Input - Context

-- **Relevance**: Measures how much the retrieved context is relevant to the user input. The score ranges from 1 to 5, with 5 being the highest relevance.
-- **Usefulness**: Evaluates how useful the retrieved context is in generating the response. For example, if a context talks about the topic of the user query but it does not contain the information needed to answer the question, it is relevant but not useful. The score ranges from 1 to 5, with 5 being the highest usefulness.
-- **Utilization**: Measures the percentage of the retrieved context that contains information that can be used to generate the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. The score ranges from 0 to 100.
-- **Attribution**: Indicates which of the chunks of the retrieved context can be used to generate the response. It is a list of the indices of the chunks that are used, with the first chunk having index 1.
+| Metric      | Description                                                                                                                                                                                                   | Score Range (Lowest-Highest)                                 |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ |
+| Relevance   | How much the retrieved context is relevant to the user input.                                                                                                                                                 | 1-5                                                          |
+| Usefulness  | Evaluates how useful the retrieved context is in generating the response, that is, whether it contains the information needed to answer the user query.                                                       | 1-5                                                          |
+| Utilization | Measures the percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response.  | 0-100                                                        |
+| Attribution | Indicates which of the chunks of the retrieved context can be used to generate the response.                                                                                                                  | List of indices of the used chunks, first chunk has index 1  |
+
+!!! example
+    The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.
+    For instance, a **high relevance** score but **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it.
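
To make the combination of score and explanation concrete, the following is an invented example of what the **User Input - Context** metrics for one sample could look like; the structure and the values are illustrative, not the module's actual output format.

```python
# Illustrative example of the User Input - Context metrics for one sample
# (structure and values are made up, not the module's actual output format).
user_input_context_metrics = {
    "relevance":   {"score": 4,      "explanation": "The context discusses the topic of the query."},
    "usefulness":  {"score": 2,      "explanation": "It does not contain the fact needed to answer."},
    "utilization": {"score": 35,     "explanation": "About a third of the context supports the response."},
    "attribution": {"score": [1, 3], "explanation": "Chunks 1 and 3 were used to generate the response."},
}
```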

### Context - Response
-- **Faithfulness**: Measures how much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context. The score ranges from 1 to 5, with 5 being the highest faithfulness.
+
+| Metric       | Description                                                                                                                                                   | Score Range (Lowest-Highest) |
+| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
+| Faithfulness | Measures how much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context.  | 1-5                          |

### User Input - Response
-- **Satisfaction**: Evaluates how satisfied the user would be with the generated response. The score ranges from 1 to 5, with 1 a response that does not address the user query and 5 a response that fully addresses and answers the user query.

-## What is the required data?
+
+| Metric       | Description                                                                                                                                                                                                                             | Score Range (Lowest-Highest) |
+| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------- |
+| Satisfaction | Evaluates how satisfied the user would be with the generated response. A low score indicates a response that does not address the user query, while a high score indicates a response that fully addresses and answers the user query.  | 1-5                          |
+
+## Required data

The RAG evaluation module computes the metrics based on the data availability for each sample.
If a sample lacks one of the three components (User Input, Context or Response), only the applicable metrics are computed.
-For instance, if a sample is missing the response, only the **User Input - Context** metrics are computed.
+For instance, if the **response is missing** from a sample, only the **User Input - Context** metrics are computed for that sample.
+
+For metrics that cannot be computed for a specific sample, the lowest score is assigned, with the explanation mentioning the missing component.
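
The rule above can be pictured with a small sketch; the `applicable_metric_groups` helper and the dictionary representation of a sample are hypothetical and only mirror the behaviour described in this section.

```python
# Illustrative sketch of how metric groups could be selected per sample
# (hypothetical helper, not the module's implementation).
def applicable_metric_groups(sample: dict) -> list:
    """Return which metric groups can be computed for a sample with possibly missing components."""
    groups = []
    if sample.get("user_input") and sample.get("context"):
        groups.append("User Input - Context")   # Relevance, Usefulness, Utilization, Attribution
    if sample.get("context") and sample.get("response"):
        groups.append("Context - Response")     # Faithfulness
    if sample.get("user_input") and sample.get("response"):
        groups.append("User Input - Response")  # Satisfaction
    return groups


# A sample with no response only gets the User Input - Context metrics.
print(applicable_metric_groups({"user_input": "What is RAG?", "context": ["..."], "response": None}))
# -> ['User Input - Context']
```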

If data added to a [Task] contains contexts with multiple chunks of text, a [context separator](../task.md#retrieval-augmented-generation) must be provided.
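
As an illustration of how a context separator works, the snippet below joins chunks into a single context string and splits them back; the separator value is an assumption, use whatever separator you configure on the [Task].

```python
# Illustrative only: chunks joined into a single context string with a separator
# (the separator value here is hypothetical; use the one configured on your Task).
SEPARATOR = "\n---\n"

chunks = [
    "Chunk about the product pricing.",
    "Chunk about the refund policy.",
]
context = SEPARATOR.join(chunks)             # what gets stored with the sample
recovered_chunks = context.split(SEPARATOR)  # how the individual chunks can be recovered
assert recovered_chunks == chunks
```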
