Commit 8daef3a

Improve RAG evaluation doc
1 parent 791f945 commit 8daef3a

File tree: 1 file changed, 90 additions & 46 deletions

@@ -1,81 +1,107 @@
# RAG Evaluation

RAG (Retrieval-Augmented Generation) is a way of building AI models that enhances their ability to generate accurate and contextually relevant responses by combining two main steps: **retrieval** and **generation**.

1. **Retrieval**: The model first searches through a large set of documents or pieces of information from a specific knowledge base defined by the system designer to "retrieve" the most relevant ones based on the user query.
2. **Generation**: It then uses these retrieved documents as context to generate a response, which is typically more accurate and aligned with the question than if it had generated text from scratch without specific guidance (a minimal sketch of this two-step flow is shown below).
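
To make the two steps concrete, here is a minimal, illustrative sketch of the retrieve-then-generate flow. The keyword-overlap retriever and the placeholder `generate` function are toy stand-ins, not part of the ML cube Platform or of any specific RAG library.

```python
# Toy sketch of a retrieve-then-generate (RAG) flow.
# The scoring and generation below are placeholders, not a production retriever or LLM.

knowledge_base = [
    "Paris, the capital of France, is known for the Eiffel Tower.",
    "Rome is the capital of Italy.",
]


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query and return the top_k."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]


def generate(query: str, context: list[str]) -> str:
    """Placeholder for the generation step: a real system would prompt an LLM here."""
    prompt = f"Answer using only this context: {context}\nQuestion: {query}"
    return f"[answer produced by an LLM from a prompt of {len(prompt)} characters]"


user_input = "What is the capital of France?"
retrieved_context = retrieve(user_input, knowledge_base)  # 1. Retrieval
response = generate(user_input, retrieved_context)        # 2. Generation
print(retrieved_context)
print(response)
```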

Evaluating RAG involves assessing how well the model performs in both retrieval and generation. This evaluation is crucial to ensure that the model provides accurate and relevant responses to user queries.

The three main components of a RAG framework are:

| Component  | Description                                                                         |
| ---------- | ----------------------------------------------------------------------------------- |
| User Input | The query or question posed by the user.                                            |
| Context    | The retrieved documents or information that the model uses to generate a response.  |
| Response   | The generated answer or output provided by the model.                               |

!!! example
    This is an example of the three components of a RAG system:

    - **User Input**: "What is the capital of France?"
    - **Context**: "Paris, the capital of France, ..."
    - **Response**: "The capital of France is Paris."

## RAG Evaluation Module

The ML cube Platform RAG evaluation module is available for [RAG Tasks](../task.md#retrieval-augmented-generation) and generates an evaluation report for a given set of samples.

!!! info
    It is possible to compute a RAG evaluation report both from the [Web App] and the [SDK]. The computed report can be viewed in the Web App and exported as an Excel file from the SDK.

The report is computed by analyzing the relationships between the three RAG components:

2833
<figure markdown>
2934
![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.png){ width="600"}
30-
<figcaption>ML cube Platform RAG Evaluation</figcaption>
35+
<figcaption>The three evaluations performed by the RAG Evaluation Module</figcaption>
3136
</figure>

The evaluation is performed through an LLM-as-a-Judge approach, where a Large Language Model (LLM) acts as a judge to evaluate the quality of a RAG model.
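
To illustrate the general idea only (the ML cube Platform's actual prompts, judge model, and output parsing are not documented here), a relevance judgment could be requested from an LLM with a prompt along these lines:

```python
# Hypothetical sketch of an LLM-as-a-Judge prompt for the Relevance metric.
# The platform's real prompts and judge model may differ; this shows the technique only.

JUDGE_PROMPT_TEMPLATE = """You are an impartial judge evaluating a RAG system.
Rate how relevant the retrieved context is to the user input on a scale from 1 (lowest) to 5 (highest).
Return the score followed by a short explanation.

User input: {user_input}
Retrieved context: {context}
"""


def build_relevance_prompt(user_input: str, context: str) -> str:
    """Fill the judge prompt template; the result would be sent to a judge LLM of choice."""
    return JUDGE_PROMPT_TEMPLATE.format(user_input=user_input, context=context)


prompt = build_relevance_prompt(
    "What is the capital of France?",
    "Paris, the capital of France, ...",
)
# In a real setup, send this prompt to the judge LLM and parse the returned score and explanation.
print(prompt)
```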

### Computed metrics

This section describes the metrics computed by the RAG evaluation module, grouped by the three relationships shown above. Every computed metric consists of a **value** and an **explanation** of the reasons behind the assigned value.

#### Retrieval Evaluation (User Input - Context)

| Metric      | Description                                                                                                                                                                                          | Returned Value                                               |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| Relevance   | How much the retrieved context is relevant to the user input.                                                                                                                                        | 1-5 (lowest-highest)                                          |
| Usefulness  | How useful the retrieved context is in generating the response, that is, whether it contains the information needed to answer the user query.                                                        | 1-5 (lowest-highest)                                          |
| Utilization | The percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response.  | 0-100 (lowest-highest)                                        |
| Attribution | Which of the chunks of the retrieved context can be used to generate the response.                                                                                                                   | List of indices of the used chunks; the first chunk has index 1 |
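
As a toy illustration of how Attribution and Utilization relate at chunk granularity, consider the sketch below; the actual values are assigned by the LLM judge and may be computed at a finer granularity than whole chunks.

```python
# Toy illustration only: relating Attribution and Utilization at chunk level.
# The real metrics are assigned by the LLM judge and may be computed differently.

retrieved_chunks = [
    "Paris is the capital of France.",        # chunk 1
    "France uses the euro as its currency.",  # chunk 2
    "The Eiffel Tower is located in Paris.",  # chunk 3
]

# Suppose the judge attributes the response only to the first chunk (indices start at 1).
attribution = [1]

# Share of the retrieved chunks that support the response, expressed as a percentage.
utilization = 100 * len(attribution) / len(retrieved_chunks)
print(attribution, round(utilization, 1))  # [1] 33.3
```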

!!! note
    The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.

    For instance, a **high relevance** score but **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it:

    - **User Input**: "How many ECTS does a Computer Science bachelor's degree have?"
    - **Context**: "The main exams in a Bachelor's Degree in Computer Science typically cover programming, data structures, algorithms, computer architecture, operating systems, databases, and software engineering."

    This example has a high relevance score because the context talks about a Computer Science bachelor's degree, but a low usefulness score because it does not contain the specific information about the number of ECTS.

#### Context Factual Correctness (Context - Response)

| Metric       | Description                                                                                                                                            | Returned Value       |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------- |
| Faithfulness | How much the response is consistent with the retrieved context, i.e. does not contradict it. A higher faithfulness score indicates that the response is more aligned with the context. | 1-5 (lowest-highest) |

#### Response Evaluation (User Input - Response)

| Metric       | Description                                                                                                                                                                                                               | Returned Value       |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------- |
| Satisfaction | How satisfied the user would be with the generated response. A low score indicates a response that does not address the user query; a high score indicates a response that fully addresses and answers the user query.  | 1-5 (lowest-highest) |

!!! example
    This is an example of the three evaluations performed by the RAG Evaluation Module:

    - **User Input**: "How many ECTS does a Computer Science bachelor's degree have?"
    - **Context**: "The main exams in a Bachelor's Degree in Computer Science typically cover programming, data structures, algorithms, computer architecture, operating systems, databases, and software engineering."
    - **Response**: "A Bachelor's Degree in Computer Science typically has 180 ECTS."

    | Metric       | Value | Explanation                                                                                              |
    | ------------ | ----- | ---------------------------------------------------------------------------------------------------------- |
    | Relevance    | 5     | High relevance because the context talks about a Computer Science bachelor's degree.                     |
    | Usefulness   | 1     | Low usefulness because the context does not contain the specific information about the number of ECTS.   |
    | Utilization  | 0     | No information in the context to generate the response.                                                  |
    | Attribution  | []    | No chunk of the context can be used to generate the response.                                            |
    | Faithfulness | 5     | High faithfulness because the response does not contradict the context.                                  |
    | Satisfaction | 5     | High satisfaction because the response fully addresses the user query.                                   |

### Required data

Below is a summary table of the input data needed for each metric:

| Metric       | User Input       | Context          | Response         |
| ------------ | ---------------- | ---------------- | ---------------- |
| Relevance    | :material-check: | :material-check: |                  |
| Usefulness   | :material-check: | :material-check: |                  |
| Utilization  | :material-check: | :material-check: |                  |
| Attribution  | :material-check: | :material-check: |                  |
| Faithfulness |                  | :material-check: | :material-check: |
| Satisfaction | :material-check: |                  | :material-check: |

The RAG evaluation module computes the metrics for each sample based on data availability.
If a sample lacks one of the three components (User Input, Context or Response), only the applicable metrics are computed for that sample.
For instance, if in a sample the **response is missing**, only the **User Input - Context** metrics are computed for that sample.

For the metrics that cannot be computed for a specific sample, the lowest score is assigned and the explanation mentions the missing component.
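
The sketch below shows one way to read the required-data table programmatically and decide which metrics apply to a sample; the field names and the selection logic are illustrative, not the platform's internal implementation.

```python
# Hypothetical sketch: selecting which metrics apply to a sample, based on which
# of the three components are present (mirrors the required-data table above).
# Metrics outside the returned list receive the lowest score, with an explanation
# mentioning the missing component.

REQUIRED_DATA = {
    "Relevance":    {"user_input", "context"},
    "Usefulness":   {"user_input", "context"},
    "Utilization":  {"user_input", "context"},
    "Attribution":  {"user_input", "context"},
    "Faithfulness": {"context", "response"},
    "Satisfaction": {"user_input", "response"},
}


def applicable_metrics(sample: dict) -> list[str]:
    """Return the metrics whose required components are all present in the sample."""
    available = {field for field, value in sample.items() if value}
    return [metric for metric, needed in REQUIRED_DATA.items() if needed <= available]


# A sample with a missing response: only the User Input - Context metrics apply.
sample = {"user_input": "How many ECTS ...?", "context": "The main exams ...", "response": None}
print(applicable_metrics(sample))  # ['Relevance', 'Usefulness', 'Utilization', 'Attribution']
```
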
@@ -89,12 +115,30 @@ When requesting the evaluation, a **timestamp interval** must be provided to spe

The following code demonstrates how to compute a RAG evaluation report for a given timestamp interval and export it.

```python
# Computing the RAG evaluation report over the chosen timestamp interval
rag_evaluation_job_id = client.compute_rag_evaluation_report(
    task_id=task_id,
    report_name="rag_evaluation_report_name",
    from_timestamp=from_timestamp,
    to_timestamp=to_timestamp,
)

# Waiting for the job to complete
client.wait_job_completion(job_id=rag_evaluation_job_id)

# Getting the id of the evaluation report (here, the last one in the returned list)
reports = client.get_rag_evaluation_reports(task_id=task_id)
report_id = reports[-1].id

# Exporting the evaluation report
folder_path = 'path/to/folder/where/to/save/report/'
client.export_rag_evaluation_report(
    report_id=report_id,
    folder=folder_path,
    file_name='rag_evaluation_report'
)
```
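
To inspect the exported report programmatically rather than in Excel, it can presumably be loaded with pandas; the `.xlsx` extension and the sheet layout are assumptions here, since the SDK call above only receives a base file name.

```python
# Assumption: the export produces '<file_name>.xlsx' inside the chosen folder;
# the exact extension and sheet layout are not documented here, so inspect them after loading.
import os

import pandas as pd

folder_path = 'path/to/folder/where/to/save/report/'
report_path = os.path.join(folder_path, "rag_evaluation_report.xlsx")

report_df = pd.read_excel(report_path)
print(report_df.columns.tolist())
print(report_df.head())
```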

[Task]: ../task.md
[Web App]: https://app.platform.mlcube.com/
[SDK]: ../../api/python/index.md