Commit 791f945

In rag evaluation use png instead of svg, remove we/our reference, add table, add SDK example
1 parent: e38e911

File tree

3 files changed (+33, -19 lines changed)


md-docs/imgs/rag_evaluation.png

1.35 MB (binary image)

md-docs/imgs/rag_evaluation.svg

Lines changed: 0 additions & 1 deletion
This file was deleted.

md-docs/user_guide/modules/rag_evaluation.md

Lines changed: 33 additions & 18 deletions
@@ -9,7 +9,7 @@ RAG (Retrieval-Augmented Generation) is a way of building AI models that enhance

 Evaluating RAG involves assessing how well the model does in both retrieval and generation.

-Our RAG evaluation module analyzes the three main components of a RAG framework:
+The RAG evaluation module analyzes the three main components of a RAG framework:

 | Component | Description |
 | ---------- | ---------------------------------------------------------------------------------- |
@@ -19,20 +19,22 @@ Our RAG evaluation module analyzes the three main components of a RAG framework:

 In particular, the analysis is performed on the relationships between these components:

-- **User Input - Context**: Retrieval Evaluation
-- **Context - Response**: Context Factual Correctness
-- **User Input - Response**: Response Evaluation
+| Relationship | Evaluation |
+| --------------------- | --------------------------- |
+| User Input - Context | Retrieval Evaluation |
+| Context - Response | Context Factual Correctness |
+| User Input - Response | Response Evaluation |

 <figure markdown>
-![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.svg){ width="600"}
+![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.png){ width="600"}
 <figcaption>ML cube Platform RAG Evaluation</figcaption>
 </figure>

 The evaluation is performed through an LLM-as-a-Judge approach, where a Large Language Model (LLM) acts as a judge to evaluate the quality of a RAG model.

 ## Computed metrics

-In this paragraph we describe the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above. Every metrics computed is composed by a **score** and an **explanation** provided by the LLM, which explains the reasons behind the assigned score.
+This paragraph describes the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above. Every computed metric consists of a **score** and an **explanation** provided by the LLM, which explains the reasons behind the assigned score.

 Below is a summary table of the metrics:

@@ -47,28 +49,28 @@ Below is a summary table of the metrics:

 ### User Input - Context

-| Metric | Description | Score Range (Lowest-Highest) |
-| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------- |
-| Relevance | How much the retrieved context is relevant to the user input. | 1-5 |
-| Usefulness | Evaluates how useful the retrieved context is in generating the response, that is if it contains the information to answer the user query | 1-5 |
-| Utilization | Measures the percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. | 0-100 |
-| Attribution | Indicates which of the chunks of the retrieved context can be used to generate the response. | List of indices of the used chunks, first chunk has index 1 |
+| Metric | Description | Score Range (Lowest-Highest) |
+| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
+| Relevance | How much the retrieved context is relevant to the user input. | 1-5 |
+| Usefulness | How useful the retrieved context is in generating the response, that is if it contains the information to answer the user query. | 1-5 |
+| Utilization | The percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. | 0-100 |
+| Attribution | Which of the chunks of the retrieved context can be used to generate the response. | List of indices of the used chunks, first chunk has index 1 |

 !!! example
     The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.
     For instance, a **high relevance** score but **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it.

 ### Context - Response

-| Metric | Description | Score Range (Lowest-Highest) |
-| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
-| Faithfulness | Measures how much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context. | 1-5 |
+| Metric | Description | Score Range (Lowest-Highest) |
+| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
+| Faithfulness | How much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context. | 1-5 |

 ### User Input - Response

-| Metric | Description | Score Range (Lowest-Highest) |
-| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
-| Satisfaction | Evaluates how satisfied the user would be with the generated response. A low score indicates a response that does not address the user quey, a high score indicates a response that fully addresses and answers the user query. | 1-5 |
+| Metric | Description | Score Range (Lowest-Highest) |
+| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
+| Satisfaction | How satisfied the user would be with the generated response. A low score indicates a response that does not address the user query, a high score indicates a response that fully addresses and answers the user query. | 1-5 |

 ## Required data

@@ -82,4 +84,17 @@ If data added to a [Task] contains contexts with multiple chunks of text, a [con

 When requesting the evaluation, a **timestamp interval** must be provided to specify the time range of the data to be evaluated.

+??? code-block "SDK Example"
+
+    The following code demonstrates how to compute a RAG evaluation report for a given timestamp interval.
+
+    ```python
+    rag_evaluation_job_id = client.compute_rag_evaluation_report(
+        task_id=task_id,
+        report_name="rag_evaluation_report_name",
+        from_timestamp=from_timestamp,
+        to_timestamp=to_timestamp,
+    )
+    ```
+
 [Task]: ../task.md
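
The SDK Example added above passes `from_timestamp` and `to_timestamp` without showing how those values are built. A minimal sketch, assuming the interval is expressed as Unix epoch seconds in UTC; the exact type expected by `compute_rag_evaluation_report` should be checked against the SDK reference:

```python
# Hypothetical values for the timestamp interval used in the SDK Example above.
# Assumption: the interval is given as Unix epoch seconds (UTC); verify the
# expected type in the ML cube Platform SDK reference.
from datetime import datetime, timedelta, timezone

now = datetime.now(tz=timezone.utc)
to_timestamp = now.timestamp()                          # end of the evaluated range
from_timestamp = (now - timedelta(days=7)).timestamp()  # start: one week earlier
```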
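
The metrics described in this document are produced with an LLM-as-a-Judge approach, each returning a score plus an explanation. Below is a minimal sketch of that general pattern applied to a Relevance-style metric, assuming a generic `complete` callable that sends a prompt to any LLM; the prompt wording and the `judge_relevance` helper are illustrative and are not part of the ML cube Platform SDK.

```python
# Illustrative LLM-as-a-Judge scorer for a Relevance-style metric.
# `complete` is any function that takes a prompt string and returns the raw
# LLM answer; nothing here is ML cube Platform code.
import json
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge.
Rate how relevant the retrieved context is to the user input on a 1-5 scale
(1 = not relevant, 5 = fully relevant) and explain your reasoning.
Answer only with a JSON object: {{"score": <int>, "explanation": "<string>"}}

User input:
{user_input}

Retrieved context:
{context}
"""


def judge_relevance(
    user_input: str,
    context: str,
    complete: Callable[[str], str],
) -> tuple[int, str]:
    """Return (score, explanation), mirroring the score + explanation
    structure of every metric computed by the module."""
    answer = complete(JUDGE_PROMPT.format(user_input=user_input, context=context))
    parsed = json.loads(answer)
    return int(parsed["score"]), str(parsed["explanation"])
```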
