The evaluation is performed through an LLM-as-a-Judge approach, where a Large Language Model (LLM) acts as a judge to evaluate the quality of a RAG model.
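As a minimal illustration of the approach, a judge prompt for a single metric could be built along the lines of the hypothetical sketch below; the function name, rubric, and output format are assumptions for illustration, not the module's actual internal prompt.

```python
# Illustrative only: a hypothetical LLM-as-a-Judge prompt for a single metric.
# The evaluation module's real prompts and scoring rubrics may differ.
def build_relevance_judge_prompt(user_input: str, retrieved_context: str) -> str:
    """Ask the judge LLM to score context relevance on a 1-5 scale and explain why."""
    return (
        "You are an impartial judge evaluating a RAG system.\n\n"
        f"User input:\n{user_input}\n\n"
        f"Retrieved context:\n{retrieved_context}\n\n"
        "Rate how relevant the retrieved context is to the user input on a scale "
        "of 1 (not relevant) to 5 (highly relevant) and explain your reasoning. "
        'Reply as JSON: {"score": <1-5>, "explanation": "<your reasoning>"}.'
    )
```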
## Computed metrics
This paragraph describes the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above. Each computed metric consists of a **score** and an **explanation** provided by the LLM, which explains the reasons behind the assigned score.
Below is a summary table of the metrics:
### User Input - Context
| Metric | Description | Score Range (Lowest-Highest) |
|--------|-------------|------------------------------|
| Relevance | How relevant the retrieved context is to the user input. | 1-5 |
| Usefulness | How useful the retrieved context is in generating the response, that is, whether it contains the information needed to answer the user query. | 1-5 |
| Utilization | The percentage of the retrieved context that contains information used in the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. | 0-100 |
| Attribution | Which chunks of the retrieved context can be used to generate the response. | List of indices of the used chunks; the first chunk has index 1 |
!!! example
    The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.

    For instance, a **high relevance** score but a **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it.
### Context - Response
| Metric | Description | Score Range (Lowest-Highest) |
|--------|-------------|------------------------------|
| Faithfulness | How well the response aligns with the retrieved context, i.e., whether it contradicts it. A higher faithfulness score indicates that the response is more aligned with the context. | 1-5 |
### User Input - Response
| Metric | Description | Score Range (Lowest-Highest) |
|--------|-------------|------------------------------|
| Satisfaction | How satisfied the user would be with the generated response. A low score indicates a response that does not address the user query; a high score indicates a response that fully addresses and answers the user query. | 1-5 |
## Required data
If data added to a [Task] contains contexts with multiple chunks of text, a [con
When requesting the evaluation, a **timestamp interval** must be provided to specify the time range of the data to be evaluated.
??? code-block "SDK Example"
    The following code demonstrates how to compute a RAG evaluation report for a given timestamp interval.
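    The client, method, and field names in this sketch are illustrative placeholders rather than the SDK's actual API; refer to the SDK reference for the real calls.

    ```python
    # Illustrative sketch only: class, method, and field names are hypothetical
    # placeholders, not the SDK's real API.
    from datetime import datetime, timezone

    from rag_evaluation_sdk import Client  # hypothetical import path

    client = Client(base_url="http://localhost:9000")  # assumed client constructor
    task = client.get_task("my-rag-task-id")           # assumed lookup of an existing Task

    # The timestamp interval selects which logged RAG interactions
    # (user input, retrieved context, response) are evaluated.
    report = task.compute_rag_evaluation_report(       # assumed method name
        from_timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
        to_timestamp=datetime(2024, 1, 31, tzinfo=timezone.utc),
    )

    # Each computed metric carries a score and the judge LLM's explanation.
    for metric in report.metrics:                      # assumed report structure
        print(metric.name, metric.score, metric.explanation)
    ```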