Commit e38e911

Modify rag evaluation paragraph titles, add tables, change image to reflect chosen palette
1 parent 3cb7c89 commit e38e911

2 files changed: +44 -16 lines changed

md-docs/imgs/rag_evaluation.svg

Lines changed: 1 addition & 0 deletions

md-docs/user_guide/modules/rag_evaluation.md

Lines changed: 43 additions & 16 deletions
@@ -1,6 +1,6 @@
# Rag Evaluation

-## What is RAG Evaluation?
+## Introduction

RAG (Retrieval-Augmented Generation) is a way of building AI models that enhances their ability to generate accurate and contextually relevant responses by combining two main steps: **retrieval** and **generation**.
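
To make the two steps concrete, here is a minimal retrieve-then-generate sketch; the toy keyword-overlap retriever and the placeholder `generate` function are assumptions for illustration only, not ML cube Platform code.

```python
# Minimal retrieve-then-generate sketch (illustrative only, not ML cube Platform code).
from typing import List


def retrieve(user_input: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval step: rank documents by word overlap with the user input."""
    query_words = set(user_input.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate(user_input: str, context: List[str]) -> str:
    """Placeholder generation step: a real system would call an LLM with the context."""
    return f"Answer to {user_input!r} grounded in {len(context)} retrieved chunk(s)."


documents = [
    "RAG combines document retrieval with LLM generation.",
    "The weather is sunny today.",
]
context = retrieve("What is RAG?", documents)  # retrieval step
response = generate("What is RAG?", context)   # generation step
```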

@@ -11,9 +11,11 @@ Evaluating RAG involves assessing how well the model does in both retrieval and

Our RAG evaluation module analyzes the three main components of a RAG framework:

-- **User Input**: The query or question posed by the user.
-- **Context**: The retrieved documents or information that the model uses to generate a response. A context can consist of one or more chunks of text.
-- **Response**: The generated answer or output provided by the model.
+| Component  | Description                                                                         |
+| ---------- | ------------------------------------------------------------------------------------ |
+| User Input | The query or question posed by the user.                                              |
+| Context    | The retrieved documents or information that the model uses to generate a response.    |
+| Response   | The generated answer or output provided by the model.                                 |

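For illustration only, a single sample carrying these three components could be represented as below; the `RagSample` class and its field names are hypothetical, not the platform's data schema.

```python
# Hypothetical representation of one RAG sample (not the ML cube Platform schema).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RagSample:
    user_input: Optional[str]     # the query posed by the user
    context: Optional[List[str]]  # the retrieved chunks of text
    response: Optional[str]       # the generated answer


sample = RagSample(
    user_input="What is the refund policy?",
    context=["Refunds are accepted within 30 days of purchase."],
    response="You can request a refund within 30 days of purchase.",
)
```
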
In particular, the analysis is performed on the relationships between these components:

@@ -22,34 +24,59 @@ In particular, the analysis is performed on the relationships between these comp
- **User Input - Response**: Response Evaluation

<figure markdown>
-![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.png){ width="600"}
+![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.svg){ width="600"}
<figcaption>ML cube Platform RAG Evaluation</figcaption>
</figure>

-The evaluation is performed through an LLM-as-a-Judge approach, where a Language Model (LM) acts as a judge to evaluate the quality of a RAG model.
+The evaluation is performed through an LLM-as-a-Judge approach, where a Large Language Model (LLM) acts as a judge to evaluate the quality of a RAG model.
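
As an intuition of the LLM-as-a-Judge approach, here is a hedged sketch; the prompt wording, the JSON output format and the `call_llm` helper are assumptions made for illustration, not the module's actual prompts or API.

```python
# Sketch of an LLM-as-a-Judge call (prompt and helper are illustrative assumptions).
import json

JUDGE_PROMPT = """You are a judge evaluating a RAG system.
User input: {user_input}
Retrieved context: {context}
Rate how relevant the context is to the user input on a 1 (lowest) to 5 (highest) scale,
and explain your rating. Reply as JSON: {{"score": <int>, "explanation": "<text>"}}."""


def judge_relevance(user_input: str, context: str, call_llm) -> dict:
    """`call_llm` is any text-in/text-out function wrapping the judge LLM."""
    raw = call_llm(JUDGE_PROMPT.format(user_input=user_input, context=context))
    return json.loads(raw)  # e.g. {"score": 4, "explanation": "..."}
```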

-## What are the computed metrics?
+## Computed metrics

-Below are the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above.
+In this section we describe the metrics computed by the RAG evaluation module, divided into the three relationships mentioned above. Every computed metric consists of a **score** and an **explanation** provided by the LLM, describing the reasons behind the assigned score.
+
+Below is a summary table of the metrics:
+
+| Metric       | User Input       | Context          | Response         |
+| ------------ | ---------------- | ---------------- | ---------------- |
+| Relevance    | :material-check: | :material-check: |                  |
+| Usefulness   | :material-check: | :material-check: |                  |
+| Utilization  | :material-check: | :material-check: |                  |
+| Attribution  | :material-check: | :material-check: |                  |
+| Faithfulness |                  | :material-check: | :material-check: |
+| Satisfaction | :material-check: |                  | :material-check: |

### User Input - Context

-- **Relevance**: Measures how much the retrieved context is relevant to the user input. The score ranges from 1 to 5, with 5 being the highest relevance.
-- **Usefulness**: Evaluates how useful the retrieved context is in generating the response. For example, if a context talks about the topic of the user query but it does not contain the information needed to answer the question, it is relevant but not useful. The score ranges from 1 to 5, with 5 being the highest usefulness.
-- **Utilization**: Measures the percentage of the retrieved context that contains information that can be used to generate the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. The score ranges from 0 to 100.
-- **Attribution**: Indicates which of the chunks of the retrieved context can be used to generate the response. It is a list of the indices of the chunks that are used, with the first chunk having index 1.
+| Metric      | Description                                                                                                                                                                                                   | Score Range (Lowest-Highest)                                 |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ |
+| Relevance   | How much the retrieved context is relevant to the user input.                                                                                                                                                 | 1-5                                                          |
+| Usefulness  | Evaluates how useful the retrieved context is in generating the response, that is, whether it contains the information needed to answer the user query.                                                       | 1-5                                                          |
+| Utilization | Measures the percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response.  | 0-100                                                        |
+| Attribution | Indicates which of the chunks of the retrieved context can be used to generate the response.                                                                                                                  | List of indices of the used chunks, first chunk has index 1  |
+
+!!! example
+    The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.
+    For instance, a **high relevance** score but **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it.
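
To make the combination of score and explanation concrete, the following is an invented example of what the **User Input - Context** metrics for one sample could look like; the structure and the values are illustrative, not the module's actual output format.

```python
# Illustrative example of the User Input - Context metrics for one sample
# (structure and values are made up, not the module's actual output format).
user_input_context_metrics = {
    "relevance":   {"score": 4,      "explanation": "The context discusses the topic of the query."},
    "usefulness":  {"score": 2,      "explanation": "It does not contain the fact needed to answer."},
    "utilization": {"score": 35,     "explanation": "About a third of the context supports the response."},
    "attribution": {"score": [1, 3], "explanation": "Chunks 1 and 3 were used to generate the response."},
}
```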

### Context - Response
-- **Faithfulness**: Measures how much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context. The score ranges from 1 to 5, with 5 being the highest faithfulness.
+
+| Metric       | Description                                                                                                                                                   | Score Range (Lowest-Highest) |
+| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
+| Faithfulness | Measures how much the response contradicts the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context.  | 1-5                          |

### User Input - Response
-- **Satisfaction**: Evaluates how satisfied the user would be with the generated response. The score ranges from 1 to 5, with 1 a response that does not address the user query and 5 a response that fully addresses and answers the user query.

-## What is the required data?
+
+| Metric       | Description                                                                                                                                                                                                                             | Score Range (Lowest-Highest) |
+| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------- |
+| Satisfaction | Evaluates how satisfied the user would be with the generated response. A low score indicates a response that does not address the user query, while a high score indicates a response that fully addresses and answers the user query.  | 1-5                          |
+
+## Required data

The RAG evaluation module computes the metrics based on the data availability for each sample.
If a sample lacks one of the three components (User Input, Context or Response), only the applicable metrics are computed.
-For instance, if a sample is missing the response, only the **User Input - Context** metrics are computed.
+For instance, if the **response is missing** from a sample, only the **User Input - Context** metrics are computed for that sample.
+
+For metrics that cannot be computed for a specific sample, the lowest score is assigned, with the explanation mentioning the missing component.
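
The rule above can be pictured with a small sketch; the `applicable_metric_groups` helper and the dictionary representation of a sample are hypothetical and only mirror the behaviour described in this section.

```python
# Illustrative sketch of how metric groups could be selected per sample
# (hypothetical helper, not the module's implementation).
def applicable_metric_groups(sample: dict) -> list:
    """Return which metric groups can be computed for a sample with possibly missing components."""
    groups = []
    if sample.get("user_input") and sample.get("context"):
        groups.append("User Input - Context")   # Relevance, Usefulness, Utilization, Attribution
    if sample.get("context") and sample.get("response"):
        groups.append("Context - Response")     # Faithfulness
    if sample.get("user_input") and sample.get("response"):
        groups.append("User Input - Response")  # Satisfaction
    return groups


# A sample with no response only gets the User Input - Context metrics.
print(applicable_metric_groups({"user_input": "What is RAG?", "context": ["..."], "response": None}))
# -> ['User Input - Context']
```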

If data added to a [Task] contains contexts with multiple chunks of text, a [context separator](../task.md#retrieval-augmented-generation) must be provided.
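
As an illustration of how a context separator works, the snippet below joins chunks into a single context string and splits them back; the separator value is an assumption, use whatever separator you configure on the [Task].

```python
# Illustrative only: chunks joined into a single context string with a separator
# (the separator value here is hypothetical; use the one configured on your Task).
SEPARATOR = "\n---\n"

chunks = [
    "Chunk about the product pricing.",
    "Chunk about the refund policy.",
]
context = SEPARATOR.join(chunks)             # what gets stored with the sample
recovered_chunks = context.split(SEPARATOR)  # how the individual chunks can be recovered
assert recovered_chunks == chunks
```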
