
Commit 333c52c

Merge branch 'dev' into dev-topics
2 parents: 26ce68a + 22deb5c

9 files changed (+3995, -611 lines)

Makefile

Lines changed: 3 additions & 31 deletions
@@ -1,45 +1,17 @@
  # === USER PARAMETERS

  ifdef OS
-     export PYTHON_COMMAND=python
-     export UV_INSTALL_CMD=powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
      export VENV_BIN=.venv/Scripts
  else
-     export PYTHON_COMMAND=python3.12
-     export UV_INSTALL_CMD=curl -LsSf https://astral.sh/uv/install.sh | sh
      export VENV_BIN=.venv/bin
  endif

- export SRC_DIR=ml3_platform_docs
-
- DEPLOY_ENVIRONMENT=$(shell if [ $(findstring main, $(BRANCH_NAME)) ]; then \
-     echo 'prod'; \
-     elif [ $(findstring pre, $(BRANCH_NAME)) ]; then \
-     echo 'pre'; \
-     else \
-     echo 'dev'; \
-     fi)
- # If use deploy_environment in the tag system
- # `y` => yes
- # `n` => no
- USE_DEPLOY_ENVIRONMENT=n
-
  # == SETUP REPOSITORY AND DEPENDENCIES

- install-uv:
-     # install uv package manager
-     # $(UV_INSTALL_CMD)
-     # create environment
-     uv venv -p $(PYTHON_COMMAND)
-
- compile:
-     # install extra dev group
-     uv pip compile pyproject.toml -o requirements.txt --extra dev --cache-dir .uv_cache
-
- install:
-     uv pip sync requirements.txt --cache-dir .uv_cache
+ dev-sync:
+     uv sync --cache-dir .uv_cache --all-extras

- setup: install-uv compile install
+ setup: dev-sync

  build-docs:
      . $(VENV_BIN)/activate && mkdocs build

docs/imgs/rag_evaluation.png

1.46 MB

md-docs/imgs/rag_evaluation.png

108 KB

md-docs/user_guide/modules/index.md

Lines changed: 4 additions & 4 deletions
@@ -29,30 +29,30 @@ Modules can be always active or on-demand: Monitoring module and Drift Explainab

      Update your model to handle the drift.

-     [:octicons-arrow-right-24: More info](user_guide/data.md)
+     [:octicons-arrow-right-24: More info](retraining.md)

  - :material-text-box-check:{ .lg .middle } **RAG Evaluation**

      ---

      Check the quality of your RAG system.

-     [:octicons-arrow-right-24: More info](user_guide/integrations/index.md)
+     [:octicons-arrow-right-24: More info](rag_evaluation.md)

  - :material-shield-lock:{ .lg .middle } **LLM Security**

      ---

      Verify robustness of your solution.

-     [:octicons-arrow-right-24: More info](api/index.md)
+     [:octicons-arrow-right-24: More info](llm_security.md)

  - :material-view-dashboard:{ .lg .middle } **Topic Modeling**

      ---

      Identify sub-domains in your data.

-     [:octicons-arrow-right-24: More info](api/examples.md)
+     [:octicons-arrow-right-24: More info](topic_modeling.md)

  </div>
Lines changed: 144 additions & 1 deletion
@@ -1 +1,144 @@

# RAG Evaluation

RAG (Retrieval-Augmented Generation) is a way of building AI models that enhances their ability to generate accurate and contextually relevant responses by combining two main steps: **retrieval** and **generation**.

1. **Retrieval**: The model first searches through a large set of documents or pieces of information from a specific knowledge base defined by the system designer to "retrieve" the most relevant ones based on the user query.
2. **Generation**: It then uses these retrieved documents as context to generate a response, which is typically more accurate and aligned with the question than if it had generated text from scratch without specific guidance.

Evaluating RAG involves assessing how well the model performs in both retrieval and generation. This evaluation is crucial to ensure that the model provides accurate and relevant responses to user queries.
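
The two steps described above can be sketched in a few lines of Python. The snippet below is purely illustrative and is not ML cube Platform code: the knowledge base, the word-overlap retriever, and the stubbed `generate` function are simplified stand-ins for a real vector store and LLM.

```python
# A minimal, illustrative retrieve-then-generate sketch (not ML cube Platform code).
# Retrieval here is naive word overlap; a real system would use embeddings and an LLM.

KNOWLEDGE_BASE = [
    "Paris, the capital of France, is known for the Eiffel Tower.",
    "Rome is the capital of Italy.",
    "Berlin is the capital of Germany.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top_k."""
    query_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: a real system would prompt a model with the context."""
    return f"Based on the retrieved context, the answer to '{query}' is: {context[0]}"

user_input = "What is the capital of France?"
retrieved_context = retrieve(user_input, KNOWLEDGE_BASE)
response = generate(user_input, retrieved_context)
print(response)
```

In a production RAG system the retriever would use embeddings and the generator would prompt an LLM with the retrieved chunks; the retrieve-then-generate control flow is the part the evaluation module analyzes.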
The three main components of a RAG framework are:

| Component  | Description                                                                         |
| ---------- | ----------------------------------------------------------------------------------- |
| User Input | The query or question posed by the user.                                            |
| Context    | The retrieved documents or information that the model uses to generate a response.  |
| Response   | The generated answer or output provided by the model.                               |

!!! example
    This is an example of the three components of a RAG system:

    - **User Input**: "What is the capital of France?"
    - **Context**: "Paris, the capital of France, ..."
    - **Response**: "The capital of France is Paris."

## RAG Evaluation Module

The ML cube Platform RAG evaluation module is available for [RAG Tasks](../task.md#retrieval-augmented-generation) and generates an evaluation report for a given set of samples.

!!! info
    It is possible to compute a RAG evaluation report both from the [Web App] and the [SDK]. The computed report can be viewed in the Web App and exported as an Excel file from the SDK.

The report is computed by analyzing the relationships between the three RAG components:

<figure markdown>
  ![ML cube Platform RAG Evaluation](../../imgs/rag_evaluation.png){ width="600" }
  <figcaption>The three evaluations performed by the RAG Evaluation Module</figcaption>
</figure>

### Computed metrics

This section describes the metrics computed by the RAG evaluation module, divided into the three relationships shown above. Every computed metric is composed of a **value** and an **explanation** of the reasons behind the assigned value.

#### Retrieval Evaluation (User Input - Context)

| Metric      | Description | Returned Value |
| ----------- | ----------- | -------------- |
| Relevance   | How relevant the retrieved context is to the user input. | 1-5 (lowest-highest) |
| Usefulness  | How useful the retrieved context is in generating the response, that is, whether it contains the information needed to answer the user query. | 1-5 (lowest-highest) |
| Utilization | The percentage of the retrieved context that contains information for the response. A higher utilization score indicates that more of the retrieved context is useful for generating the response. | 0-100 (lowest-highest) |
| Attribution | Which chunks of the retrieved context can be used to generate the response. | List of indices of the used chunks, first chunk has index 1 |

!!! note
    The **combination** of the metrics provides a comprehensive evaluation of the quality of the retrieved context.

    For instance, a **high relevance** score but a **low usefulness** score indicates a context that talks about the topic of the user query but does not contain the information needed to answer it:

    - **User Input**: "How many ECTS does a Computer Science bachelor's degree have?"
    - **Context**: "The main exams in a Bachelor's Degree in Computer Science typically cover programming, data structures, algorithms, computer architecture, operating systems, databases, and software engineering."

    This example has a high relevance score because the context talks about a Computer Science bachelor's degree, but a low usefulness score because it does not contain the specific information about the number of ECTS.

#### Context Factual Correctness (Context - Response)

| Metric       | Description | Returned Value |
| ------------ | ----------- | -------------- |
| Faithfulness | The extent to which the response avoids contradicting the retrieved context. A higher faithfulness score indicates that the response is more aligned with the context. | 1-5 (lowest-highest) |

#### Response Evaluation (User Input - Response)

| Metric       | Description | Returned Value |
| ------------ | ----------- | -------------- |
| Satisfaction | How satisfied the user would be with the generated response. A low score indicates a response that does not address the user query; a high score indicates a response that fully addresses and answers the user query. | 1-5 (lowest-highest) |

!!! example
    This is an example of the three evaluations performed by the RAG Evaluation Module:

    - **User Input**: "How many ECTS does a Computer Science bachelor's degree have?"
    - **Context**: "The main exams in a Bachelor's Degree in Computer Science typically cover programming, data structures, algorithms, computer architecture, operating systems, databases, and software engineering."
    - **Response**: "A Bachelor's Degree in Computer Science typically has 180 ECTS."

    | Metric       | Value | Explanation |
    | ------------ | ----- | ----------- |
    | Relevance    | 5     | High relevance because the context talks about a Computer Science bachelor's degree. |
    | Usefulness   | 1     | Low usefulness because the context does not contain the specific information about the number of ECTS. |
    | Utilization  | 0     | No information in the context to generate the response. |
    | Attribution  | []    | No chunk of the context can be used to generate the response. |
    | Faithfulness | 5     | High faithfulness because the response does not contradict the context. |
    | Satisfaction | 5     | High satisfaction because the response fully addresses the user query. |

### Required data

Below is a summary table of the input data needed for each metric:

| Metric       | User Input       | Context          | Response         |
| ------------ | ---------------- | ---------------- | ---------------- |
| Relevance    | :material-check: | :material-check: |                  |
| Usefulness   | :material-check: | :material-check: |                  |
| Utilization  | :material-check: | :material-check: |                  |
| Attribution  | :material-check: | :material-check: |                  |
| Faithfulness |                  | :material-check: | :material-check: |
| Satisfaction | :material-check: |                  | :material-check: |

The RAG evaluation module computes the metrics for each sample based on data availability.
If a sample lacks one of the three components (User Input, Context or Response), only the applicable metrics are computed for that sample.
For instance, if the **response is missing** from a sample, only the **User Input - Context** metrics are computed for it.

For the metrics that cannot be computed for a specific sample, the lowest score is assigned, with the explanation mentioning the missing component.
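
As a rough illustration of this availability rule, the sketch below is not ML cube Platform code and the sample dictionary layout is an assumption made for the example; it simply maps the components present in a sample to the metrics from the table above that can be computed for it.

```python
# Illustrative only: which metrics apply to a sample, given which of the three
# RAG components it contains (mirrors the "Required data" table above).
REQUIRED_COMPONENTS = {
    "Relevance": {"user_input", "context"},
    "Usefulness": {"user_input", "context"},
    "Utilization": {"user_input", "context"},
    "Attribution": {"user_input", "context"},
    "Faithfulness": {"context", "response"},
    "Satisfaction": {"user_input", "response"},
}

def computable_metrics(sample: dict) -> list[str]:
    """Return the metrics whose required components are all present in the sample."""
    available = {key for key, value in sample.items() if value is not None}
    return [metric for metric, required in REQUIRED_COMPONENTS.items() if required <= available]

# A sample with a missing response: only the User Input - Context metrics apply.
sample = {"user_input": "How many ECTS ...?", "context": "The main exams ...", "response": None}
print(computable_metrics(sample))  # ['Relevance', 'Usefulness', 'Utilization', 'Attribution']
```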
If data added to a [Task] contains contexts with multiple chunks of text, a [context separator](../task.md#retrieval-augmented-generation) must be provided.

When requesting the evaluation, a **timestamp interval** must be provided to specify the time range of the data to be evaluated.

??? code-block "SDK Example"

    The following code demonstrates how to compute a RAG evaluation report for a given timestamp interval and export it.

    ```python
    # Computing the RAG evaluation report
    rag_evaluation_job_id = client.compute_rag_evaluation_report(
        task_id=task_id,
        report_name="rag_evaluation_report_name",
        from_timestamp=from_timestamp,
        to_timestamp=to_timestamp,
    )

    # Waiting for the job to complete
    client.wait_job_completion(job_id=rag_evaluation_job_id)

    # Getting the evaluation report id
    reports = client.get_rag_evaluation_reports(task_id=task_id)
    report_id = reports[-1].id

    # Exporting the evaluation report
    folder_path = 'path/to/folder/where/to/save/report/'
    client.export_rag_evaluation_report(
        report_id=report_id,
        folder=folder_path,
        file_name='rag_evaluation_report'
    )
    ```

[Task]: ../task.md
[Web App]: https://app.platform.mlcube.com/
[SDK]: ../../api/python/index.md
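
As a side note, the exported Excel report can then be inspected with pandas, which is already a project dependency. The file name and `.xlsx` extension below are assumptions based on the `file_name` argument in the example above, not something guaranteed by the SDK.

```python
import pandas as pd

# Assumption: the export above produced 'rag_evaluation_report.xlsx' inside
# folder_path; adjust the name or extension if the SDK writes a different one.
folder_path = 'path/to/folder/where/to/save/report/'
report_df = pd.read_excel(folder_path + 'rag_evaluation_report.xlsx')
print(report_df.head())
```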

md-docs/user_guide/task.md

Lines changed: 3 additions & 3 deletions
@@ -50,7 +50,7 @@ Indeed, each Task Type has a set of ML cube Platform modules:
  | LLM Security | :material-close: | :material-close: | :material-check: | :material-close: |

  !!! Tip
-     On the left side of the web app page the Task menù is present, with links to the above mentioned modules and Task settings.
+     On the left side of the web app page the Task menu is present, with links to the above mentioned modules and Task settings.

  ## Task Type

@@ -140,9 +140,9 @@ Moreover, in this Task, the Prediction is a text as well and the input is compos
  RAG tasks has additional the attribute *context separator* which is a string used to separate different retrieved contexts into chunks. Context data is sent as a single string, however, in RAG settings multiple documents can be retrieved. In this case, context separator is used to distinguish them. It is optional since a single context can be provided.

  !!! example
-     Context separator: <<sep>>
+     Context separator: <<sep\>\>

-     Context data: The capital of Italy is Rome.<<sep>>Rome is the capital of Italy.<<sep>>Rome was the capital of Roman Empire.
+     Context data: The capital of Italy is Rome.<<sep\>\>Rome is the capital of Italy.<<sep\>\>Rome was the capital of Roman Empire.

  Contexts:
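
For illustration only, splitting the context data from the example above on the configured separator is a plain string operation; this is not an SDK call, just a sketch of how the separator delimits chunks.

```python
# Split a multi-chunk context string on the configured separator (illustrative).
context_separator = "<<sep>>"
context_data = (
    "The capital of Italy is Rome.<<sep>>Rome is the capital of Italy."
    "<<sep>>Rome was the capital of Roman Empire."
)
contexts = context_data.split(context_separator)
print(contexts)
# ['The capital of Italy is Rome.', 'Rome is the capital of Italy.',
#  'Rome was the capital of Roman Empire.']
```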

pyproject.toml

Lines changed: 1 addition & 5 deletions
@@ -1,7 +1,3 @@
- [build-system]
- requires = ["setuptools >= 61.0"]
- build-backend = "setuptools.build_meta"
-
  [project]
  name = "ml3-platform-docs"
  version = "0.0.1"

@@ -23,7 +19,7 @@ dependencies = [
      "pandas",
  ]

- [project.optional-dependencies]
+ [dependency-groups]
  dev = [
      "ml3-platform-sdk==0.0.22",
      "polars>=0.19.3",
