Commit ec39339

agent eval updates
1 parent 395f9f7 commit ec39339

1 file changed: +11 -4 lines changed


articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 11 additions & 4 deletions
@@ -93,6 +93,7 @@ Retrieval quality is very important given its upstream role in RAG: if the retri
 ```python
 from azure.ai.evaluation import DocumentRetrievalEvaluator
 
+# these query_relevance_label are given by your human- or LLM-judges.
 retrieval_ground_truth = [
     {
         "document_id": "1",
@@ -115,8 +116,11 @@ retrieval_ground_truth = [
         "query_relevance_label": 0
     },
 ]
+# the min and max of the label scores are inputs to document retrieval evaluator
+ground_truth_label_min = 0
+ground_truth_label_max = 4
 
-# these reterieval scores
+# these relevance scores come from your search retrieval system
 retrieved_documents = [
     {
         "document_id": "2",
@@ -141,6 +145,8 @@ retrieved_documents = [
 ]
 
 document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    ground_truth_label_min=ground_truth_label_min,
+    ground_truth_label_max=ground_truth_label_max,
     ndcg_threshold = 0.5,
     xdcg_threshold = 50.0,
     fidelity_threshold = 0.5,
@@ -154,7 +160,7 @@ document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retr
 
 ### Document retrieval output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -165,15 +171,16 @@ The numerical score on a likert scale (integer 1 to 5) and a higher score is bet
     "top3_max_relevance": 2,
     "holes": 30,
     "holes_ratio": 0.6000000000000001,
-    "holes_is_higher_better": False,
-    "holes_ratio_is_higher_better": False,
+    "holes_higher_is_better": False,
+    "holes_ratio_higher_is_better": False,
     "total_retrieved_documents": 50,
     "total_groundtruth_documents": 1565,
     "ndcg@3_result": "pass",
     "xdcg@3_result": "pass",
     "fidelity_result": "fail",
     "top1_relevance_result": "fail",
     "top3_max_relevance_result": "fail",
+    # omitting more fields ...
 }
 ```
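
For readers of this diff, the sketch below illustrates how `holes`, `holes_ratio`, and an NDCG@k score could be derived from the two input lists. It is a plain-Python illustration under stated assumptions, not the `DocumentRetrievalEvaluator` implementation. It assumes a "hole" is a retrieved document with no ground-truth label, that each retrieved document carries a `relevance_score` key (the diff elides those fields), that NDCG uses linear gain, and that holes contribute zero gain. The helper names and sample values are hypothetical.

```python
import math


def count_holes(retrieval_ground_truth, retrieved_documents):
    """Return (holes, holes_ratio): retrieved documents lacking a ground-truth label."""
    labeled_ids = {doc["document_id"] for doc in retrieval_ground_truth}
    holes = sum(1 for doc in retrieved_documents if doc["document_id"] not in labeled_ids)
    ratio = holes / len(retrieved_documents) if retrieved_documents else 0.0
    return holes, ratio


def ndcg_at_k(retrieval_ground_truth, retrieved_documents, k=3):
    """Textbook NDCG@k with linear gain; unlabeled documents count as gain 0 (assumption)."""
    labels = {d["document_id"]: d["query_relevance_label"] for d in retrieval_ground_truth}
    # Rank by the retrieval system's own score, highest first.
    ranked = sorted(retrieved_documents, key=lambda d: d["relevance_score"], reverse=True)
    gains = [labels.get(d["document_id"], 0) for d in ranked[:k]]
    # The ideal ranking places the highest ground-truth labels first.
    ideal = sorted(labels.values(), reverse=True)[:k]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical sample data, made up for illustration only.
example_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 0},
]
example_retrieved = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "9", "relevance_score": 35.8},  # no label: a hole
    {"document_id": "1", "relevance_score": 29.2},
    {"document_id": "8", "relevance_score": 10.0},  # no label: a hole
]

print(count_holes(example_ground_truth, example_retrieved))
print(ndcg_at_k(example_ground_truth, example_retrieved, k=3))
```

With this sample, `count_holes` returns `(2, 0.5)`. The same ratio applied to the output shown in the diff (30 holes out of 50 retrieved documents) gives 0.6, which matches the reported `holes_ratio` up to floating-point rounding.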