articles/ai-studio/how-to/develop/evaluate-sdk.md
Built-in evaluators can accept *either* query and response pairs or a list of conversations:
- Ground truth: the response generated by a human as the true answer
- Conversation: a list of messages of user and assistant turns. See more in the next section.
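For instance, a single query-and-response row might look like the following. The values are made-up sample content, and the `context` field is an assumption here, used by grounding-style evaluators:

```python
# Illustrative query/response pair; field names follow the data
# requirements above, and the values are made-up sample content.
query_response = {
    "query": "Which tent is the most waterproof?",
    "context": "The Alpine Explorer Tent is the most waterproof tent in the catalog.",  # assumed optional field
    "response": "The Alpine Explorer Tent is the most waterproof.",
    "ground_truth": "The Alpine Explorer Tent is the most waterproof.",
}
```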
> [!NOTE]
> All evaluators except `SimilarityEvaluator` come with a reason field. They employ techniques such as chain-of-thought reasoning to generate an explanation for the score, so they consume more tokens during generation in exchange for improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate longer inputs).
#### Evaluating multi-turn conversations
For evaluators that support conversations as input, you can pass the conversation directly into the evaluator:
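As a sketch of the conversation format (the message fields are based on the single-turn inputs above; `groundedness_eval` is an assumed, already-constructed evaluator instance built from your model configuration):

```python
# A conversation is a dict with a "messages" list of user and assistant
# turns; assistant turns may carry an optional "context" field.
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "Per the product list, the Alpine Explorer Tent has the highest waterproof rating.",
        },
    ]
}

# Hypothetical call on an assumed evaluator instance:
# result = groundedness_eval(conversation=conversation)
```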
> `GroundednessEvaluator` (open-source, prompt-based) supports `query` as an optional input. If `query` is provided, the optimal scenario is Retrieval Augmented Generation question answering (RAG QA); otherwise, the optimal scenario is summarization. This differs from `GroundednessProEvaluator` (powered by Azure Content Safety), which requires `query`.
Here's an example of the result:
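A sketch of what such a result could look like (the scores, the `evaluation_per_turn` key, and the reason text are illustrative assumptions, not guaranteed output):

```python
# Made-up sample output for a conversation-mode evaluation: the top-level
# score aggregates the per-turn scores, and each turn carries a reason string.
result = {
    "groundedness": 4.0,
    "evaluation_per_turn": {
        "groundedness": [5.0, 3.0],
        "groundedness_reason": [
            "The response is fully grounded in the provided context.",
            "The response is only partially grounded in the provided context.",
        ],
    },
}
```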
> [!NOTE]
> We strongly recommend migrating your code to use the key without prefixes (for example, `groundedness.groundedness`) so that your code supports more evaluator models.
### Risk and safety evaluators
Built-in evaluators are great out of the box to start evaluating your applications.
### Code-based evaluators
Sometimes a large language model isn't needed for certain evaluation metrics, and code-based evaluators give you the flexibility to define metrics based on functions or a callable class. You can create your own code-based evaluator, for example, with a simple Python class that calculates the length of an answer in `answer_length.py` under the directory `answer_len/`:
```python
class AnswerLengthEvaluator:
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}
```
Then run the evaluator on a row of data by importing the callable class:
```python
with open("answer_len/answer_length.py") as fin:
    print(fin.read())
```
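Putting the pieces together, here is a minimal self-contained sketch. The class body repeats `answer_length.py` so the snippet runs on its own, and the sample answer is arbitrary:

```python
class AnswerLengthEvaluator:
    """Code-based evaluator: no LLM involved, just a callable class."""

    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}

evaluator = AnswerLengthEvaluator()
result = evaluator(answer="What is the speed of light?")
print(result)  # {'answer_length': 27}
```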
After local evaluations of your generative AI applications, you may want to trigger remote evaluations.
### Installation Instructions
1. Create a **virtual Python environment of your choice**. To create one using conda, run the following command:
```bash
conda create -n remote-evaluation
conda activate remote-evaluation
```

```python
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
```