Commit ef91f7b

committed: "minor updates for clarity"
1 parent 7d09b0c, commit ef91f7b

File tree

1 file changed (+11, -7 lines)


articles/ai-studio/how-to/develop/evaluate-sdk.md

Lines changed: 11 additions & 7 deletions
@@ -89,6 +89,10 @@ Built-in evaluators can accept *either* query and response pairs or a list of conversations
 - Ground truth: the response generated by user/human as the true answer
 - Conversation: a list of messages of user and assistant turns. See more in the next section.
 
+
+> [!NOTE]
+> All evaluators except `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score, so they consume more tokens in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate longer inputs).
+
 #### Evaluating multi-turn conversations
 
 For evaluators that support conversations as input, you can pass the conversation directly into the evaluator:
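The conversation shape referenced in the context line above can be sketched in plain Python. This is a minimal sketch with made-up turns; the exact schema and the evaluator call are defined in the full article, so the invocation appears only as a comment:

```python
# A minimal sketch of a multi-turn conversation payload: a "messages" list
# of alternating user/assistant turns. The content here is made up.
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent should I buy for rainy weather?"},
        {"role": "assistant", "content": "A tent with a high waterproof rating."},
        {"role": "user", "content": "How do I clean it after a trip?"},
        {"role": "assistant", "content": "Wipe it down and air-dry it fully before storage."},
    ]
}

# A conversation-capable evaluator would then be invoked roughly as
# `evaluator(conversation=conversation)`; see the article for real calls.
user_turns = [m for m in conversation["messages"] if m["role"] == "user"]
print(f"{len(user_turns)} user turns")  # 2 user turns
```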
@@ -162,6 +166,8 @@ groundedness_score = groundedness_eval(
 )
 print(groundedness_score)
 ```
+> [!NOTE]
+> `GroundednessEvaluator` (open-source, prompt-based) supports `query` as an optional input. If `query` is provided, the optimal scenario is retrieval-augmented generation question answering (RAG QA); otherwise, the optimal scenario is summarization. This differs from `GroundednessProEvaluator` (powered by Azure AI Content Safety), which requires `query`.
 
 Here's an example of the result:

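To make the added note about the optional `query` concrete, here is a small sketch of the two input shapes as plain dictionaries with hypothetical content; evaluator construction is omitted:

```python
# With `query`: the optimal scenario for GroundednessEvaluator is RAG QA.
rag_qa_input = {
    "query": "Which tent is the most waterproof?",
    "context": "The Alpine Explorer Tent has a 3000mm waterproof rating.",
    "response": "The Alpine Explorer Tent is the most waterproof.",
}

# Without `query`: the optimal scenario shifts to summarization.
summarization_input = {k: v for k, v in rag_qa_input.items() if k != "query"}

# GroundednessProEvaluator, in contrast, requires `query`, so only the
# first shape would be valid input for it.
print(sorted(summarization_input))  # ['context', 'response']
```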
@@ -176,8 +182,7 @@ Here's an example of the result:
 
 > [!NOTE]
 > We strongly recommend that users migrate their code to use the key without prefixes (for example, `groundedness.groundedness`) to allow your code to support more evaluator models.
-> All evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they will consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation has been set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate for longer inputs.)
-> `GroundednessEvaluator` (open-source prompt-based) supports `query` as an optional input. If `query` is provided, their optimal scenario will be Retrieval Augmented Generation Question and Answering (RAG QA); and otherwise, the optimal scenario will be summarization. This is different from `GroundednessProEvaluator` (powered by Azure Content Safety) which requires `query`.
+
 
 
 ### Risk and safety evaluators
@@ -264,7 +269,7 @@ Built-in evaluators are great out of the box to start evaluating your application
 
 ### Code-based evaluators
 
-Sometimes a large language model isn't needed for certain evaluation metrics. This is when code-based evaluators can give you the flexibility to define metrics based on functions or callable class. Given a simple Python class in an example `answer_len/answer_length.py"` that calculates the length of an answer under a directory `answer_len/`:
+Sometimes a large language model isn't needed for certain evaluation metrics. Code-based evaluators give you the flexibility to define metrics based on functions or a callable class. You can create your own code-based evaluator, for example, with a simple Python class that calculates the length of an answer in `answer_length.py` under the directory `answer_len/`:
 
 ```python
 class AnswerLengthEvaluator:
@@ -274,8 +279,7 @@ class AnswerLengthEvaluator:
     def __call__(self, *, answer: str, **kwargs):
         return {"answer_length": len(answer)}
 ```
-
-You can create your own code-based evaluator and run it on a row of data by importing a callable class:
+Then run the evaluator on a row of data by importing a callable class:
 
 ```python
 with open("answer_len/answer_length.py") as fin:
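The hunk above is truncated mid-example. For reviewers, here is a self-contained sketch of the same pattern that defines the class inline instead of loading `answer_length.py` from disk:

```python
class AnswerLengthEvaluator:
    """A code-based evaluator: a plain callable class, no LLM involved."""

    def __call__(self, *, answer: str, **kwargs):
        # Returning a dict keeps the metric name explicit in the results.
        return {"answer_length": len(answer)}

answer_length_evaluator = AnswerLengthEvaluator()
result = answer_length_evaluator(answer="What is the speed of light?")
print(result)  # {'answer_length': 27}
```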
@@ -579,7 +583,7 @@ After local evaluations of your generative AI applications, you may want to trigger
 
 ### Installation Instructions
 
-1. Create a **virtual environment of you choice**. To create one using conda, run the following command:
+1. Create a **virtual Python environment of your choice**. To create one using conda, run the following command:
 ```bash
 conda create -n remote-evaluation
 conda activate remote-evaluation
@@ -634,7 +638,7 @@ from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
 print("F1 Score evaluator id:", F1ScoreEvaluator.id)
 ```
 - **From UI**: Follow these steps to fetch evaluator ids after they are registered to your project:
-  - Select on **Evaluation** of your Azure AI project;
+  - Select the **Evaluation** tab in your Azure AI project;
   - Select Evaluator library;
   - Select your evaluator(s) of choice by comparing the descriptions;
   - Copy its "Asset ID", which will be your evaluator id, for example, `azureml://registries/azureml/models/Groundedness-Pro-Evaluator/versions/1`.
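As an aside for readers working with the "Asset ID" format shown above, the evaluator id is an `azureml://` registry path whose segments can be pulled apart with plain string handling (illustrative only, not an SDK call):

```python
asset_id = "azureml://registries/azureml/models/Groundedness-Pro-Evaluator/versions/1"

# Segments after the scheme: registries/<registry>/models/<name>/versions/<version>
parts = asset_id.removeprefix("azureml://").split("/")
registry, name, version = parts[1], parts[3], parts[5]
print(name, version)  # Groundedness-Pro-Evaluator 1
```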
