Azure OpenAI evaluation enables developers to create evaluation runs to test against expected input/output pairs, assessing the model’s performance across key metrics such as accuracy, reliability, and overall performance.
## Evaluation support
### Regional availability

- Australia East
- Brazil South
- Canada Central
- Central US
- East US 2
- France Central
- Germany West Central
- Italy North
- Japan East
- Japan West
- Korea Central
- North Central US
- Norway East
- Poland Central
- South Africa North
- Southeast Asia
- Spain Central
- Sweden Central
- Switzerland North
- Switzerland West
- UAE North
- UK South
- UK West
- West Europe
- West US
- West US 2
- West US 3

If your preferred region is missing, refer to [Azure OpenAI regions](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models?tabs=global-standard%2Cstandard-chat-completions#global-standard-model-availability) and check whether it's one of the regions where Azure OpenAI is available.
### Supported deployment types

- Global provisioned-managed
- Data zone provisioned-managed

## Evaluation API (preview)
The Evaluation API lets you test and improve model outputs directly through API calls, making it simple for developers to programmatically assess model quality and performance in their development workflows. To use the Evaluation API, check out the [REST API documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/authoring-reference-preview#evaluation---get-list).
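
As an illustration only, here's a minimal sketch of calling the list operation with Python's `requests` library. The route (`/openai/evals`), the API version, and the response shape shown below are assumptions based on the preview REST reference linked above; confirm them there before relying on this.

```python
import os
import requests

# Placeholder configuration; substitute your own resource endpoint and key.
endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]   # e.g. https://YOUR-RESOURCE.openai.azure.com
api_key = os.environ["AZURE_OPENAI_API_KEY"]
api_version = "2025-04-01-preview"               # assumption: check the REST reference for the current preview version

# List existing evaluations (the "Evaluation - Get List" operation in the REST reference).
# The /openai/evals route is an assumption based on the preview reference.
response = requests.get(
    f"{endpoint}/openai/evals",
    headers={"api-key": api_key},
    params={"api-version": api_version},
)
response.raise_for_status()

# The response is assumed to follow the usual list shape with a "data" array.
for evaluation in response.json().get("data", []):
    print(evaluation.get("id"), evaluation.get("name"))
```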
## Evaluation pipeline
### Test data
Testing criteria are used to assess the effectiveness of each output generated by the target model. These tests compare the input data with the output data to ensure consistency. You have the flexibility to configure different criteria to test and measure the quality and relevance of the output at different levels.
:::image type="content" source="../media/how-to/evaluations/eval-testing-criteria.png" alt-text="Screenshot that shows the evaluations testing criteria options." lightbox="../media/how-to/evaluations/eval-testing-criteria.png":::
When you click into each testing criterion, you'll see different types of graders as well as preset schemas that you can modify for your own evaluation dataset and criteria.
:::image type="content" source="../media/how-to/evaluations/eval-testing-criteria-2.png" alt-text="Screenshot that shows the evaluations testing criteria options." lightbox="../media/how-to/evaluations/eval-testing-criteria-2.png":::
## Getting started
1. Select **Azure OpenAI Evaluation (PREVIEW)** within the [Azure AI Foundry portal](https://ai.azure.com/). To see this view as an option, you might need to first select an existing Azure OpenAI resource in a supported region.
2. Select **+ New evaluation**
:::image type="content" source="../media/how-to/evaluations/new-evaluation.png" alt-text="Screenshot of the Azure OpenAI evaluation UX with new evaluation selected." lightbox="../media/how-to/evaluations/new-evaluation.png":::
3. Choose how you would like to provide test data for evaluation. You can import stored Chat Completions, create data using provided default templates, or upload your own data. Let's walk through uploading your own data.
:::image type="content" source="../media/how-to/evaluations/create-new-eval.png" alt-text="Screenshot of the Azure OpenAI create new evaluation." lightbox="../media/how-to/evaluations/create-new-eval.png":::
4. Select your evaluation data, which should be in `.jsonl` format. If you already have existing data, you can select it, or upload new data.
:::image type="content" source="../media/how-to/evaluations/upload-data-1.png" alt-text="Screenshot of data upload" lightbox="../media/how-to/evaluations/upload-data-1.png":::
When you upload new data, you'll see the first three lines of the file as a preview on the right side:
:::image type="content" source="../media/how-to/evaluations/upload-data-2.png" alt-text="Screenshot of data upload" lightbox="../media/how-to/evaluations/upload-data-2.png":::
If you need a sample test file, you can use this sample `.jsonl` text. The sample contains sentences about various technical topics, and we're going to assess semantic similarity across these sentences.
```jsonl
{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}
```
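
If you'd rather generate the test file programmatically, here's a minimal sketch that writes records in the same shape as the sample rows above (a chat-style `input` message list plus a reference `output` string). The file name `eval-test.jsonl` and the abbreviated row contents are just examples.

```python
import json

# Records follow the same shape as the sample rows above:
# a chat-style "input" message list plus a reference "output" string.
rows = [
    {
        "input": [
            {"role": "system", "content": "Provide a clear and concise summary of the technical content."},
            {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing..."},
        ],
        "output": "Tokenization divides text into smaller units (tokens) for NLP applications.",
    },
    # Add more rows here.
]

# Write one JSON object per line (.jsonl).
with open("eval-test.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```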
5. If you would like to create new responses using inputs from your test data, select **Generate new responses**. This injects the input fields from the evaluation file into individual prompts for a model of your choice to generate output.
:::image type="content" source="../media/how-to/evaluations/eval-generate-1.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-1.png":::
Select the model of your choice. If you don't have a model, you can create a new model deployment. The selected model takes the input data and generates its own unique outputs, which in this case are stored in a variable called `{{sample.output_text}}`. We'll then use that output later as part of our testing criteria. Alternatively, you could provide your own custom system message and individual message examples manually.
:::image type="content" source="../media/how-to/evaluations/eval-generate-2.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-2.png":::
6. To create a testing criterion, select **Add**. For the example file we provided above, we're assessing semantic similarity, so select **Model Scorer**, which contains testing criteria presets for Semantic Similarity.
:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-1.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-1.png":::
Select **Semantic Similarity** at the top. Scroll to the bottom, and in the `User` section, specify `{{item.output}}` as `Ground truth` and `{{sample.output_text}}` as `Output`. This takes the original reference output from your evaluation `.jsonl` file (the sample file above) and compares it against the output generated by the model you chose in the previous step (a rough illustration of this comparison appears after the screenshots below).
:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-2.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-2.png":::
:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-3.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-3.png":::
7. Select **Add** to add this testing criterion. If you would like to add more testing criteria, you can do so at this step.
8. You're ready to create your evaluation. Provide a name for your evaluation, review that everything looks correct, and select **Submit** to create the evaluation job. You'll be taken to a status page for your evaluation job, which will show the status as "Waiting".
:::image type="content" source="../media/how-to/evaluations/eval-submit-job.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job.png.png":::
:::image type="content" source="../media/how-to/evaluations/eval-submit-job-2.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job-2.png.png":::
9. Once your evaluation job has been created, you can select the job to view its full details:
:::image type="content" source="../media/how-to/evaluations/test-complete.png" alt-text="Screenshot of a completed semantic similarity test with mix of pass and failures." lightbox="../media/how-to/evaluations/test-complete.png":::
10. For semantic similarity, **View output details** contains a JSON representation of your passing tests that you can copy and paste.
:::image type="content" source="../media/how-to/evaluations/output-details.png" alt-text="Screenshot of the evaluation status UX with output details." lightbox="../media/how-to/evaluations/output-details.png":::
11. You can also add more evaluation runs by selecting the **+ Add Run** button at the top left corner of your evaluation job page.
## Types of testing criteria
Azure OpenAI evaluation offers various testing criteria beyond the Semantic Similarity used in the example above. This section provides more detailed information about each testing criterion.
0 commit comments