articles/ai-services/openai/how-to/evaluations.md
13 additions & 14 deletions
@@ -50,7 +50,7 @@ Azure OpenAI evaluation enables developers to create evaluation runs to test aga
- West US 2
- West US 3
- If your preferred region is missing, please refer to [Azure OpenAI regions](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models?tabs=global-standard%2Cstandard-chat-completions#global-standard-model-availability) and check if it is one of the Azure OpenAI regional availability zones.
+ If your preferred region is missing, please refer to [Azure OpenAI regions](https://learn.microsoft.com/azure/ai-services/openai/concepts/models?tabs=global-standard%2Cstandard-chat-completions#global-standard-model-availability) and check if it is one of the Azure OpenAI regional availability zones.
### Supported deployment types
@@ -63,7 +63,7 @@ If your preferred region is missing, please refer to [Azure OpenAI regions](http
## Evaluation API (preview)
- Evaluation API lets users test and improve model outputs directly through API calls, making the experience simple and customizable for developers to programmatically assess model quality and performance in their development workflows. To use Evaluation API, check out the [REST API documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/authoring-reference-preview#evaluation---get-list).
+ Evaluation API lets users test and improve model outputs directly through API calls, making the experience simple and customizable for developers to programmatically assess model quality and performance in their development workflows. To use Evaluation API, check out the [REST API documentation](https://learn.microsoft.com/azure/ai-services/openai/authoring-reference-preview#evaluation---get-list).
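As a rough illustration of the programmatic flow this paragraph describes, the sketch below lists existing evaluations through the preview Authoring API over plain HTTP. The route (`/openai/evals`), the `api-version` value, and the response shape are assumptions made for illustration only; confirm the exact values against the linked REST API documentation before relying on them.

```python
# Illustrative sketch only: the /openai/evals route, api-version, and response
# fields below are assumptions -- check the Authoring REST API reference for
# the exact, currently supported values.
import os
import requests

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]   # e.g. https://<resource>.openai.azure.com
api_key = os.environ["AZURE_OPENAI_API_KEY"]

response = requests.get(
    f"{endpoint}/openai/evals",
    params={"api-version": "2025-04-01-preview"},  # assumed preview API version
    headers={"api-key": api_key},
)
response.raise_for_status()

# Print a short summary of each evaluation returned (field names assumed).
for evaluation in response.json().get("data", []):
    print(evaluation.get("id"), evaluation.get("name"))
```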
## Evaluation pipeline
@@ -154,7 +154,7 @@ When you click into each testing criteria, you will see different types of grade
:::image type="content" source="../media/how-to/evaluations/upload-data-2.png" alt-text="Screenshot of data upload" lightbox="../media/how-to/evaluations/upload-data-2.png":::
- If you need a sample test file, you can use this sample `.jsonl` text. This sample contains sentences of various technical content, and we are going to be assessing semantic similiarity across these sentences.
+ If you need a sample test file, you can use this sample `.jsonl` text. This sample contains sentences of various technical content, and we are going to be assessing semantic similarity across these sentences.
```jsonl
{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}
@@ -171,43 +171,42 @@ When you click into each testing criteria, you will see different types of grade
5. If you would like to create new responses using inputs from your test data, you can select 'Generate new responses'. This will inject the input fields from our evaluation file into individual prompts for a model of your choice to generate output.
- :::image type="content" source="../media/how-to/evaluations/eval-generate-1.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-1.png":::
+ :::image type="content" source="../media/how-to/evaluations/eval-generate-1.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-1.png":::
You will select the model of your choice. If you do not have a model, you can create a new model deployment. The selected model will take the input data and generate its own unique outputs, which in this case will be stored in a variable called `{{sample.output_text}}`. We'll then use that output later as part of our testing criteria. Alternatively, you could provide your own custom system message and individual message examples manually.
- :::image type="content" source="../media/how-to/evaluations/eval-generate-2.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-2.png":::
+ :::image type="content" source="../media/how-to/evaluations/eval-generate-2.png" alt-text="Screenshot of the UX for generating model responses" lightbox="../media/how-to/evaluations/eval-generate-2.png":::
- 6. For creating a test criteria, select **Add**. For the example file we provided above, we are going to be assessing semantic similarity. Select **Model Scorer**, which contains test criteria presets for Semantic Semilarity.
+ 6. For creating a test criteria, select **Add**. For the example file we provided above, we are going to be assessing semantic similarity. Select **Model Scorer**, which contains test criteria presets for Semantic Similarity.
:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-1.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-1.png":::
Select **Semantic Similarity** at the top. Scroll to the bottom, and in `User` section, specify `{{item.output}}` as `Ground truth`, and specify `{{sample.output_text}}` as `Output`. This will take the original reference output from your evaluation `.jsonl` file (the sample file above) and compare it against the output that is generated by the model you chose in the previous step.
:::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-2.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-2.png":::
- :::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-3.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-3.png":::
+ :::image type="content" source="../media/how-to/evaluations/eval-semantic-similarity-3.png" alt-text="Screenshot of the semantic similarity UX config." lightbox="../media/how-to/evaluations/eval-semantic-similarity-3.png":::
7. Select **Add** to add this testing criteria. If you would like to add additional testing criteria, you can add them at this step.
8. You are ready to create your Evaluation. Provide your Evaluation name, review everything looks correct, and **Submit** to create the Evaluation job. You'll be taken to a status page for your evaluation job, which will show the status as "Waiting".
- :::image type="content" source="../media/how-to/evaluations/eval-submit-job.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job.png.png":::
-
- :::image type="content" source="../media/how-to/evaluations/eval-submit-job-2.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job-2.png.png":::
+ :::image type="content" source="../media/how-to/evaluations/eval-submit-job.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job.png":::
+ :::image type="content" source="../media/how-to/evaluations/eval-submit-job-2.png" alt-text="Screenshot of the evaluation job submit UX." lightbox="../media/how-to/evaluations/eval-submit-job-2.png":::
9. Once your evaluation job has created, you can select the job to view the full details of the job:
- :::image type="content" source="../media/how-to/evaluations/test-complete.png" alt-text="Screenshot of a completed semantic similarity test with mix of pass and failures." lightbox="../media/how-to/evaluations/test-complete.png":::
+ :::image type="content" source="../media/how-to/evaluations/test-complete.png" alt-text="Screenshot of a completed semantic similarity test with mix of pass and failures." lightbox="../media/how-to/evaluations/test-complete.png":::
10. For semantic similarity **View output details** contains a JSON representation that you can copy/paste of your passing tests.
- :::image type="content" source="../media/how-to/evaluations/output-details.png" alt-text="Screenshot of the evaluation status UX with output details." lightbox="../media/how-to/evaluations/output-details.png":::
+ :::image type="content" source="../media/how-to/evaluations/output-details.png" alt-text="Screenshot of the evaluation status UX with output details." lightbox="../media/how-to/evaluations/output-details.png":::
- 11. You can also add more Eval runs by selecting **+ Add Run** button at the top left corner of your evluation job page.
+ 11. You can also add more Eval runs by selecting **+ Add Run** button at the top left corner of your evaluation job page.
## Types of Testing Criteria
- Azure OpenAI Evaluation offers various evaluation testing criteria on top of Semantic Similiarity we saw in the example above. This section provides information about each testing criteria at much more detail.
+ Azure OpenAI Evaluation offers various evaluation testing criteria on top of Semantic Similarity we saw in the example above. This section provides information about each testing criteria at much more detail.
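For readers following the walkthrough in this diff, the sketch below shows one way to produce a test file in the same `.jsonl` shape as the article's sample: each line holds the `input` chat messages and a reference `output` that a Model Scorer criterion such as Semantic Similarity can consume as `{{item.output}}`, while `{{sample.output_text}}` is filled in later by the model you pick in step 5. The file name and the truncated message strings here are placeholders, not values from the article.

```python
# Sketch of building an evaluation test file in the .jsonl shape shown in the
# article's sample: one JSON object per line with "input" messages and a
# reference "output". The file name and message text are placeholders.
import json

rows = [
    {
        "input": [
            {"role": "system", "content": "Provide a clear and concise summary of the technical content."},
            {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing..."},
        ],
        "output": "Tokenization divides text into smaller units (tokens) for NLP applications...",
    },
]

# Write one compact JSON object per line, as expected for .jsonl uploads.
with open("eval-data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```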