Commit 6987c76

Merge pull request #4660 from MicrosoftDocs/main

5/8/2025 PM Publish

2 parents e5ca531 + 7447963

7 files changed: +313 −8 lines

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
---
title: How to run an evaluation in GitHub Action
titleSuffix: Azure AI Foundry
description: How to run evaluation in GitHub Action to streamline the evaluation process, allowing you to assess model performance and make informed decisions before deploying to production.
manager: scottpolly
ms.service: azure-ai-foundry
ms.topic: how-to
ms.date: 05/08/2025
ms.reviewer: hanch
ms.author: lagayhar
author: lgayhardt
---
# How to run an evaluation in GitHub Action (preview)

[!INCLUDE [feature-preview](../includes/feature-preview.md)]

This GitHub Action enables offline evaluation of AI models and agents within your CI/CD pipelines. It's designed to streamline the evaluation process, allowing you to assess model performance and make informed decisions before deploying to production.

Offline evaluation involves testing AI models and agents with test datasets to measure their performance on quality and safety metrics such as fluency, coherence, and content safety. After you select a model in the [Azure AI Model Catalog](https://azure.microsoft.com/products/ai-model-catalog?msockid=1f44c87dd9fa6d1e257fdd6dd8406c42) or the [GitHub Model marketplace](https://github.com/marketplace/models), offline pre-production evaluation is crucial for validating your AI application during integration testing. This process helps you identify potential issues and make improvements before deploying the model or application to production, such as when creating or updating agents.

[!INCLUDE [features](../includes/evaluation-github-action-azure-devops-features.md)]

- **Seamless integration**: Easily integrate with existing GitHub workflows to run evaluations based on rules that you specify (for example, when changes are committed to agent versions, prompt templates, or feature flag configurations).
- **Statistical analysis**: Evaluation results include confidence intervals and tests for statistical significance, so you can determine whether changes are meaningful rather than random variation.
- **Out-of-the-box operational metrics**: Automatically generates operational metrics for each evaluation run: client run duration, server run duration, completion tokens, and prompt tokens.
## Prerequisites

Two GitHub Actions are available for evaluating AI applications: **ai-agent-evals** and **genai-evals**.

- If your application already uses Azure AI Foundry agents, **ai-agent-evals** is well suited because it offers a simplified setup process and direct integration with agent-based workflows.
- **genai-evals** is intended for evaluating generative AI models outside of the agent framework.

> [!NOTE]
> The **ai-agent-evals** interface is more straightforward to configure. In contrast, **genai-evals** requires you to prepare structured evaluation input data. Code samples are provided to help with setup.

## How to set up AI agent evaluations

### AI agent evaluations input

The input of ai-agent-evals includes:
**Required:**

- `azure-aiproject-connection-string`: The connection string for the Azure AI project. It's used to connect to Azure OpenAI to simulate conversations with each agent, and to the Azure AI Evaluation SDK to perform the evaluation.
- `deployment-name`: The name of the deployed model.
- `data-path`: The path to the input data file containing the conversation starters. Each conversation starter is sent to each agent for a pairwise comparison of evaluation results.
- `evaluators`: The names of the built-in evaluators to run (specified in the data file).
- `data`: A set of conversation starters or queries (specified in the data file).
  - Only a single agent turn is supported.
- `agent-ids`: A comma-separated list of unique identifiers (agent IDs) for the agents to evaluate.
  - When only one agent ID is specified, the evaluation results include the absolute values for each metric along with the corresponding confidence intervals.
  - When multiple agent IDs are specified, the results include absolute values for each agent and a statistical comparison against the designated baseline agent ID.
**Optional:**

- `api-version`: The API version of the deployed model.
- `baseline-agent-id`: The agent ID of the baseline agent to compare against. By default, the first agent is used.
- `evaluation-result-view`: The format of the evaluation results. Options are "default" (boolean scores such as passing and defect rates), "all-scores" (all evaluation scores), and "raw-scores-only" (non-boolean scores only). Defaults to "default" if omitted.
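As an illustration, the following minimal sketch shows how these inputs might appear in the `with:` block of the Run Evaluation step shown later. The agent IDs are hypothetical placeholders, not real resources:

```yaml
with:
  # Required inputs (placeholders)
  azure-aiproject-connection-string: "<your-ai-project-conn-str>"
  deployment-name: "<your-deployment-name>"
  data-path: ${{ github.workspace }}/path/to/your/data-file
  # Two hypothetical agent IDs; each agent is compared against the baseline
  agent-ids: "asst_agent_v1,asst_agent_v2"
  # Optional: defaults to the first agent in agent-ids if omitted
  baseline-agent-id: "asst_agent_v1"
  # Optional: API version of the deployed model
  api-version: "<api-version>"
  # Optional: include all evaluation scores, not just boolean pass/defect rates
  evaluation-result-view: "all-scores"
```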
Here's a sample of the dataset:

```json
{
    "name": "MyTestData",
    "evaluators": [
        "RelevanceEvaluator",
        "ViolenceEvaluator",
        "HateUnfairnessEvaluator"
    ],
    "data": [
        {
            "query": "Tell me about Tokyo?"
        },
        {
            "query": "Where is Italy?"
        }
    ]
}
```
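If you save this dataset at a hypothetical path such as `evals/eval-data.json` in your repository, the `data-path` input in the workflow step points to it like this:

```yaml
# Hypothetical file location; adjust to where your dataset actually lives
data-path: ${{ github.workspace }}/evals/eval-data.json
```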
### AI agent evaluations workflow

To use the GitHub Action, add it to your CI/CD workflows and specify the trigger criteria (for example, on commit) and the file paths that trigger your automated workflows.

> [!TIP]
> To minimize costs, avoid running evaluation on every commit.

This example illustrates how AI agent evaluation can be run to compare different agents by their agent IDs.
```yaml
name: "AI Agent Evaluation"

on:
  workflow_dispatch:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  run-action:
    runs-on: ubuntu-latest
    steps:
      - name: Azure login using Federated Credentials
        uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

      - name: Run Evaluation
        uses: microsoft/ai-agent-evals@v1
        with:
          # Replace placeholders with values for your Azure AI Project
          azure-aiproject-connection-string: "<your-ai-project-conn-str>"
          deployment-name: "<your-deployment-name>"
          agent-ids: "<your-ai-agent-ids>"
          data-path: ${{ github.workspace }}/path/to/your/data-file
```
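In line with the earlier tip about minimizing costs, one way to avoid running evaluation on every commit is to narrow the `push` trigger with a `paths` filter, so the workflow runs only when relevant files change. This is a sketch; the directory names are hypothetical and should match where your agent assets, prompt templates, or test data actually live:

```yaml
on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      # Hypothetical locations: adjust to your repo layout
      - "agents/**"
      - "prompts/**"
      - "evals/data/**"
```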
### AI agent evaluations output

Evaluation results are written to the summary section of each AI evaluation GitHub Action run, under Actions on GitHub.com.

The result includes two main parts:

- The top section summarizes your AI agent variants. Select the agent ID link to go to the agent settings page in the Azure AI Foundry portal, or select the Evaluation Results link to view individual results in detail in the portal.
- The second section includes evaluation scores and a comparison between variants: statistical significance for multiple agents, or confidence intervals for a single agent.

Multi-agent evaluation result:

:::image type="content" source="../media/evaluations/github-action-multi-agent-result.png" alt-text="Screenshot of multi agent evaluation result in GitHub Action." lightbox="../media/evaluations/github-action-multi-agent-result.png":::

Single-agent evaluation result:

:::image type="content" source="../media/evaluations/github-action-single-agent-output.png" alt-text="Screenshot of single agent evaluation result in GitHub Action." lightbox="../media/evaluations/github-action-single-agent-output.png":::
## How to set up genAI evaluations

### GenAI evaluations input

The input of genai-evals includes the following (some are optional depending on the evaluators used):

Evaluation configuration file:

- `data`: A set of queries and ground truth. Ground truth is optional and required only for a subset of evaluators (see which [evaluators require ground truth](./develop/evaluate-sdk.md#data-requirements-for-built-in-evaluators)).

Here's a sample of the dataset:
```json
[
    {
        "query": "Tell me about Tokyo?",
        "ground-truth": "Tokyo is the capital of Japan and the largest city in the country. It is located on the eastern coast of Honshu, the largest of Japan's four main islands. Tokyo is the political, economic, and cultural center of Japan and is one of the world's most populous cities. It is also one of the world's most important financial centers and is home to the Tokyo Stock Exchange."
    },
    {
        "query": "Where is Italy?",
        "ground-truth": "Italy is a country in southern Europe, located on the Italian Peninsula and the two largest islands in the Mediterranean Sea, Sicily and Sardinia. It is a unitary parliamentary republic with its capital in Rome, the largest city in Italy. Other major cities include Milan, Naples, Turin, and Palermo."
    },
    {
        "query": "Where is Papua New Guinea?",
        "ground-truth": "Papua New Guinea is an island country that lies in the south-western Pacific. It includes the eastern half of New Guinea and many small offshore islands. Its neighbours include Indonesia to the west, Australia to the south and Solomon Islands to the south-east."
    }
]
```

- `evaluators`: The names of the built-in evaluators to run.
- `ai_model_configuration`: The model configuration, including `type`, `azure_endpoint`, `azure_deployment`, and `api_version`.
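Put together, the evaluation configuration is a single JSON document. The sample workflow below generates it inline from secrets; shown standalone with placeholder values, it might look like this:

```json
{
  "data": "<path-to-your-evaluation-data-file>",
  "evaluators": {
    "coherence": "CoherenceEvaluator",
    "fluency": "FluencyEvaluator"
  },
  "ai_model_configuration": {
    "type": "azure_openai",
    "azure_endpoint": "<your-azure-openai-endpoint>",
    "azure_deployment": "<your-chat-deployment-name>",
    "api_key": "<your-api-key>",
    "api_version": "<your-api-version>"
  }
}
```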
### GenAI evaluations workflow

This example illustrates how Azure AI Evaluation can be run when changes are committed to specific files in your repo.

> [!NOTE]
> Update `GENAI_EVALS_DATA_PATH` to point to the correct directory in your repo.
```yml
name: Sample Evaluate Action
on:
  workflow_call:
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      GENAI_EVALS_CONFIG_PATH: ${{ github.workspace }}/evaluate-config.json
      GENAI_EVALS_DATA_PATH: ${{ github.workspace }}/.github/.test_files/eval-input.jsonl
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.OIDC_AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.OIDC_AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.OIDC_AZURE_SUBSCRIPTION_ID }}
      - name: Write evaluate config
        run: |
          cat > ${{ env.GENAI_EVALS_CONFIG_PATH }} <<EOF
          {
            "data": "${{ env.GENAI_EVALS_DATA_PATH }}",
            "evaluators": {
              "coherence": "CoherenceEvaluator",
              "fluency": "FluencyEvaluator"
            },
            "ai_model_configuration": {
              "type": "azure_openai",
              "azure_endpoint": "${{ secrets.AZURE_OPENAI_ENDPOINT }}",
              "azure_deployment": "${{ secrets.AZURE_OPENAI_CHAT_DEPLOYMENT }}",
              "api_key": "${{ secrets.AZURE_OPENAI_API_KEY }}",
              "api_version": "${{ secrets.AZURE_OPENAI_API_VERSION }}"
            }
          }
          EOF
      - name: Run AI Evaluation
        id: run-ai-evaluation
        uses: microsoft/genai-evals@main
        with:
          evaluate-configuration: ${{ env.GENAI_EVALS_CONFIG_PATH }}
```
### GenAI evaluations output

Evaluation results are written to the summary section of each AI evaluation GitHub Action run, under Actions on GitHub.com.

The results include three parts:

- Test variants: a summary of variant names and system prompts.
- Average scores: the average score of each evaluator for each variant.
- Individual test scores: detailed results for each individual test case.

:::image type="content" source="../media/evaluations/github-action-output-results.png" alt-text="Screenshot of result output including test variants, average score, and individual test in GitHub Action." lightbox="../media/evaluations/github-action-output-results.png":::
## Related content

- [How to evaluate generative AI models and applications with Azure AI Foundry](./evaluate-generative-ai-app.md)
- [How to view evaluation results in Azure AI Foundry portal](./evaluate-results.md)
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
---
title: Include file
description: Include file
author: lgayhardt
ms.service: azure-ai-foundry
ms.topic: include
ms.date: 05/08/2025
ms.author: lagayhar
ms.custom: include file
---
## Features

- **Automated evaluation**: Integrate offline evaluation into your CI/CD workflows to automate the pre-production assessment of AI models.

- **Built-in evaluators**: Use the existing evaluators provided by the [Azure AI Evaluation SDK](../how-to/develop/evaluate-sdk.md).

The following evaluators are supported:

| Category | Evaluator class/metric | AI agent evaluations | GenAI evaluations |
|--|--|--|--|
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `GroundednessEvaluator` | Not Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `GroundednessProEvaluator` | Not Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `RetrievalEvaluator` | Not Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `RelevanceEvaluator` | Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `CoherenceEvaluator` | Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `FluencyEvaluator` | Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `SimilarityEvaluator` | Not Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `IntentResolutionEvaluator` | Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `TaskAdherenceEvaluator` | Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `ToolCallAccuracyEvaluator` | Not Supported | Not Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `ResponseCompletenessEvaluator` | Not Supported | Supported |
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `DocumentRetrievalEvaluator` | Not Supported | Not Supported |
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `F1ScoreEvaluator` | Not Supported | Supported |
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `RougeScoreEvaluator` | Not Supported | Not Supported |
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `GleuScoreEvaluator` | Not Supported | Supported |
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `BleuScoreEvaluator` | Not Supported | Supported |
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `MeteorScoreEvaluator` | Not Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `ViolenceEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `SexualEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `SelfHarmEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `HateUnfairnessEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `IndirectAttackEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `ProtectedMaterialEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `CodeVulnerabilityEvaluator` | Supported | Supported |
| [Risk and safety (AI-assisted)](../how-to/develop/evaluate-sdk.md#risk-and-safety-evaluators-preview) | `UngroundedAttributesEvaluator` | Not Supported | Supported |
| [Composite](../how-to/develop/evaluate-sdk.md#composite-evaluators) | `QAEvaluator` | Not Supported | Supported |
| [Composite](../how-to/develop/evaluate-sdk.md#composite-evaluators) | `ContentSafetyEvaluator` | Supported | Supported |
| [Composite](../how-to/develop/evaluate-sdk.md#composite-evaluators) | `AgentOverallEvaluator` | Not Supported | Not Supported |
| Operational metrics | Client run duration | Supported | Not Supported |
| Operational metrics | Server run duration | Supported | Not Supported |
| Operational metrics | Completion tokens | Supported | Not Supported |
| Operational metrics | Prompt tokens | Supported | Not Supported |
| [Custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators) | | Not Supported | Not Supported |
3 binary files added (456 KB, 474 KB, 297 KB): the evaluation result screenshots referenced by the new article.

articles/ai-foundry/toc.yml

Lines changed: 8 additions & 6 deletions
```diff
@@ -459,12 +459,8 @@ items:
       displayName: accuracy,metrics
     - name: Run evaluations online
       href: how-to/online-evaluation.md
-    - name: Evaluate flows in the portal
-      items:
-      - name: Submit batch run and evaluate a flow
-        href: how-to/flow-bulk-test-evaluation.md
-      - name: Develop an evaluation flow in Prompt flow
-        href: how-to/flow-develop-evaluation.md
+    - name: Run an evaluation in GitHub Action
+      href: how-to/evaluation-github-action.md
     - name: A/B experimentation
       href: concepts/a-b-experimentation.md
     - name: Build apps with prompt flow
@@ -481,6 +477,12 @@ items:
       href: how-to/flow-tune-prompts-using-variants.md
     - name: Process images in a flow
       href: how-to/flow-process-image.md
+    - name: Evaluate flows in the portal
+      items:
+      - name: Submit batch run and evaluate a flow
+        href: how-to/flow-bulk-test-evaluation.md
+      - name: Develop an evaluation flow in Prompt flow
+        href: how-to/flow-develop-evaluation.md
     - name: Use prompt flow tools
       items:
       - name: Prompt flow tools overview
```

articles/ai-services/openai/how-to/spillover-traffic-management.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,8 +1,8 @@
 ---
 title: Manage traffic with spillover for Provisioned deployments
 description: Article outlining how to use the spillover feature to manage traffic bursts for Azure OpenAI Service provisioned deployments
-author: sydneemayers # GitHub alias
-ms.author: sydneemayers
+author: aahill # GitHub alias
+ms.author: aahi
 ms.service: azure-ai-openai
 ms.topic: how-to
 ms.date: 03/05/2025
```
