Skip to content

Commit 915f2c7

Browse files
committed
evalgithubaction0525
1 parent d38e4eb commit 915f2c7

File tree

3 files changed

+37
-44
lines changed

3 files changed

+37
-44
lines changed

articles/ai-foundry/how-to/evaluation-github-action.md

Lines changed: 34 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ description: How to run evaluation in GitHub Action to streamline the evaluation
55
manager: scottpolly
66
ms.service: azure-ai-foundry
77
ms.topic: how-to
8-
ms.date: 05/05/2025
8+
ms.date: 05/08/2025
99
ms.reviewer: hanch
1010
ms.author: lagayhar
1111
author: lgayhardt
@@ -15,9 +15,9 @@ author: lgayhardt
1515

1616
[!INCLUDE [feature-preview](../includes/feature-preview.md)]
1717

18-
This GitHub Action enables offline evaluation of AI models within your CI/CD pipelines. It's designed to streamline the evaluation process, allowing you to assess model performance and make informed decisions before deploying to production.
18+
This GitHub Action enables offline evaluation of AI models and agents within your CI/CD pipelines. It's designed to streamline the evaluation process, allowing you to assess model performance and make informed decisions before deploying to production.
1919

20-
Offline evaluation involves testing AI models using test datasets to measure their performance on various quality and safety metrics such as fluency, coherence, and content safety. After you select a model in the [Azure AI Model Catalog](https://azure.microsoft.com/products/ai-model-catalog?msockid=1f44c87dd9fa6d1e257fdd6dd8406c42) or [GitHub Model marketplace](https://github.com/marketplace/models), offline pre-production evaluation is crucial for AI application validation during integration testing. This process allows developers to identify potential issues and make improvements before deploying the model or application to production, such as when updating agents.
20+
Offline evaluation involves testing AI models and agents using test datasets to measure their performance on various quality and safety metrics such as fluency, coherence, and content safety. After you select a model in the [Azure AI Model Catalog](https://azure.microsoft.com/products/ai-model-catalog?msockid=1f44c87dd9fa6d1e257fdd6dd8406c42) or [GitHub Model marketplace](https://github.com/marketplace/models), offline pre-production evaluation is crucial for AI application validation during integration testing. This process allows developers to identify potential issues and make improvements before deploying the model or application to production, such as when creating and updating agents.
2121

2222
[!INCLUDE [features](../includes/evaluation-github-action-azure-devops-features.md)]
2323

@@ -29,11 +29,11 @@ Offline evaluation involves testing AI models using test datasets to measure the
2929

3030
Two GitHub Actions are available for evaluating AI applications: **ai-agent-evals** and **genai-evals**.
3131

32-
- If your application is already leveraging AI Foundry agents, **ai-agent-evals** is well-suited as it offers a simplified setup process and direct integration with agent-based workflows. **genai-evals** is intended for evaluating generative AI models outside of the agent framework.
32+
- If your application is already using AI Foundry agents, **ai-agent-evals** is well-suited as it offers a simplified setup process and direct integration with agent-based workflows.
3333
- **genai-evals** is intended for evaluating generative AI models outside of the agent framework.
3434

3535
> [!NOTE]
36-
> The **ai-agent-evals** interface is more straightforward to configure. In contrast, **genai-evals** require customers to prepare structured evaluation input data. Although code samples are provided to facilitate this process, the overall setup might involve additional complexity.
36+
> The **ai-agent-evals** interface is more straightforward to configure. In contrast, **genai-evals** requires you to prepare structured evaluation input data. Code samples are provided to help with setup.
3737
3838
## How to set up AI agent evaluations
3939

@@ -47,7 +47,7 @@ The input of ai-agent-evals includes:
4747
- `deployment-name`: the deployed model name.
4848
- `data-path`: Path to the input data file containing the conversation starters. Each conversation starter is sent to each agent for a pairwise comparison of evaluation results.
4949
- `evaluators`: built-in evaluator names.
50-
- `data`: a set of conversation starters/queries and ground truth. Ground-truth is optional and only required for a subset of evaluators. (See which [evaluator requires ground-truth](./develop/evaluate-sdk.md#data-requirements-for-built-in-evaluators))
50+
- `data`: a set of conversation starters/queries.
5151
- Only single agent turn is supported.
5252
- `agent-ids`: a unique identifier for the agent and comma-separated list of agent IDs to evaluate.
5353
- When only one `agent-id` is specified, the evaluation results include the absolute values for each metric along with the corresponding confidence intervals.
@@ -65,20 +65,16 @@ Here's a sample of the dataset:
6565
{
6666
"name": "MyTestData",
6767
"evaluators": [
68-
"GroundednessEvaluator",
6968
"RelevanceEvaluator",
7069
"ViolenceEvaluator",
71-
"HateUnfairnessEvaluator",
72-
"RougeScoreEvaluator"
70+
"HateUnfairnessEvaluator",
7371
],
7472
"data": [
7573
{
7674
"query": "Tell me about Tokyo?",
77-
"ground_truth": "Tokyo is the capital of Japan and the largest city in the country. It is located on the eastern coast of Honshu, the largest of Japan's four main islands. Tokyo is the political, economic, and cultural center of Japan and is one of the world's most populous cities. It is also one of the world's most important financial centers and is home to the Tokyo Stock Exchange."
7875
},
7976
{
8077
"query": "Where is Italy?",
81-
"ground_truth": "Italy is a country in southern Europe, located on the Italian Peninsula and the two largest islands in the Mediterranean Sea, Sicily and Sardinia. It is a unitary parliamentary republic with its capital in Rome, the largest city in Italy. Other major cities include Milan, Naples, Turin, and Palermo."
8278
}
8379
]
8480
}
@@ -92,47 +88,44 @@ To use the GitHub Action, add the GitHub Action to your CI/CD workflows and spec
9288
> [!TIP]
9389
> To minimize costs, you should avoid running evaluation on every commit.
9490
95-
This example illustrates how Azure Agent AI Evaluation can be run when comparing two different agents with agent ID `my-agent-id1` and `my-agent-id2`.
91+
This example illustrates how Azure Agent AI Evaluation can be run when comparing different agents with agent IDs.
9692

9793
```YAML
9894
name: "AI Agent Evaluation"
9995

10096
on:
101-
workflow_dispatch:
102-
push:
103-
branches:
104-
- main
97+
workflow_dispatch:
98+
push:
99+
branches:
100+
- main
105101

106102
permissions:
107-
id-token: write
108-
contents: read
103+
id-token: write
104+
contents: read
109105

110106
jobs:
111-
run-action:
112-
runs-on: ubuntu-latest
113-
steps:
114-
- name: Checkout
115-
uses: actions/checkout@v4
116-
117-
- name: Azure login using Federated Credentials
118-
uses: azure/login@v2
119-
with:
120-
client-id: ${{ vars.AZURE_CLIENT_ID }}
121-
tenant-id: ${{ vars.AZURE_TENANT_ID }}
122-
subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
123-
124-
- name: Run Evaluation
125-
uses: microsoft/ai-agent-evals@v1-beta
126-
with:
127-
# Replace placeholders with values for your Azure AI Project
128-
azure-aiproject-connection-string: "<your-ai-project-conn-str>"
129-
deployment-name: "<your-deployment-name>"
130-
agent-ids: "<your-ai-agent-ids>"
131-
data-path: ${{ github.workspace }}/path/to/your/data-file
107+
run-action:
108+
runs-on: ubuntu-latest
109+
steps:
110+
- name: Azure login using Federated Credentials
111+
uses: azure/login@v2
112+
with:
113+
client-id: ${{ vars.AZURE_CLIENT_ID }}
114+
tenant-id: ${{ vars.AZURE_TENANT_ID }}
115+
subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
116+
117+
- name: Run Evaluation
118+
uses: microsoft/ai-agent-evals@v1
119+
with:
120+
# Replace placeholders with values for your Azure AI Project
121+
azure-aiproject-connection-string: "<your-ai-project-conn-str>"
122+
deployment-name: "<your-deployment-name>"
123+
agent-ids: "<your-ai-agent-ids>"
124+
data-path: ${{ github.workspace }}/path/to/your/data-file
132125

133126
```
134127

135-
### AI agent evaluations outputs
128+
### AI agent evaluations output
136129

137130
Evaluation results are outputted to the summary section for each AI evaluation GitHub Action run under Actions in GitHub.com.
138131

@@ -238,7 +231,7 @@ jobs:
238231
evaluate-configuration: ${{ env.GENAI_EVALS_CONFIG_PATH }}
239232
```
240233

241-
### GenAI evaluations outputs
234+
### GenAI evaluations output
242235

243236
Evaluation results are outputted to the summary section for each AI evaluation GitHub Action run under Actions in GitHub.com.
244237

articles/ai-foundry/includes/evaluation-github-action-azure-devops-features.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ description: Include file
44
author: lgayhardt
55
ms.service: azure-ai-foundry
66
ms.topic: include
7-
ms.date: 4/30/2025
7+
ms.date: 5/08/2025
88
ms.author: lagayhar
99
ms.custom: include file
1010
---
@@ -17,7 +17,7 @@ ms.custom: include file
1717

1818
The following evaluators are supported:
1919

20-
| Category | Evaluator class/Metrics | AI Agent evals | GenAI evals |
20+
| Category | Evaluator class/Metrics | AI Agent evaluations | GenAI evaluations |
2121
|--|--|--|--|
2222
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `GroundednessEvaluator` | Not Supported | Supported |
2323
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `GroundednessProEvaluator` | Not Supported | Supported |
@@ -28,7 +28,7 @@ ms.custom: include file
2828
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `SimilarityEvaluator` | Not Supported | Supported |
2929
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `IntentResolutionEvaluator` | Supported | Supported |
3030
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `TaskAdherenceEvaluator` | Supported | Supported |
31-
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `ToolCallAccuracyEvaluator` | Supported | Not Supported |
31+
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `ToolCallAccuracyEvaluator` | Not Supported | Not Supported |
3232
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `ResponseCompletenessEvaluator` | Not Supported | Supported |
3333
| [Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `DocumentRetrievalEvaluator` | Not Supported | Not Supported |
3434
| [Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators) | `F1ScoreEvaluator` | Not Supported | Supported |
18.1 KB
Loading

0 commit comments

Comments
 (0)