This GitHub Action enables offline evaluation of AI models and agents within your CI/CD pipelines. It's designed to streamline the evaluation process, allowing you to assess model performance and make informed decisions before deploying to production.

Offline evaluation involves testing AI models and agents using test datasets to measure their performance on various quality and safety metrics such as fluency, coherence, and content safety. After you select a model in the [Azure AI Model Catalog](https://azure.microsoft.com/products/ai-model-catalog?msockid=1f44c87dd9fa6d1e257fdd6dd8406c42) or [GitHub Model marketplace](https://github.com/marketplace/models), offline pre-production evaluation is crucial for AI application validation during integration testing. This process allows developers to identify potential issues and make improvements before deploying the model or application to production, such as when creating and updating agents.

Two GitHub Actions are available for evaluating AI applications: **ai-agent-evals** and **genai-evals**.

- If your application is already using AI Foundry agents, **ai-agent-evals** is well-suited as it offers a simplified setup process and direct integration with agent-based workflows.
- **genai-evals** is intended for evaluating generative AI models outside of the agent framework.
> [!NOTE]
> The **ai-agent-evals** interface is more straightforward to configure. In contrast, **genai-evals** requires you to prepare structured evaluation input data. Code samples are provided to help with setup.
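
To illustrate the difference, here's a minimal sketch of what structured evaluation input for **genai-evals** might look like. It assumes JSON Lines data with `query`, `response`, and `ground_truth` columns, the convention used by the Azure AI Evaluation SDK; the response values are invented, so rely on the provided code samples for the exact schema your chosen evaluators require.

```jsonl
{"query": "Tell me about Tokyo?", "response": "Tokyo is Japan's capital and largest city.", "ground_truth": "Tokyo is the capital of Japan and the largest city in the country."}
{"query": "Where is Italy?", "response": "Italy is a country in southern Europe with its capital in Rome.", "ground_truth": "Italy is a country in southern Europe, located on the Italian Peninsula."}
```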
## How to set up AI agent evaluations
The input of ai-agent-evals includes:

- `deployment-name`: the deployed model name.
- `data-path`: path to the input data file containing the conversation starters. Each conversation starter is sent to each agent for a pairwise comparison of evaluation results.
- `evaluators`: built-in evaluator names.
- `data`: a set of conversation starters/queries.
  - Only a single agent turn is supported.
- `agent-ids`: a comma-separated list of unique identifiers for the agents to evaluate.
  - When only one `agent-id` is specified, the evaluation results include the absolute values for each metric along with the corresponding confidence intervals.

Here's a sample of the dataset:

```json
{
    "name": "MyTestData",
    "evaluators": [
        "RelevanceEvaluator",
        "ViolenceEvaluator",
        "HateUnfairnessEvaluator"
    ],
    "data": [
        {
            "query": "Tell me about Tokyo?"
        },
        {
            "query": "Where is Italy?"
        }
    ]
}
```
To use the GitHub Action, add it to your CI/CD workflows and specify the inputs.
> [!TIP]
> To minimize costs, you should avoid running evaluation on every commit.

This example illustrates how Azure Agent AI Evaluation can be run to compare different agents, identified by their agent IDs.
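
Below is a minimal sketch of such a workflow, not a definitive configuration: it assumes the action is referenced as `microsoft/ai-agent-evals`, that the job signs in with `azure/login` using OIDC, and that the project endpoint input and secret names are placeholders; check the action's README for the exact input names and the version to pin.

```yaml
name: AI agent evaluation

on:
  workflow_dispatch:   # run on demand rather than on every commit to keep evaluation costs down

permissions:
  id-token: write      # required for OIDC sign-in to Azure (assumed auth flow)
  contents: read

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Sign in so the action can reach the Azure AI Foundry project (assumed auth flow).
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      # Compare two agents pairwise on the conversation starters in the dataset.
      - uses: microsoft/ai-agent-evals@main    # pin to a released version in practice
        with:
          azure-ai-project-endpoint: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }}  # placeholder input name
          deployment-name: "gpt-4o-mini"                                       # example model deployment
          data-path: ${{ github.workspace }}/evals/evaluation-data.json        # file shaped like the sample above
          agent-ids: "my-agent-id1,my-agent-id2"                               # comma-separated agent IDs to compare
```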
| Category | Evaluator | ai-agent-evals | genai-evals |
|---|---|---|---|
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`GroundednessEvaluator`| Not Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`GroundednessProEvaluator`| Not Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`SimilarityEvaluator`| Not Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`IntentResolutionEvaluator`| Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`TaskAdherenceEvaluator`| Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`ToolCallAccuracyEvaluator`| Not Supported | Not Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`ResponseCompletenessEvaluator`| Not Supported | Supported |
|[Performance and quality (AI-assisted)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`DocumentRetrievalEvaluator`| Not Supported | Not Supported |
|[Performance and quality (NLP)](../how-to/develop/evaluate-sdk.md#performance-and-quality-evaluators)|`F1ScoreEvaluator`| Not Supported | Supported |