Commit 33332bd

Add new evaluators and package (#47569)
1 parent 76b8518 commit 33332bd

docs/ai/conceptual/evaluation-libraries.md

Lines changed: 18 additions & 4 deletions
@@ -2,7 +2,7 @@
 title: The Microsoft.Extensions.AI.Evaluation libraries
 description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
 ms.topic: concept-article
-ms.date: 05/13/2025
+ms.date: 07/24/2025
 ---
 # The Microsoft.Extensions.AI.Evaluation libraries

@@ -11,8 +11,9 @@ The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluat
 The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages:
 
 - [📦 Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) – Defines the core abstractions and types for supporting evaluation.
-- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
-- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains evaluators, such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
+- [📦 Microsoft.Extensions.AI.Evaluation.NLP](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.NLP) – Contains [evaluators](#nlp-evaluators) that evaluate the similarity of an LLM's response text to one or more reference responses using natural language processing (NLP) metrics. These evaluators aren't LLM or AI-based; they use traditional NLP techniques such as text tokenization and n-gram analysis to evaluate text similarity.
+- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains [evaluators](#quality-evaluators) that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
+- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains [evaluators](#safety-evaluators), such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
 - [📦 Microsoft.Extensions.AI.Evaluation.Reporting](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data.
 - [📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) – Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an [Azure Storage](/azure/storage/common/storage-introduction) container.
 - [📦 Microsoft.Extensions.AI.Evaluation.Console](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) – A command-line tool for generating reports and managing evaluation data.
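
To give a feel for how these packages compose, here's a minimal sketch (illustrative only, not part of this commit's diff): the core package defines `IEvaluator`, `ChatConfiguration`, and the metric types, and the Quality package supplies LLM-based evaluators such as `CoherenceEvaluator`. Treat the overloads and member names below as assumptions to verify against the current package documentation; `GetChatClient` is a hypothetical placeholder for however your app constructs its `IChatClient`.

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Hypothetical placeholder: build an IChatClient however your app already does
// (for example, from an Azure OpenAI or OpenAI chat client).
IChatClient chatClient = GetChatClient();

// Tells LLM-based evaluators which model connection to use for grading.
var chatConfiguration = new ChatConfiguration(chatClient);

// Capture the response to evaluate. (The exact IChatClient call shape
// varies by Microsoft.Extensions.AI version; this one is assumed.)
var question = new ChatMessage(ChatRole.User, "What's an easy way to start an indoor herb garden?");
ChatResponse response = await chatClient.GetResponseAsync([question]);

// An LLM-based evaluator from the Quality package.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(question, response, chatConfiguration);

// Each evaluator reports one or more named metrics.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```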
@@ -23,7 +24,7 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi
 
 ## Comprehensive evaluation metrics
 
-The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators) and [safety](#safety-evaluators) evaluators and the metrics they measure.
+The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators), [NLP](#nlp-evaluators), and [safety](#safety-evaluators) evaluators and the metrics they measure.
 
 You can also add your own custom evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface.
 
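To make that customization point concrete, here's a minimal sketch of a custom evaluator (illustrative only, not part of this commit's diff). The member shapes follow the documented `IEvaluator` surface, but treat the exact signatures as assumptions; the word-count metric itself has no evaluative value and exists only to show the plumbing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative only: scores a response by its word count, with no LLM involved.
public sealed class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    // The names of the metrics this evaluator produces.
    public IReadOnlyCollection<string> EvaluationMetricNames => new[] { WordCountMetricName };

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count whitespace-separated words in the response text.
        int wordCount = (modelResponse.Text ?? string.Empty)
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Length;

        var metric = new NumericMetric(WordCountMetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```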
@@ -41,9 +42,22 @@ Quality evaluators measure response quality. They use an LLM to perform the eval
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> | `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query |
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> | `Groundedness` | Evaluates how well a generated response aligns with the given context |
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> | `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.IntentResolutionEvaluator> | `Intent Resolution` | Evaluates an AI system's effectiveness at identifying and resolving user intent (agent-focused) |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.TaskAdherenceEvaluator> | `Task Adherence` | Evaluates an AI system's effectiveness at adhering to the task assigned to it (agent-focused) |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.ToolCallAccuracyEvaluator> | `Tool Call Accuracy` | Evaluates an AI system's effectiveness at using the tools supplied to it (agent-focused) |
 
 † This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
 
+### NLP evaluators
+
+NLP evaluators evaluate the quality of an LLM response by comparing it to a reference response using natural language processing (NLP) techniques. These evaluators aren't LLM or AI-based; instead, they use traditional NLP techniques to perform text comparisons.
+
+| Evaluator type | Metric | Description |
+|----------------|--------|-------------|
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.BLEUEvaluator> | `BLEU` | Evaluates a response by comparing it to one or more reference responses using the bilingual evaluation understudy (BLEU) algorithm. This algorithm is commonly used to evaluate the quality of machine-translation or text-generation tasks. |
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.GLEUEvaluator> | `GLEU` | Measures the similarity between the generated response and one or more reference responses using the Google BLEU (GLEU) algorithm, a variant of the BLEU algorithm that's optimized for sentence-level evaluation. |
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.F1Evaluator> | `F1` | Evaluates a response by comparing it to a reference response using the *F1* scoring algorithm (the ratio of the number of shared words between the generated response and the reference response). |
+
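As a worked illustration of the *F1* description above (not the library's implementation, and not part of this commit's diff): treating both texts as bags of words, F1 is the harmonic mean of precision (shared words ÷ words in the response) and recall (shared words ÷ words in the reference).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Example: precision = 3/3, recall = 3/4, F1 = 2 * 1.0 * 0.75 / 1.75 ≈ 0.857.
Console.WriteLine(TokenOverlapF1("the cat sat", "the cat sat down"));

// Illustrative only: a whitespace-tokenized, bag-of-words F1 score.
// (The library's F1Evaluator may tokenize and normalize differently.)
static double TokenOverlapF1(string response, string reference)
{
    string[] responseTokens = response.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    string[] referenceTokens = reference.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // Count shared words, respecting how many times each word appears in the reference.
    Dictionary<string, int> remaining = referenceTokens
        .GroupBy(t => t)
        .ToDictionary(g => g.Key, g => g.Count());

    int shared = 0;
    foreach (string token in responseTokens)
    {
        if (remaining.TryGetValue(token, out int count) && count > 0)
        {
            shared++;
            remaining[token] = count - 1;
        }
    }

    if (shared == 0)
    {
        return 0;
    }

    double precision = (double)shared / responseTokens.Length;
    double recall = (double)shared / referenceTokens.Length;

    // F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall);
}
```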
 ### Safety evaluators
 
 Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
