docs/ai/conceptual/evaluation-libraries.md

---
title: The Microsoft.Extensions.AI.Evaluation libraries
description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
ms.topic: concept-article
ms.date: 07/24/2025
---
# The Microsoft.Extensions.AI.Evaluation libraries

The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.

The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages:

- [📦 Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) – Defines the core abstractions and types for supporting evaluation.
- [📦 Microsoft.Extensions.AI.Evaluation.NLP](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.NLP) – Contains [evaluators](#nlp-evaluators) that evaluate the similarity of an LLM's response text to one or more reference responses using natural language processing (NLP) metrics. These evaluators aren't LLM or AI-based; they use traditional NLP techniques such as text tokenization and n-gram analysis to evaluate text similarity.
- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains [evaluators](#quality-evaluators) that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains [evaluators](#safety-evaluators), such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
- [📦 Microsoft.Extensions.AI.Evaluation.Reporting](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data (see the sketch after this list).
- [📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) – Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an [Azure Storage](/azure/storage/common/storage-introduction) container.
- [📦 Microsoft.Extensions.AI.Evaluation.Console](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) – A command-line tool for generating reports and managing evaluation data.

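To give a sense of how the packages fit together, here's a minimal sketch of the reporting workflow under the preview API: a disk-based reporting configuration wires up evaluators, response caching, and result storage, and each scenario is evaluated inside a `ScenarioRun`. The `GetChatConfiguration` helper is hypothetical, and the parameter names shown for `DiskBasedReportingConfiguration.Create` are assumptions that might differ in the version you install.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

// Hypothetical helper that wraps your app's IChatClient in a ChatConfiguration.
ChatConfiguration chatConfiguration = GetChatConfiguration();

// Store cached LLM responses and evaluation results on disk.
// Assumption: parameter names follow the documented preview API.
ReportingConfiguration reportingConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",
    evaluators: [new CoherenceEvaluator()],
    chatConfiguration: chatConfiguration,
    enableResponseCaching: true);

// Each scenario gets its own run; disposing the run persists its results.
await using ScenarioRun scenarioRun =
    await reportingConfiguration.CreateScenarioRunAsync("My first scenario");

List<ChatMessage> messages = [new ChatMessage(ChatRole.User, "What's a NuGet package?")];

// The run's chat client transparently caches responses when caching is enabled.
ChatResponse response =
    await scenarioRun.ChatConfiguration!.ChatClient.GetResponseAsync(messages);

// Evaluate the response; results are written to the configured store.
EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response);
```
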
## Comprehensive evaluation metrics

The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators), [NLP](#nlp-evaluators), and [safety](#safety-evaluators) evaluators and the metrics they measure.

You can also add your own custom evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface, as sketched in the following example.
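
The following is a minimal sketch of a custom evaluator, assuming the `IEvaluator` shape described for the core package (a collection of metric names plus an `EvaluateAsync` method). The `WordCountEvaluator` type and its scoring rule are hypothetical, and exact member signatures can vary across preview versions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Hypothetical evaluator that reports the word count of a response as a metric.
// Illustrative only; it's not part of the library.
public sealed class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count the words in the response text.
        int wordCount = modelResponse.Text
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Length;

        // Report the count as a single numeric metric.
        var metric = new NumericMetric(WordCountMetricName) { Value = wordCount };
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```
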
### Quality evaluators

Quality evaluators measure response quality. They use an LLM to perform the evaluation.

| Evaluator | Metric | Description |
|-----------|--------|-------------|
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator>|`Equivalence`| Evaluates the similarity between the generated text and its ground truth with respect to a query |
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator>|`Groundedness`| Evaluates how well a generated response aligns with the given context |
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator>†|`Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)`| Evaluates how relevant, truthful, and complete a response is |
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.IntentResolutionEvaluator>|`Intent Resolution`| Evaluates an AI system's effectiveness at identifying and resolving user intent (agent-focused) |
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.TaskAdherenceEvaluator>|`Task Adherence`| Evaluates an AI system's effectiveness at adhering to the task assigned to it (agent-focused) |
|<xref:Microsoft.Extensions.AI.Evaluation.Quality.ToolCallAccuracyEvaluator>|`Tool Call Accuracy`| Evaluates an AI system's effectiveness at using the tools supplied to it (agent-focused) |

† This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
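
To illustrate the calling pattern, here's a minimal sketch that runs the `CoherenceEvaluator` (another evaluator from the Quality package) against a single response; the evaluators in the preceding table are invoked the same way, although some require additional context. The `GetChatClient` helper is hypothetical and stands in for however your app obtains an <xref:Microsoft.Extensions.AI.IChatClient>, and member names can vary across preview versions.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Hypothetical helper that returns the IChatClient your app already uses.
IChatClient chatClient = GetChatClient();

List<ChatMessage> messages =
    [new ChatMessage(ChatRole.User, "Explain what the BLEU metric measures.")];

// Get the response to evaluate.
ChatResponse response = await chatClient.GetResponseAsync(messages);

// Quality evaluators use an LLM to judge the response, so they need a ChatConfiguration.
var chatConfiguration = new ChatConfiguration(chatClient);

IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(messages, response, chatConfiguration);

// Each evaluator reports one or more named metrics.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```
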
### NLP evaluators
NLP evaluators evaluate the quality of an LLM response by comparing it to a reference response using natural language processing (NLP) techniques. These evaluators aren't LLM or AI-based; instead, they use traditional NLP techniques, such as text tokenization and n-gram analysis, to perform text comparisons.

| Evaluator | Metric | Description |
|-----------|--------|-------------|
|<xref:Microsoft.Extensions.AI.Evaluation.NLP.BLEUEvaluator>|`BLEU`| Evaluates a response by comparing it to one or more reference responses using the bilingual evaluation understudy (BLEU) algorithm. This algorithm is commonly used to evaluate the quality of machine-translation or text-generation tasks. |
|<xref:Microsoft.Extensions.AI.Evaluation.NLP.GLEUEvaluator>|`GLEU`| Measures the similarity between the generated response and one or more reference responses using the Google BLEU (GLEU) algorithm, a variant of the BLEU algorithm that's optimized for sentence-level evaluation. |
|<xref:Microsoft.Extensions.AI.Evaluation.NLP.F1Evaluator>|`F1`| Evaluates a response by comparing it to a reference response using the *F1* scoring algorithm (the ratio of the number of shared words between the generated response and the reference response). |
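
As a sketch of how an NLP evaluator might be wired up: the reference responses are supplied through an evaluation context object rather than through an LLM connection, so no `ChatConfiguration` is needed. The `BLEUEvaluatorContext` type and constructor shown here are assumptions based on the library's naming conventions; check the package for the exact API.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.NLP;

// The conversation, the model response to score, and the reference (ground-truth) responses.
List<ChatMessage> messages =
    [new ChatMessage(ChatRole.User, "Translate: 'Le chat est assis sur le tapis.'")];
var response = new ChatResponse(
    new ChatMessage(ChatRole.Assistant, "The cat sat on the mat."));
string[] references = ["The cat is sitting on the mat."];

IEvaluator bleu = new BLEUEvaluator();

// No ChatConfiguration is passed (null) because NLP evaluators don't call an LLM.
// Assumption: the references are supplied via a BLEUEvaluatorContext.
EvaluationResult result = await bleu.EvaluateAsync(
    messages, response, null, [new BLEUEvaluatorContext(references)]);

// "BLEU" is the metric name from the preceding table.
NumericMetric score = result.Get<NumericMetric>("BLEU");
Console.WriteLine($"BLEU: {score.Value}");
```
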
### Safety evaluators
Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
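
As a sketch of the setup these evaluators require: the response isn't judged locally but is sent to the Azure AI Foundry Evaluation service through a `ContentSafetyServiceConfiguration`. The constructor arguments below are placeholders that reflect one documented subscription-based setup, and the exact configuration surface can vary across preview versions.

```csharp
using Azure.Identity;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Safety;

// The conversation and the model response to check.
List<ChatMessage> messages =
    [new ChatMessage(ChatRole.User, "Describe your favorite hike.")];
var response = new ChatResponse(
    new ChatMessage(ChatRole.Assistant, "I enjoyed a calm hike along the coast."));

// Placeholder values identifying your Azure AI Foundry project.
// Assumption: parameter names follow the documented subscription-based setup.
var serviceConfiguration = new ContentSafetyServiceConfiguration(
    credential: new DefaultAzureCredential(),
    subscriptionId: "<subscription-id>",
    resourceGroupName: "<resource-group>",
    projectName: "<project-name>");

// Adapt the service configuration into a ChatConfiguration for the evaluators.
ChatConfiguration chatConfiguration = serviceConfiguration.ToChatConfiguration();

IEvaluator evaluator = new ContentHarmEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(messages, response, chatConfiguration);
// The result typically contains one metric per content-harm category.
```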