Commit 33332bd

Add new evaluators and package (#47569)
1 parent 76b8518 commit 33332bd

docs/ai/conceptual/evaluation-libraries.md

Lines changed: 18 additions & 4 deletions
@@ -2,7 +2,7 @@
 title: The Microsoft.Extensions.AI.Evaluation libraries
 description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
 ms.topic: concept-article
-ms.date: 05/13/2025
+ms.date: 07/24/2025
 ---
 # The Microsoft.Extensions.AI.Evaluation libraries

@@ -11,8 +11,9 @@ The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluat
 The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages:
 
 - [📦 Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) – Defines the core abstractions and types for supporting evaluation.
-- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
-- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains evaluators, such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
+- [📦 Microsoft.Extensions.AI.Evaluation.NLP](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.NLP) – Contains [evaluators](#nlp-evaluators) that evaluate the similarity of an LLM's response text to one or more reference responses using natural language processing (NLP) metrics. These evaluators aren't LLM or AI-based; they use traditional NLP techniques such as text tokenization and n-gram analysis to evaluate text similarity.
+- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains [evaluators](#quality-evaluators) that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
+- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains [evaluators](#safety-evaluators), such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
 - [📦 Microsoft.Extensions.AI.Evaluation.Reporting](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data.
 - [📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) – Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an [Azure Storage](/azure/storage/common/storage-introduction) container.
 - [📦 Microsoft.Extensions.AI.Evaluation.Console](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) – A command-line tool for generating reports and managing evaluation data.
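
To give a feel for how these packages compose, here's a minimal sketch (illustrative only, not part of this commit's diff): the core package defines `IEvaluator`, `ChatConfiguration`, and the metric types, and the Quality package supplies LLM-based evaluators such as `CoherenceEvaluator`. Treat the overloads and member names below as assumptions to verify against the current package documentation; `GetChatClient` is a hypothetical placeholder for however your app constructs its `IChatClient`.

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Hypothetical placeholder: build an IChatClient however your app already does
// (for example, from an Azure OpenAI or OpenAI chat client).
IChatClient chatClient = GetChatClient();

// Tells LLM-based evaluators which model connection to use for grading.
var chatConfiguration = new ChatConfiguration(chatClient);

// Capture the response to evaluate. (The exact IChatClient call shape
// varies by Microsoft.Extensions.AI version; this one is assumed.)
var question = new ChatMessage(ChatRole.User, "What's an easy way to start an indoor herb garden?");
ChatResponse response = await chatClient.GetResponseAsync([question]);

// An LLM-based evaluator from the Quality package.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(question, response, chatConfiguration);

// Each evaluator reports one or more named metrics.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```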
@@ -23,7 +24,7 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi
 
 ## Comprehensive evaluation metrics
 
-The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators) and [safety](#safety-evaluators) evaluators and the metrics they measure.
+The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators), [NLP](#nlp-evaluators), and [safety](#safety-evaluators) evaluators and the metrics they measure.
 
 You can also add your own custom evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface.
 
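To make that customization point concrete, here's a minimal sketch of a custom evaluator (illustrative only, not part of this commit's diff). The member shapes follow the documented `IEvaluator` surface, but treat the exact signatures as assumptions; the word-count metric itself has no evaluative value and exists only to show the plumbing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative only: scores a response by its word count, with no LLM involved.
public sealed class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    // The names of the metrics this evaluator produces.
    public IReadOnlyCollection<string> EvaluationMetricNames => new[] { WordCountMetricName };

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count whitespace-separated words in the response text.
        int wordCount = (modelResponse.Text ?? string.Empty)
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Length;

        var metric = new NumericMetric(WordCountMetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```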
@@ -41,9 +42,22 @@ Quality evaluators measure response quality. They use an LLM to perform the eval
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> | `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query |
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> | `Groundedness` | Evaluates how well a generated response aligns with the given context |
 | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> | `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.IntentResolutionEvaluator> | `Intent Resolution` | Evaluates an AI system's effectiveness at identifying and resolving user intent (agent-focused) |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.TaskAdherenceEvaluator> | `Task Adherence` | Evaluates an AI system's effectiveness at adhering to the task assigned to it (agent-focused) |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.ToolCallAccuracyEvaluator> | `Tool Call Accuracy` | Evaluates an AI system's effectiveness at using the tools supplied to it (agent-focused) |
 
 † This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
 
+### NLP evaluators
+
+NLP evaluators evaluate the quality of an LLM response by comparing it to a reference response using natural language processing (NLP) techniques. These evaluators aren't LLM or AI-based; instead, they use traditional NLP techniques to perform text comparisons.
+
+| Evaluator type | Metric | Description |
+|----------------|--------|-------------|
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.BLEUEvaluator> | `BLEU` | Evaluates a response by comparing it to one or more reference responses using the bilingual evaluation understudy (BLEU) algorithm. This algorithm is commonly used to evaluate the quality of machine-translation or text-generation tasks. |
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.GLEUEvaluator> | `GLEU` | Measures the similarity between the generated response and one or more reference responses using the Google BLEU (GLEU) algorithm, a variant of the BLEU algorithm that's optimized for sentence-level evaluation. |
+| <xref:Microsoft.Extensions.AI.Evaluation.NLP.F1Evaluator> | `F1` | Evaluates a response by comparing it to a reference response using the *F1* scoring algorithm (the ratio of the number of shared words between the generated response and the reference response). |
+
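As a worked illustration of the *F1* description above (not the library's implementation, and not part of this commit's diff): treating both texts as bags of words, F1 is the harmonic mean of precision (shared words ÷ words in the response) and recall (shared words ÷ words in the reference).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Example: precision = 3/3, recall = 3/4, F1 = 2 * 1.0 * 0.75 / 1.75 ≈ 0.857.
Console.WriteLine(TokenOverlapF1("the cat sat", "the cat sat down"));

// Illustrative only: a whitespace-tokenized, bag-of-words F1 score.
// (The library's F1Evaluator may tokenize and normalize differently.)
static double TokenOverlapF1(string response, string reference)
{
    string[] responseTokens = response.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    string[] referenceTokens = reference.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // Count shared words, respecting how many times each word appears in the reference.
    Dictionary<string, int> remaining = referenceTokens
        .GroupBy(t => t)
        .ToDictionary(g => g.Key, g => g.Count());

    int shared = 0;
    foreach (string token in responseTokens)
    {
        if (remaining.TryGetValue(token, out int count) && count > 0)
        {
            shared++;
            remaining[token] = count - 1;
        }
    }

    if (shared == 0)
    {
        return 0;
    }

    double precision = (double)shared / responseTokens.Length;
    double recall = (double)shared / referenceTokens.Length;

    // F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall);
}
```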
 ### Safety evaluators
 
 Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
