Commit a2a9b2e

Merge branch 'release-build-ai-foundry' into rebrand-for-standard
2 parents a063140 + 06e53f4 commit a2a9b2e

41 files changed, +180 -142 lines changed


articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 12 additions & 0 deletions
@@ -19,6 +19,12 @@ Agents are powerful productivity assistants. They can plan, make decisions, and
 
 Agents emit messages, and providing the above inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).
 
+Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`, which are specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows, leveraging our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+- **Quality**: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
+- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
+
+We show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` here. For other evaluators with Azure AI agent message support, see more examples in [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
+
 ## Model configuration for AI-assisted evaluators
 
 For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM-judge:
@@ -37,6 +43,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```
 
+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Intent resolution
 
 `IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. A higher score means better identification of user intent.
@@ -83,6 +92,9 @@ If you're building agents outside of Azure AI Agent Service, this evaluator accep
 
 `ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools and to extract and process correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports back the average score, which can be interpreted as a passing rate across tool calls made.
 
+> [!NOTE]
+> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation; it doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called to be evaluated.
+
 ### Tool call accuracy example
 
 ```python

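As described above, the evaluator averages a binary accuracy over the tool calls made. That aggregation can be sketched in plain Python; the helper below is illustrative only, not the SDK's implementation:

```python
def tool_call_pass_rate(per_call_accuracy: list[bool]) -> float:
    """Average binary per-tool-call accuracies into a passing rate."""
    if not per_call_accuracy:
        # mirrors the requirement that at least one tool call is evaluated
        raise ValueError("at least one tool call is required")
    return sum(per_call_accuracy) / len(per_call_accuracy)
```

For example, two accurate calls out of three yield a passing rate of about 0.67.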
articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```
 
+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Coherence
 
 `CoherenceEvaluator` measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.

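Per the tip above, a judge configuration pointing at an `o3-mini` deployment can be sketched as below. The endpoint, key, and deployment name are placeholders, and the plain-dictionary form shown is one accepted shape of `model_config`; adapt it to your own resource:

```python
# Placeholder values; substitute your Azure OpenAI resource details.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "o3-mini",  # the recommended judge deployment
}
```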
articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 33 additions & 24 deletions
@@ -39,6 +39,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```
 
+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Retrieval
 
 Retrieval quality is very important given its upstream role in RAG: if the retrieval quality is poor and the response requires corpus-specific knowledge, there's less chance your LLM model gives you a satisfactory answer. `RetrievalEvaluator` measures the **textual quality** of retrieval results with an LLM without requiring ground truth (also known as query relevance judgment), which provides value compared to `DocumentRetrievalEvaluator` measuring `ndcg`, `xdcg`, `fidelity`, and other classical information retrieval metrics that require ground truth. This metric focuses on how relevant the context chunks (encoded as a string) are to address a query and how the most relevant context chunks are surfaced at the top of the list.
@@ -90,69 +93,74 @@ Retrieval quality is very important given its upstream role in RAG: if the retri
 ```python
 from azure.ai.evaluation import DocumentRetrievalEvaluator
 
+# these query_relevance_label values are given by your human- or LLM-judges
 retrieval_ground_truth = [
     {
         "document_id": "1",
-        "query_relevance_judgement": 4
+        "query_relevance_label": 4
     },
     {
         "document_id": "2",
-        "query_relevance_judgement": 2
+        "query_relevance_label": 2
     },
     {
         "document_id": "3",
-        "query_relevance_judgement": 3
+        "query_relevance_label": 3
     },
     {
         "document_id": "4",
-        "query_relevance_judgement": 1
+        "query_relevance_label": 1
     },
     {
         "document_id": "5",
-        "query_relevance_judgement": 0
+        "query_relevance_label": 0
     },
 ]
+# the min and max of the label scores are inputs to the document retrieval evaluator
+ground_truth_label_min = 0
+ground_truth_label_max = 4
 
+# these relevance scores come from your search retrieval system
 retrieved_documents = [
     {
         "document_id": "2",
-        "query_relevance_judgement": 45.1
+        "relevance_score": 45.1
     },
     {
         "document_id": "6",
-        "query_relevance_judgement": 35.8
+        "relevance_score": 35.8
     },
     {
         "document_id": "3",
-        "query_relevance_judgement": 29.2
+        "relevance_score": 29.2
     },
     {
         "document_id": "5",
-        "query_relevance_judgement": 25.4
+        "relevance_score": 25.4
     },
     {
         "document_id": "7",
-        "query_relevance_judgement": 18.8
+        "relevance_score": 18.8
     },
 ]
 
-default_threshold = {
-    "ndcg@3": 0.5,
-    "xdcg@3": 0.5,
-    "fidelity": 0.5,
-    "top1_relevance": 50,
-    "top3_max_relevance": 50,
-    "total_retrieved_documents": 50,
-    "total_ground_truth_documents": 50,
-}
-
-document_retrieval_evaluator = DocumentRetrievalEvaluator(threshold=default_threshold)
+document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    ground_truth_label_min=ground_truth_label_min,
+    ground_truth_label_max=ground_truth_label_max,
+    ndcg_threshold=0.5,
+    xdcg_threshold=50.0,
+    fidelity_threshold=0.5,
+    top1_relevance_threshold=50.0,
+    top3_max_relevance_threshold=50.0,
+    total_retrieved_documents_threshold=50,
+    total_ground_truth_documents_threshold=50
+)
 document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)
 ```
 
 ### Document retrieval output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio`, which have `high_is_better=False`. Given a numerical threshold (default 3), we also output "pass" if the score meets the threshold (score >= threshold for metrics where higher is better, score <= threshold for `holes` and `holes_ratio`), or "fail" otherwise.
 
 ```python
 {
@@ -163,15 +171,16 @@
     "top3_max_relevance": 2,
     "holes": 30,
     "holes_ratio": 0.6000000000000001,
-    "holes_is_higher_better": False,
-    "holes_ratio_is_higher_better": False,
+    "holes_higher_is_better": False,
+    "holes_ratio_higher_is_better": False,
     "total_retrieved_documents": 50,
     "total_groundtruth_documents": 1565,
     "ndcg@3_result": "pass",
     "xdcg@3_result": "pass",
     "fidelity_result": "fail",
     "top1_relevance_result": "fail",
     "top3_max_relevance_result": "fail",
+    # omitting more fields ...
 }
 ```

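To build intuition for the classical metrics above, here is a self-contained sketch that applies the textbook NDCG@3 definition and a holes count to the example inputs; the evaluator's internals may differ in detail:

```python
import math

# query relevance labels from the ground truth above
ground_truth = {"1": 4, "2": 2, "3": 3, "4": 1, "5": 0}
# document ids ranked by the retrieval system's relevance_score
retrieved = ["2", "6", "3", "5", "7"]

def dcg(labels):
    # graded gain discounted by log rank (textbook NDCG)
    return sum((2**lab - 1) / math.log2(i + 2) for i, lab in enumerate(labels))

# unjudged retrieved documents ("holes") contribute zero gain
top3 = [ground_truth.get(doc, 0) for doc in retrieved[:3]]
ideal = sorted(ground_truth.values(), reverse=True)[:3]
ndcg_at_3 = dcg(top3) / dcg(ideal)

holes = sum(1 for doc in retrieved if doc not in ground_truth)
holes_ratio = holes / len(retrieved)
```

Here documents 6 and 7 lack relevance judgments, so `holes` is 2 and `holes_ratio` is 0.4.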
articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md

Lines changed: 3 additions & 0 deletions
@@ -33,6 +33,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```
 
+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Similarity
 
 `SimilarityEvaluator` measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.

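For contrast with the semantic focus described above, the token-overlap style of metric can be sketched as an F1 over shared tokens; a paraphrase with no shared tokens scores zero, which is exactly the failure mode a semantic evaluator avoids (illustrative sketch only):

```python
from collections import Counter

def token_overlap_f1(response: str, ground_truth: str) -> float:
    """F1 over the multiset of shared whitespace tokens."""
    resp = response.lower().split()
    truth = ground_truth.lower().split()
    # multiset intersection counts each shared token occurrence once
    common = sum((Counter(resp) & Counter(truth)).values())
    if common == 0:
        return 0.0
    precision = common / len(resp)
    recall = common / len(truth)
    return 2 * precision * recall / (precision + recall)
```

A paraphrase like "affirmative" for "yes" scores 0.0 despite matching semantics.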
articles/ai-foundry/concepts/model-benchmarks.md

Lines changed: 17 additions & 21 deletions
@@ -27,7 +27,7 @@ Model leaderboards (preview) in Azure AI Foundry portal allow you to streamline
 Whenever you find a model to your liking, you can select it and zoom into the **Detailed benchmarking results** of the model within the model catalog. If satisfied with the model, you can deploy it, try it in the playground, or evaluate it on your data. The leaderboards support benchmarking across text language models (large language models (LLMs) and small language models (SLMs)) and embedding models.
 
 
-Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The benchmarks are updated regularly as better and more unsaturated datasets and associated metrics are added to existing models, and as new models are added to the model catalog.
+Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The leaderboards are updated regularly as better and more unsaturated benchmarks are onboarded, and as new models are added to the model catalog.
 
 
 ## Quality benchmarks of language models
@@ -40,37 +40,33 @@ Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standa
 
 Quality index is provided on a scale of zero to one. Higher values of quality index are better. The datasets included in quality index are:
 
-| Dataset name | Leaderboard category |
-|-------------------------|---------------------|
-| BoolQ | QA |
-| HellaSwag | Reasoning |
-| OpenBookQA | Reasoning |
-| PIQA | Reasoning |
-| Social IQA | Reasoning |
-| Winogrande | Reasoning |
-| TruthfulQA (MC) | Groundedness |
-| HumanEval | Coding |
-| GSM8K | Math |
-| MMLU (Humanities) | General Knowledge |
-| MMLU (Other) | General Knowledge |
-| MMLU (Social Sciences) | General Knowledge |
-| MMLU (STEM) | General Knowledge |
+| Dataset Name | Leaderboard Category |
+|--------------------|----------------------|
+| arena_hard | QA |
+| bigbench_hard | Reasoning |
+| gpqa | QA |
+| humanevalplus | Coding |
+| ifeval | Reasoning |
+| math | Math |
+| mbppplus | Coding |
+| mmlu_pro | General Knowledge |
 
 
 See more details in accuracy scores:
 
 | Metric | Description |
 |--------|-------------|
-| Accuracy | Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is `exact-match` in all cases, except for the _HumanEval_ and _MBPP_ datasets that uses a `pass@1` metric. Exact match compares model generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. The `pass@1` metric measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies for each model. |
+| Accuracy | Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is `exact-match` in all cases, except for the _HumanEval_ and _MBPP_ datasets that use a `pass@1` metric. Exact match compares model generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. The `pass@1` metric measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies for each model. |
 
 Accuracy scores are provided on a scale of zero to one. Higher values are better.
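The aggregation described in the table (a per-example metric averaged into a dataset-level score, then averaged into a model-level score) amounts to the following arithmetic; this is an illustrative sketch, not the evaluation pipeline's code:

```python
def exact_match(generated: str, answer: str) -> float:
    # one if the generated text matches the answer exactly, zero otherwise
    return 1.0 if generated.strip() == answer.strip() else 0.0

def dataset_accuracy(pairs) -> float:
    # dataset-level score: average metric over all examples
    return sum(exact_match(g, a) for g, a in pairs) / len(pairs)

def model_accuracy(dataset_scores) -> float:
    # model-level score: average of the dataset-level accuracies
    return sum(dataset_scores) / len(dataset_scores)
```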
 
 
 ## Safety benchmarks of language models
 
-Safety benchmarks use a standard metric Attack Success Rate to measure how vulerable language models are to attacks in biosecurity, cybersecurity, and chemical security. Currently, the [Weapons of Mass Destruction Proxy (WMDP) benchmark](https://www.wmdp.ai/) is used to assess hazardous knowledge in language models. The lower the Attack Success Rate is, the safer is the model response.
+Safety benchmarks use a standard metric, Attack Success Rate, to measure how vulnerable language models are to attacks in biosecurity, cybersecurity, and chemical security. Currently, the [Weapons of Mass Destruction Proxy (WMDP) benchmark](https://www.wmdp.ai/) is used to assess hazardous knowledge in language models. The lower the Attack Success Rate, the safer the model response.
 
-All model endpoints are benchmarked with the default Azure AI Content Safety filters on with a default configuration. These safety filters detect and block [content harm categories](../../ai-services/content-safety/concepts/harm-categories.md) in violence, self-harm, sexual, hate and unfaireness, but do not measure categories in cybersecurity, biosecurity, chemical security.
+All model endpoints are benchmarked with the default Azure AI Content Safety filters turned on, in their default configuration. These safety filters detect and block [content harm categories](../../ai-services/content-safety/concepts/harm-categories.md) in violence, self-harm, sexual, and hate and unfairness, but do not specifically cover categories in cybersecurity, biosecurity, and chemical security.
 
 
 ## Performance benchmarks of language models
@@ -135,7 +131,7 @@ Azure AI also displays the cost index as follows:
 
 ## Quality benchmarks of embedding models
 
-The quality index of embedding models is defined as the averaged accuracy scores of a comprehensive set of standard benchmark datasests targeting Information Retrieval, Document Clustering, and Summarization tasks.
+The quality index of embedding models is defined as the averaged accuracy scores of a comprehensive set of standard benchmark datasets targeting Information Retrieval, Document Clustering, and Summarization tasks.
 
 See more details in accuracy score definitions specific to each dataset:
 
@@ -155,7 +151,7 @@ See more details in accuracy score definitions specific to each dataset:
 
 Benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
 
-Prompt construction follows best practices for each dataset, as specified by the paper introducing the dataset and industry standards. In most cases, each prompt contains several _shots_, that is, several examples of complete questions and answers to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that's held out from evaluation.
+Prompt construction follows best practices for each dataset, as specified by the paper introducing the dataset and industry standards. In most cases, each prompt contains several _shots_, that is, several examples of complete questions and answers to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data held out from evaluation.
 
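The shot construction described above can be sketched as follows; the helper name and prompt formatting are hypothetical, not the pipeline's actual implementation:

```python
import random

def build_prompt(question: str, held_out_pairs, n_shots: int = 3, seed: int = 0) -> str:
    """Sample n_shots Q/A examples from held-out data to prime the model, then append the question."""
    shots = random.Random(seed).sample(held_out_pairs, n_shots)
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {question}\nA:")  # the example under evaluation, answer left blank
    return "\n\n".join(blocks)
```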
 ## Related content
 