
Commit 11dd09c

Merge branch 'release-build-ai-foundry' of github.com:MicrosoftDocs/azure-ai-docs-pr into sdg-release-dan-qs
2 parents 7c429de + 340517b

120 files changed: +3,739 −870 lines

articles/ai-foundry/.openpublishing.redirection.ai-studio.json

Lines changed: 5 additions & 0 deletions
@@ -215,6 +215,11 @@
       "redirect_url": "/azure/ai-foundry/concepts/models-featured#mistral-ai",
       "redirect_document_id": false
     },
+    {
+      "source_path_from_root": "/articles/ai-foundry/how-to/model-catalog-overview.md",
+      "redirect_url": "/azure/ai-foundry/concepts/foundry-models-overview",
+      "redirect_document_id": false
+    },
     {
       "source_path_from_root": "/articles/ai-studio/how-to/deploy-models-mistral-open.md",
       "redirect_url": "/azure/ai-foundry/how-to/deploy-models-mistral-open",

articles/ai-foundry/concepts/content-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ author: PatrickFarley

 The content filtering system is powered by [Azure AI Content Safety](../../ai-services/content-safety/overview.md), and it works by running both the prompt input and completion output through a set of classification models designed to detect and prevent the output of harmful content. Variations in API configurations and application design might affect completions and thus filtering behavior.

-With Azure OpenAI model deployments, you can use the default content filter or create your own content filter (described later on). Models available through **serverless APIs** have content filtering enabled by default. To learn more about the default content filter enabled for serverless APIs, see [Guardrails & controls for Azure Direct Models in the model catalog](model-catalog-content-safety.md).
+With Azure OpenAI model deployments, you can use the default content filter or create your own content filter (described later on). Models available through **standard deployments** have content filtering enabled by default. To learn more about the default content filter enabled for standard deployments, see [Content safety for models curated by Azure AI in the model catalog](model-catalog-content-safety.md).

 ## Language support

articles/ai-foundry/concepts/deployments-overview.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ The model catalog in Azure AI Foundry portal is the hub to discover and use a wi

 Deployment options vary depending on the model offering:

 * **Azure OpenAI in Azure AI Foundry Models:** The latest OpenAI models that have enterprise features from Azure with flexible billing options.
-* **Standard deployment:** These models don't require compute quota from your subscription and are billed per token in a pay-as-you-go fashion.
+* **Standard deployment:** These models don't require compute quota from your subscription and are billed per token in a serverless, pay-per-token offer.
 * **Open and custom models:** The model catalog offers access to a large variety of models across modalities, including models of open access. You can host open models in your own subscription with a managed infrastructure, virtual machines, and the number of instances for capacity management.

 Azure AI Foundry offers four different deployment options:

@@ -39,7 +39,7 @@ Azure AI Foundry offers four different deployment options:

 | Billing bases | Token usage & [provisioned throughput units](../../ai-services/openai/concepts/provisioned-throughput.md) | Token usage | Token usage<sup>1</sup> | Compute core hours<sup>2</sup> |
 | Deployment instructions | [Deploy to Azure OpenAI](../how-to/deploy-models-openai.md) | [Deploy to Foundry Models](../model-inference/how-to/create-model-deployments.md) | [Deploy to Standard deployment](../how-to/deploy-models-serverless.md) | [Deploy to Managed compute](../how-to/deploy-models-managed.md) |

-<sup>1</sup> A minimal endpoint infrastructure is billed per minute. You aren't billed for the infrastructure that hosts the model in pay-as-you-go. After you delete the endpoint, no further charges accrue.
+<sup>1</sup> A minimal endpoint infrastructure is billed per minute. You aren't billed for the infrastructure that hosts the model in standard deployment. After you delete the endpoint, no further charges accrue.

 <sup>2</sup> Billing is on a per-minute basis, depending on the product tier and the number of instances used in the deployment since the moment of creation. After you delete the endpoint, no further charges accrue.

articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 12 additions & 0 deletions
@@ -19,6 +19,12 @@ Agents are powerful productivity assistants. They can plan, make decisions, and

 Agents emit messages, and providing the above inputs typically requires parsing messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an [end-to-end example of evaluating agents in Azure AI Agent Service](https://aka.ms/e2e-agent-eval-sample).

+Besides `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence`, which are specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows by leveraging our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
+- **Quality**: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
+- **Safety**: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
+
+We show examples of `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` here. For other evaluators with Azure AI agent message support, see more examples in [evaluating Azure AI agents](../../how-to/develop/agent-evaluate-sdk.md#evaluate-azure-ai-agents).
+
 ## Model configuration for AI-assisted evaluators

 For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM-judge:

@@ -37,6 +43,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Intent resolution

 `IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. A higher score means better identification of user intent.
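
For orientation, here's a minimal sketch of invoking this evaluator with the model configuration above; the endpoint, deployment name, query, and response are illustrative placeholders, not the article's own sample:

```python
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration, IntentResolutionEvaluator

# Placeholder LLM-judge configuration; substitute your own endpoint,
# API key, and deployment name.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment="o3-mini",
    api_version="2024-10-21",
)

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

# A higher score indicates the response better resolves the user's intent.
result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:45 PM.",
)
print(result)
```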
@@ -83,6 +92,9 @@ If you're building agents outside of Azure AI Agent Service, this evaluator accep

 `ToolCallAccuracyEvaluator` measures an agent's ability to select appropriate tools and to extract and process correct parameters from previous steps of the agentic workflow. It detects whether each tool call made is accurate (binary) and reports back the average score, which can be interpreted as a passing rate across tool calls made.

+> [!NOTE]
+> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation; it doesn't support Built-in Tool evaluation. The agent messages must have at least one Function Tool actually called in order to be evaluated.
+
 ### Tool call accuracy example

 ```python
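# The diff view truncates the article's example at this point; the code
# below is a minimal sketch under assumed inputs, not the file's actual
# sample. It assumes the model_config defined earlier and a single
# Function Tool call made by the agent (see the note above).
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    # The tool calls the agent actually made.
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    # The Function Tool definitions the agent had available.
    tool_definitions=[
        {
            "id": "fetch_weather",
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for.",
                    }
                },
            },
        }
    ],
)
print(result)  # per-call accuracy details and an average pass rate
```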

articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Coherence

 `CoherenceEvaluator` measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.
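
As a quick sketch (assuming the `model_config` defined above; the query and response strings are placeholders), the evaluator is called directly on a query and response pair:

```python
from azure.ai.evaluation import CoherenceEvaluator

# Assumes the model_config defined in the snippet above.
coherence = CoherenceEvaluator(model_config=model_config)

result = coherence(
    query="What is the capital of France?",
    response="Paris is the capital of France. It's also the country's largest city.",
)
print(result)  # typically a 1-5 coherence score with a pass/fail result and reason
```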

articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 33 additions & 24 deletions
@@ -39,6 +39,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Retrieval

 Retrieval quality is very important given its upstream role in RAG: if the retrieval quality is poor and the response requires corpus-specific knowledge, there's less chance your LLM model gives you a satisfactory answer. `RetrievalEvaluator` measures the **textual quality** of retrieval results with an LLM without requiring ground truth (also known as query relevance judgment), which provides value compared to `DocumentRetrievalEvaluator`, which measures `ndcg`, `xdcg`, `fidelity`, and other classical information retrieval metrics that require ground truth. This metric focuses on how relevant the context chunks (encoded as a string) are to address a query and how the most relevant context chunks are surfaced at the top of the list.
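
As a minimal sketch (assuming the `model_config` defined above; the query and context are placeholders), `RetrievalEvaluator` takes the query and the retrieved context chunks encoded as a single string:

```python
from azure.ai.evaluation import RetrievalEvaluator

# Assumes the model_config defined in the snippet above.
retrieval = RetrievalEvaluator(model_config=model_config)

result = retrieval(
    query="Where was Marie Curie born?",
    # Retrieved context chunks, concatenated into one string.
    context=(
        "Chunk 1: Marie Curie was born on 7 November 1867 in Warsaw, Poland.\n"
        "Chunk 2: Marie Curie was a physicist and chemist who conducted "
        "pioneering research on radioactivity."
    ),
)
print(result)  # typically a 1-5 retrieval score with a pass/fail result and reason
```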
@@ -90,69 +93,74 @@ Retrieval quality is very important given its upstream role in RAG: if the retri

 ```python
 from azure.ai.evaluation import DocumentRetrievalEvaluator

+# These query relevance labels are assigned by your human or LLM judges.
 retrieval_ground_truth = [
     {
         "document_id": "1",
-        "query_relevance_judgement": 4
+        "query_relevance_label": 4
     },
     {
         "document_id": "2",
-        "query_relevance_judgement": 2
+        "query_relevance_label": 2
     },
     {
         "document_id": "3",
-        "query_relevance_judgement": 3
+        "query_relevance_label": 3
     },
     {
         "document_id": "4",
-        "query_relevance_judgement": 1
+        "query_relevance_label": 1
     },
     {
         "document_id": "5",
-        "query_relevance_judgement": 0
+        "query_relevance_label": 0
     },
 ]
+# The minimum and maximum of the label scores are inputs to the document retrieval evaluator.
+ground_truth_label_min = 0
+ground_truth_label_max = 4

+# These relevance scores come from your search retrieval system.
 retrieved_documents = [
     {
         "document_id": "2",
-        "query_relevance_judgement": 45.1
+        "relevance_score": 45.1
     },
     {
         "document_id": "6",
-        "query_relevance_judgement": 35.8
+        "relevance_score": 35.8
     },
     {
         "document_id": "3",
-        "query_relevance_judgement": 29.2
+        "relevance_score": 29.2
     },
     {
         "document_id": "5",
-        "query_relevance_judgement": 25.4
+        "relevance_score": 25.4
     },
     {
         "document_id": "7",
-        "query_relevance_judgement": 18.8
+        "relevance_score": 18.8
     },
 ]

-default_threshold = {
-    "ndcg@3": 0.5,
-    "xdcg@3": 0.5,
-    "fidelity": 0.5,
-    "top1_relevance": 50,
-    "top3_max_relevance": 50,
-    "total_retrieved_documents": 50,
-    "total_ground_truth_documents": 50,
-}
-
-document_retrieval_evaluator = DocumentRetrievalEvaluator(threshold=default_threshold)
+document_retrieval_evaluator = DocumentRetrievalEvaluator(
+    ground_truth_label_min=ground_truth_label_min,
+    ground_truth_label_max=ground_truth_label_max,
+    ndcg_threshold=0.5,
+    xdcg_threshold=50.0,
+    fidelity_threshold=0.5,
+    top1_relevance_threshold=50.0,
+    top3_max_relevance_threshold=50.0,
+    total_retrieved_documents_threshold=50,
+    total_ground_truth_documents_threshold=50
+)
 document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)
 ```

 ### Document retrieval output

-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+All numerical scores have `higher_is_better=True`, except for `holes` and `holes_ratio`, which have `higher_is_better=False`. Given a numerical threshold for each metric, we also output "pass" if the score meets the threshold, or "fail" otherwise.

 ```python
 {
@@ -163,15 +171,16 @@
     "top3_max_relevance": 2,
     "holes": 30,
     "holes_ratio": 0.6000000000000001,
-    "holes_is_higher_better": False,
-    "holes_ratio_is_higher_better": False,
+    "holes_higher_is_better": False,
+    "holes_ratio_higher_is_better": False,
     "total_retrieved_documents": 50,
     "total_groundtruth_documents": 1565,
     "ndcg@3_result": "pass",
     "xdcg@3_result": "pass",
     "fidelity_result": "fail",
     "top1_relevance_result": "fail",
     "top3_max_relevance_result": "fail",
+    # omitting more fields ...
 }
 ```
177186

articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md

Lines changed: 3 additions & 0 deletions
@@ -33,6 +33,9 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```

+> [!TIP]
+> We recommend using `o3-mini` for a balance of reasoning capability and cost efficiency.
+
 ## Similarity

 `SimilarityEvaluator` measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.
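
As a minimal sketch (assuming the `model_config` defined above; the strings are placeholders), `SimilarityEvaluator` compares a response against a ground truth in the context of a query:

```python
from azure.ai.evaluation import SimilarityEvaluator

# Assumes the model_config defined in the snippet above.
similarity = SimilarityEvaluator(model_config=model_config)

result = similarity(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
    ground_truth="France's capital city is Paris.",
)
print(result)  # typically a 1-5 similarity score
```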

articles/ai-foundry/concepts/fine-tuning-overview.md

Lines changed: 4 additions & 4 deletions
@@ -84,15 +84,15 @@ It's important to call out that fine-tuning is heavily dependent on the quality

 ## Supported models for fine-tuning

 Now that you know when to use fine-tuning for your use case, you can go to Azure AI Foundry to find models available to fine-tune.
-For some models in the model catalog, fine-tuning is available by using a serverless API, a managed compute (preview), or both.
+For some models in the model catalog, fine-tuning is available by using a standard deployment, a managed compute (preview), or both.

-Fine-tuning is available in specific Azure regions for some models that are deployed via serverless APIs. To fine-tune such models, a user must have a hub/project in the region where the model is available for fine-tuning. See [Region availability for models in serverless API endpoints](../how-to/deploy-models-serverless-availability.md) for detailed information.
+Fine-tuning is available in specific Azure regions for some models that are deployed via standard deployments. To fine-tune such models, a user must have a hub/project in the region where the model is available for fine-tuning. See [Region availability for models in standard deployment](../how-to/deploy-models-serverless-availability.md) for detailed information.

 For more information on fine-tuning using a managed compute (preview), see [Fine-tune models using managed compute (preview)](../how-to/fine-tune-managed-compute.md).

 For details about Azure OpenAI in Azure AI Foundry Models that are available for fine-tuning, see the [Azure OpenAI in Foundry Models documentation](../../ai-services/openai/concepts/models.md#fine-tuning-models) or the [Azure OpenAI models table](#fine-tuning-azure-openai-models) later in this guide.

-For the Azure OpenAI Service models that you can fine-tune, supported regions for fine-tuning include North Central US, Sweden Central, and more.
+For the Azure OpenAI Service models that you can fine-tune, supported regions for fine-tuning include North Central US, Sweden Central, and more.

 ### Fine-tuning Azure OpenAI models

@@ -102,5 +102,5 @@ For the Azure OpenAI Service models that you can fine tune, supported regions f

 - [Fine-tune models using managed compute (preview)](../how-to/fine-tune-managed-compute.md)
 - [Fine-tune an Azure OpenAI model in Azure AI Foundry portal](../../ai-services/openai/how-to/fine-tuning.md?context=/azure/ai-studio/context/context)
-- [Fine-tune models using serverless API](../how-to/fine-tune-serverless.md)
+- [Fine-tune models using standard deployment](../how-to/fine-tune-serverless.md)
