articles/machine-learning/concept-endpoints-online.md (0 additions & 6 deletions)
@@ -396,12 +396,6 @@ For more information, see [Network isolation with managed online endpoints](conc
Monitoring for Azure Machine Learning endpoints is possible via integration with [Azure Monitor](monitor-azure-machine-learning.md#what-is-azure-monitor). This integration allows you to view metrics in charts, configure alerts, query from log tables, use Application Insights to analyze events from user containers, and so on.
-
* **Metrics**: Use Azure Monitor to track various endpoint metrics, such as request latency, and drill down to deployment or status level. You can also track deployment-level metrics, such as CPU/GPU utilization and drill down to instance level. Azure Monitor allows you to track these metrics in charts and set up dashboards and alerts for further analysis.
-
-
* **Logs**: Send metrics to a Log Analytics workspace, where you can query the logs using the Kusto query syntax. You can also send metrics to a storage account and/or Event Hubs for further processing. In addition, you can use dedicated log tables for online-endpoint-related events, traffic, and container logs. Kusto queries allow complex analysis that joins multiple tables.
-
-
* **Application Insights**: Curated environments include integration with Application Insights, and you can enable or disable this integration when you create an online deployment. Built-in metrics and logs are sent to Application Insights, and you can use its built-in features, such as Live metrics, Transaction search, Failures, and Performance, for further analysis.
-
For more information on monitoring, see [Monitor online endpoints](how-to-monitor-online-endpoints.md).
### Secret injection in online deployments (preview)
articles/machine-learning/how-to-monitor-online-endpoints.md (3 additions & 62 deletions)
@@ -67,78 +67,19 @@ Depending on the resource that you select, the metrics that you see will be diff
#### Metrics at endpoint scope
-
-__Traffic__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| RequestsPerMinute | Count | The number of requests sent to the endpoint within a minute | Average | Deployment, ModelStatusCode, StatusCode, StatusCodeClass | Alert me when I have <= 0 transactions in the system |
-| RequestLatency | Milliseconds | The complete interval of time taken to respond to a request | Average | Deployment | Alert me when average latency > 2 sec |
-| RequestLatency_P50 | Milliseconds | The request latency at the 50th percentile, aggregated over all request latency values collected in a 60-second period | Average | Deployment | Alert me when average latency > 2 sec |
-| RequestLatency_P90 | Milliseconds | The request latency at the 90th percentile, aggregated over all request latency values collected in a 60-second period | Average | Deployment | Alert me when average latency > 2 sec |
-| RequestLatency_P95 | Milliseconds | The request latency at the 95th percentile, aggregated over all request latency values collected in a 60-second period | Average | Deployment | Alert me when average latency > 2 sec |
-| RequestLatency_P99 | Milliseconds | The request latency at the 99th percentile, aggregated over all request latency values collected in a 60-second period | Average | Deployment | Alert me when average latency > 2 sec |
-
-__Network__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| NetworkBytes | Bytes per second | The bytes per second served for the endpoint | Average | - | - |
-| ConnectionsActive | Count | The total number of concurrent TCP connections active from clients | Average | - | - |
-| NewConnectionsPerSecond | Count | The average number of new TCP connections per second established from clients | Average | - | - |
-
-__Model Data Collection__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| DataCollectionEventsPerMinute | Count | The number of data collection events processed per minute | Average | Deployment, Type | - |
-| DataCollectionErrorsPerMinute | Count | The number of data collection events dropped per minute | Average | Deployment, Type, Reason | - |
-
For example, you can split along the deployment dimension to compare the request latency of different deployments under an endpoint.
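The `RequestLatency_P50`/`P90`/`P99` rows above describe percentile aggregation over all latency values in a 60-second window. A minimal sketch of that aggregation (nearest-rank percentile; the function name and sample values are illustrative, not part of any Azure SDK):

```python
def latency_percentile(latencies_ms, pct):
    """Return the pct-th percentile (nearest-rank method) of a list of latencies."""
    if not latencies_ms:
        raise ValueError("no samples in window")
    ordered = sorted(latencies_ms)
    # nearest-rank: the ceil(pct/100 * N)-th smallest value, 1-indexed
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

# one minute of hypothetical request latencies, in milliseconds
window = [120, 250, 90, 2100, 300, 180, 95, 400, 310, 275]
p50 = latency_percentile(window, 50)  # typical request
p90 = latency_percentile(window, 90)  # slow tail
p99 = latency_percentile(window, 99)  # worst-case tail
```

Note how a single slow request dominates P99 but leaves P50 untouched, which is why the tail percentiles are the ones worth alerting on.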
Bandwidth will be throttled if the quota limits are exceeded for _managed_ online endpoints. For more information on limits, see the article on [limits for online endpoints](how-to-manage-quotas.md#azure-machine-learning-online-endpoints-and-batch-endpoints). To determine if requests are throttled:
- Monitor the "Network bytes" metric
- The response trailers include the fields `ms-azureml-bandwidth-request-delay-ms` and `ms-azureml-bandwidth-response-delay-ms`. The values of these fields are the bandwidth throttling delays, in milliseconds.
+
For more information, see [Bandwidth limit issues](how-to-troubleshoot-online-endpoints.md#bandwidth-limit-issues).
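A hedged sketch of the trailer check described above, assuming your HTTP client exposes response trailers as a string-keyed mapping (`throttling_delay_ms` is an illustrative helper, not an Azure API):

```python
# Trailer field names documented for managed online endpoints
REQUEST_DELAY = "ms-azureml-bandwidth-request-delay-ms"
RESPONSE_DELAY = "ms-azureml-bandwidth-response-delay-ms"

def throttling_delay_ms(trailers):
    """Total delay (ms) added by bandwidth throttling; 0 means not throttled."""
    # Each field, when present, holds the delay in milliseconds as a string
    return int(trailers.get(REQUEST_DELAY, 0)) + int(trailers.get(RESPONSE_DELAY, 0))

# Hypothetical trailers from a throttled scoring response
delay = throttling_delay_ms({REQUEST_DELAY: "120", RESPONSE_DELAY: "80"})
```

A nonzero total is a signal to check the bandwidth quota before tuning the model or the instance type.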
#### Metrics at deployment scope
-
-__Saturation__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| CpuUtilizationPercentage | Percent | Percentage of CPU utilization on an instance | Minimum, Maximum, Average | InstanceId | Alert me when % Capacity Used > 75% |
-| CpuMemoryUtilizationPercentage | Percent | Percentage of memory utilization on an instance | Minimum, Maximum, Average | InstanceId ||
-| DiskUtilization | Percent | Percentage of disk space utilized | Minimum, Maximum, Average | InstanceId, Disk ||
-| GpuUtilizationPercentage | Percent | Percentage of GPU utilization on an instance; utilization is reported at one-minute intervals | Minimum, Maximum, Average | InstanceId ||
-| GpuMemoryUtilizationPercentage | Percent | Percentage of GPU memory utilization on an instance; utilization is reported at one-minute intervals | Minimum, Maximum, Average | InstanceId ||
-| GpuEnergyJoules | Joule | Interval energy in joules on a GPU node; energy is reported at one-minute intervals | Minimum, Maximum, Average | InstanceId ||
-
-__Availability__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| DeploymentCapacity | Count | The number of instances in the deployment | Minimum, Maximum, Average | InstanceId, State | Alert me when the % Availability of my service drops below 100% |
-
-__Traffic__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| RequestsPerMinute | Count | The number of requests sent to the online deployment within a minute | Average | StatusCode | Alert me when I have <= 0 transactions in the system |
-| RequestLatency_P50 | Milliseconds | The average P50 request latency, aggregated over all request latency values collected in the selected time period | Average | - | Alert me when average latency > 2 sec |
-| RequestLatency_P90 | Milliseconds | The average P90 request latency, aggregated over all request latency values collected in the selected time period | Average | - | Alert me when average latency > 2 sec |
-| RequestLatency_P95 | Milliseconds | The average P95 request latency, aggregated over all request latency values collected in the selected time period | Average | - | Alert me when average latency > 2 sec |
-| RequestLatency_P99 | Milliseconds | The average P99 request latency, aggregated over all request latency values collected in the selected time period | Average | - | Alert me when average latency > 2 sec |
-
-__Model Data Collection__
-
-| Metric ID | Unit | Description | Aggregate Method | Splittable By | Example Metric Alerts |
-| ---- | --- | --- | --- | --- | --- |
-| DataCollectionEventsPerMinute | Count | The number of data collection events processed per minute | Average | InstanceId, Type | - |
-| DataCollectionErrorsPerMinute | Count | The number of data collection events dropped per minute | Average | InstanceId, Type, Reason | - |
-
For instance, you can compare CPU and/or memory utilization between different instances for an online deployment.
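To illustrate the "Alert me when % Capacity Used > 75%" example from the Saturation table: because `CpuUtilizationPercentage` is splittable by `InstanceId`, an alert on the Maximum aggregate fires as soon as any single instance crosses the threshold. A minimal sketch of that logic (the function and instance names are illustrative):

```python
def saturation_alert(cpu_pct_by_instance, threshold=75.0):
    """Return the instances whose CPU utilization exceeds the threshold."""
    # Splitting by InstanceId means each instance is evaluated on its own,
    # so one hot instance triggers the alert even if the average looks healthy.
    return sorted(
        instance
        for instance, pct in cpu_pct_by_instance.items()
        if pct > threshold
    )

# Hypothetical per-instance CPU readings for one deployment
hot = saturation_alert({"instance-0": 42.0, "instance-1": 91.5, "instance-2": 60.3})
```

Here the deployment average (about 65%) would not trip an Average-aggregate alert, which is why the Maximum aggregate split by `InstanceId` is the safer choice for saturation.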
articles/machine-learning/how-to-track-monitor-analyze-runs.md (2 additions & 17 deletions)
@@ -112,24 +112,9 @@ To cancel a job in the studio:
## Monitor job status by email notification
-1. In the [Azure portal](https://portal.azure.com), in the left navigation bar, select the **Monitor** tab.
+You can use diagnostic settings to trigger email notifications. To learn how to create diagnostic settings, see [Create diagnostic settings in Azure Monitor](/azure/azure-monitor/essentials/create-diagnostic-settings).
-
-1. Select **Diagnostic settings**, then choose **+ Add diagnostic setting**.
-
-   :::image type="content" source="media/how-to-track-monitor-analyze-runs/diagnostic-setting.png" alt-text="Screenshot of diagnostic settings for email notification.":::
-
-1. Under **Category details**, select **AmlRunStatusChangedEvent**. Under **Destination details**, select **Send to Log Analytics workspace** and specify the **Subscription** and **Log Analytics workspace**.
-
-   :::image type="content" source="media/how-to-track-monitor-analyze-runs/log-location.png" alt-text="Screenshot of where to save email notification.":::
-
-   > [!NOTE]
-   > The **Azure Log Analytics Workspace** is a different type of Azure resource than the **Azure Machine Learning service workspace**. If there are no options in that list, you can [create a Log Analytics workspace](../azure-monitor/logs/quick-create-workspace.md).
-
-1. In the **Logs** tab, select **New alert rule**.
-
-   :::image type="content" source="media/how-to-track-monitor-analyze-runs/new-alert-rule.png" alt-text="Screenshot of button to add new alert rule.":::
-
-1. To learn how to create and manage log alerts using Azure Monitor, see [Create or edit a log search alert rule](../azure-monitor/alerts/alerts-log.md).
+To learn how to create and manage log alerts using Azure Monitor, see [Create or edit a log search alert rule](/azure/azure-monitor/alerts/alerts-create-log-alert-rule).