Commit 4186362

Merge pull request #202803 from Blackmist/endpoint-logs
writing
2 parents 5764b96 + 4fa43a1 commit 4186362

File tree

8 files changed: +138 -31 lines changed


articles/machine-learning/how-to-deploy-managed-online-endpoints.md

Lines changed: 2 additions & 21 deletions
@@ -331,28 +331,9 @@ To view metrics and set alerts based on your SLA, complete the steps that are de
 ### (Optional) Integrate with Log Analytics
 
-The `get-logs` command provides only the last few hundred lines of logs from an automatically selected instance. However, Log Analytics provides a way to durably store and analyze logs.
+The `get-logs` command provides only the last few hundred lines of logs from an automatically selected instance. However, Log Analytics provides a way to durably store and analyze logs. For more information on using logging, see [Monitor online endpoints](how-to-monitor-online-endpoints.md#logs).
 
-First, create a Log Analytics workspace by completing the steps in [Create a Log Analytics workspace in the Azure portal](../azure-monitor/logs/quick-create-workspace.md#create-a-workspace).
-
-Then, in the Azure portal:
-
-1. Go to the resource group.
-1. Select your endpoint.
-1. Select the **ARM resource page**.
-1. Select **Diagnostic settings**.
-1. Select **Add settings**.
-1. Select to enable sending console logs to the Log Analytics workspace.
-
-The logs might take up to an hour to connect. After an hour, send some scoring requests, and then check the logs by using the following steps:
-
-1. Open the Log Analytics workspace.
-1. In the left menu, select **Logs**.
-1. Close the **Queries** dialog that automatically opens.
-1. Double-click **AmlOnlineEndpointConsoleLog**.
-1. Select **Run**.
-
-[!INCLUDE [Email Notification Include](../../includes/machine-learning-email-notifications.md)]
+[!INCLUDE [Email Notification Include](../../includes/machine-learning-email-notifications.md)]
 
 ## Delete the endpoint and the deployment
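To illustrate the "durably store and analyze" point, here is a minimal Kusto sketch that pulls recent console output once the logs flow into Log Analytics. The `AmlOnlineEndpointConsoleLog` table name follows the query steps in the monitoring article below; the deployment name `blue` is a placeholder:

```kusto
// Most recent console output from an online deployment (placeholder deployment name "blue").
AmlOnlineEndpointConsoleLog
| where TimeGenerated > ago(1h)
| where DeploymentName == "blue"
| project TimeGenerated, InstanceId, ContainerName, Message
| order by TimeGenerated desc
```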

articles/machine-learning/how-to-manage-quotas.md

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ Azure Machine Learning managed online endpoints have limits described in the fol
 <sup>3</sup> If you request a limit increase, be sure to calculate related limit increases you might need. For example, if you request a limit increase for requests per second, you might also want to compute the required connections and bandwidth limits and include these limit increases in the same request.
 
-To determine the current usage for an endpoint, [view the metrics](how-to-monitor-online-endpoints.md#view-metrics).
+To determine the current usage for an endpoint, [view the metrics](how-to-monitor-online-endpoints.md#metrics).
 
 To request an exception from the Azure Machine Learning product team, use the steps in the [Request quota increases](#request-quota-increases) section and provide the following information:

articles/machine-learning/how-to-monitor-online-endpoints.md

Lines changed: 134 additions & 8 deletions
@@ -7,7 +7,7 @@ ms.service: machine-learning
 ms.author: larryfr
 author: blackmist
 ms.subservice: mlops
-ms.date: 06/01/2022
+ms.date: 06/27/2022
 ms.topic: conceptual
 ms.custom: how-to, devplatv2, event-tier1-build-2022
 ---
@@ -28,7 +28,7 @@ In this article you learn how to:
 - Deploy an Azure Machine Learning online endpoint.
 - You must have at least [Reader access](../role-based-access-control/role-assignments-portal.md) on the endpoint.
 
-## View metrics
+## Metrics
 
 Use the following steps to view metrics for an online endpoint or deployment:
 1. Go to the [Azure portal](https://portal.azure.com).
@@ -38,11 +38,11 @@ Use the following steps to view metrics for an online endpoint or deployment:
 1. In the left-hand column, select **Metrics**.
 
-## Available metrics
+### Available metrics
 
 Depending on the resource that you select, the metrics that you see will be different. Metrics are scoped differently for online endpoints and online deployments.
 
-### Metrics at endpoint scope
+#### Metrics at endpoint scope
 
 - Request Latency
 - Request Latency P50 (Request latency at the 50th percentile)
@@ -59,13 +59,13 @@ Split on the following dimensions:
 - Status Code
 - Status Code Class
 
-#### Bandwidth throttling
+**Bandwidth throttling**
 
 Bandwidth will be throttled if the limits are exceeded for _managed_ online endpoints (see the managed online endpoints section in [Manage and increase quotas for resources with Azure Machine Learning](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints)). To determine if requests are throttled:
 - Monitor the "Network bytes" metric.
 - The response trailers will have the fields `ms-azureml-bandwidth-request-delay-ms` and `ms-azureml-bandwidth-response-delay-ms`. The values of these fields are the delays, in milliseconds, of the bandwidth throttling.
 
-### Metrics at deployment scope
+#### Metrics at deployment scope
 
 - CPU Utilization Percentage
 - Deployment Capacity (the number of instances of the requested instance type)
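Besides the response trailers, throttling delays also show up in the traffic log introduced later in this commit. A minimal Kusto sketch for spotting throttled requests, assuming the Log Analytics table surfaces as `AmlOnlineEndpointTrafficLog`:

```kusto
// Requests whose data transfer was delayed by bandwidth throttling (table name assumed).
AmlOnlineEndpointTrafficLog
| where TimeGenerated > ago(1d)
| where RequestThrottlingDelayMs > 0 or ResponseThrottlingDelayMs > 0
| project TimeGenerated, EndpointName, DeploymentName, RequestThrottlingDelayMs, ResponseThrottlingDelayMs
| order by TimeGenerated desc
```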
@@ -78,11 +78,11 @@ Split on the following dimension:
 - InstanceId
 
-## Create a dashboard
+### Create a dashboard
 
 You can create custom dashboards to visualize data from multiple sources in the Azure portal, including the metrics for your online endpoint. For more information, see [Create custom KPI dashboards using Application Insights](../azure-monitor/app/tutorial-app-dashboards.md#add-custom-metric-chart).
 
-## Create an alert
+### Create an alert
 
 You can also create custom alerts to notify you of important status updates to your online endpoint:
@@ -97,6 +97,132 @@ You can also create custom alerts to notify you of important status updates to y
 1. Select **Add action groups** > **Create action groups** to specify what should happen when your alert is triggered.
 
 1. Choose **Create alert rule** to finish creating your alert.
 
+## Logs
+
+There are three logs that can be enabled for online endpoints:
+
+* **AMLOnlineEndpointTrafficLog**: Enable traffic logs if you want to check the information of your requests. Some example cases (see the query sketches after this list):
+
+  * If the response isn't 200, check the value of the column "ResponseCodeReason" to see what happened. Also check the reason in the "HTTP status codes" section of the [Troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md#http-status-codes) article.
+
+  * You can check the response code and response reason of your model from the columns "ModelStatusCode" and "ModelStatusReason".
+
+  * You can check the duration of the request, such as the total duration, the request/response duration, and the delay caused by network throttling, to see the latency breakdown.
+
+  * You can also use this log to check how many requests, or how many failed requests, occurred recently.
+
+* **AMLOnlineEndpointConsoleLog**: Contains the logs that the containers write to the console. Some example cases:
+
+  * If the container fails to start, the console log may be useful for debugging.
+
+  * Monitor container behavior and make sure that all requests are correctly handled.
+
+  * Write request IDs to the console log. By joining the request ID, the AMLOnlineEndpointConsoleLog, and the AMLOnlineEndpointTrafficLog in the Log Analytics workspace, you can trace a request from the network entry point of an online endpoint to the container (see the join sketch after this list).
+
+  * You can also use this log for performance analysis to determine the time the model requires to process each request.
+
+* **AMLOnlineEndpointEventLog**: Contains event information regarding the container's life cycle. Currently, we provide information on the following types of events:
+
+  | Name | Message |
+  | ----- | ----- |
+  | BackOff | Back-off restarting failed container |
+  | Pulled | Container image "\<IMAGE\_NAME\>" already present on machine |
+  | Killing | Container inference-server failed liveness probe, will be restarted |
+  | Created | Created container image-fetcher |
+  | Created | Created container inference-server |
+  | Created | Created container model-mount |
+  | Unhealthy | Liveness probe failed: \<FAILURE\_CONTENT\> |
+  | Unhealthy | Readiness probe failed: \<FAILURE\_CONTENT\> |
+  | Started | Started container image-fetcher |
+  | Started | Started container inference-server |
+  | Started | Started container model-mount |
+  | Killing | Stopping container inference-server |
+  | Killing | Stopping container model-mount |
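For illustration, here are minimal Kusto sketches of the traffic-log checks described above, assuming the categories surface in Log Analytics as `AmlOnlineEndpointTrafficLog` and `AmlOnlineEndpointConsoleLog` (the table-name casing is an assumption):

```kusto
// Recent requests that didn't return 200, with the model-side status for comparison.
AmlOnlineEndpointTrafficLog
| where TimeGenerated > ago(1h)
| where toint(ResponseCode) != 200
| project TimeGenerated, EndpointName, DeploymentName, ResponseCode, ResponseCodeReason, ModelStatusCode, ModelStatusReason
| order by TimeGenerated desc
```

And a sketch of the request-ID join mentioned for the console log; it assumes your scoring script echoes the request ID (a GUID) into its console output, so the ID is parsed out of `Message` (the extraction pattern is hypothetical):

```kusto
// Trace a request from the endpoint's network entry point into the container logs.
AmlOnlineEndpointTrafficLog
| where TimeGenerated > ago(1h)
| join kind=inner (
    AmlOnlineEndpointConsoleLog
    | where TimeGenerated > ago(1h)
    // Hypothetical: assumes the scoring script writes the request ID into its console output.
    | extend XRequestId = extract(@"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", 0, Message)
  ) on XRequestId
| project TimeGenerated, XRequestId, ResponseCode, ModelStatusCode, Message
```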
+### How to enable/disable logs
+
+> [!IMPORTANT]
+> Logging uses Azure Log Analytics. If you do not currently have a Log Analytics workspace, you can create one using the steps in [Create a Log Analytics workspace in the Azure portal](../azure-monitor/logs/quick-create-workspace.md#create-a-workspace).
+
+1. In the [Azure portal](https://portal.azure.com), go to the resource group that contains your endpoint and then select the endpoint.
+1. From the **Monitoring** section on the left of the page, select **Diagnostic settings** and then **Add settings**.
+1. Select the log categories to enable, select **Send to Log Analytics workspace**, and then select the Log Analytics workspace to use. Finally, enter a **Diagnostic setting name** and select **Save**.
+
+    :::image type="content" source="./media/how-to-monitor-online-endpoints/diagnostic-settings.png" alt-text="Screenshot of the diagnostic settings dialog.":::
+
+    > [!IMPORTANT]
+    > It may take up to an hour for the connection to the Log Analytics workspace to be enabled. Wait an hour before continuing with the next steps.
+
+1. Submit scoring requests to the endpoint. This activity should create entries in the logs.
+1. From either the online endpoint properties or the Log Analytics workspace, select **Logs** from the left of the screen.
+1. Close the **Queries** dialog that automatically opens, and then double-click **AmlOnlineEndpointConsoleLog**. If you don't see it, use the **Search** field.
+
+    :::image type="content" source="./media/how-to-monitor-online-endpoints/online-endpoints-log-queries.png" alt-text="Screenshot showing the log queries.":::
+
+1. Select **Run**.
+
+    :::image type="content" source="./media/how-to-monitor-online-endpoints/query-results.png" alt-text="Screenshot of the results after running a query.":::
+### Example queries
+
+You can find example queries on the __Queries__ tab while viewing logs. Search for __Online endpoint__ to find them.
+
+:::image type="content" source="./media/how-to-monitor-online-endpoints/example-queries.png" alt-text="Screenshot of the example queries.":::
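In the same spirit as the built-in examples, here is a minimal Kusto sketch that counts recent requests and failures per deployment (the `AmlOnlineEndpointTrafficLog` table name is assumed from the category above):

```kusto
// Hourly request and failure counts per deployment over the last day.
AmlOnlineEndpointTrafficLog
| where TimeGenerated > ago(1d)
| summarize TotalRequests = count(), FailedRequests = countif(toint(ResponseCode) >= 400)
    by EndpointName, DeploymentName, bin(TimeGenerated, 1h)
| order by TimeGenerated asc
```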
+### Log column details
+
+The following tables provide details on the data stored in each log:
+
+**AMLOnlineEndpointTrafficLog**
+
+| Field name | Description |
+| ---- | ---- |
+| Method | The requested method from the client. |
+| Path | The requested path from the client. |
+| SubscriptionId | The machine learning subscription ID of the online endpoint. |
+| WorkspaceId | The machine learning workspace ID of the online endpoint. |
+| EndpointName | The name of the online endpoint. |
+| DeploymentName | The name of the online deployment. |
+| Protocol | The protocol of the request. |
+| ResponseCode | The final response code returned to the client. |
+| ResponseCodeReason | The final response code reason returned to the client. |
+| ModelStatusCode | The response status code from the model. |
+| ModelStatusReason | The response status reason from the model. |
+| RequestPayloadSize | The total bytes received from the client. |
+| ResponsePayloadSize | The total bytes sent back to the client. |
+| UserAgent | The user-agent header of the request. |
+| XRequestId | The request ID generated by Azure Machine Learning for internal tracing. |
+| XMSClientRequestId | The tracking ID generated by the client. |
+| TotalDurationMs | Duration in milliseconds from the request start time to the last response byte sent back to the client. If the client disconnected, it's measured from the start time to the client disconnect time. |
+| RequestDurationMs | Duration in milliseconds from the request start time to the last byte of the request received from the client. |
+| ResponseDurationMs | Duration in milliseconds from the request start time to the first response byte read from the model. |
+| RequestThrottlingDelayMs | Delay in milliseconds in request data transfer due to network throttling. |
+| ResponseThrottlingDelayMs | Delay in milliseconds in response data transfer due to network throttling. |
**AMLOnlineEndpointConsoleLog**
204+
205+
| Field Name | Description |
206+
| ----- | ----- |
207+
| TimeGenerated | The timestamp (UTC) of when the log was generated.
208+
| OperationName | The operation associated with log record.
209+
| InstanceId | The ID of the instance that generated this log record.
210+
| DeploymentName | The name of the deployment associated with the log record.
211+
| ContainerName | The name of the container where the log was generated.
212+
| Message | The content of the log.
213+
214+
**AMLOnlineEndpointEventLog**
215+
216+
217+
| Field Name | Description |
218+
| ----- | ----- |
219+
| TimeGenerated | The timestamp (UTC) of when the log was generated.
220+
| OperationName | The operation associated with log record.
221+
| InstanceId | The ID of the instance that generated this log record.
222+
| DeploymentName | The name of the deployment associated with the log record.
223+
| Name | The name of the event.
224+
| Message | The content of the event.
225+
100226

101227

102228
## Next steps

articles/machine-learning/how-to-safely-rollout-managed-endpoints.md

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ If you want to use a REST client to invoke the deployment directly without going
 Once you've tested your `green` deployment, you can copy (or 'mirror') a percentage of the live traffic to it. Mirroring traffic doesn't change the results returned to clients; requests still flow 100% to the `blue` deployment. The mirrored percentage of the traffic is copied and submitted to the `green` deployment so you can gather metrics and logging without impacting your clients. Mirroring is useful when you want to validate a new deployment without impacting clients, for example, to check that latency is within acceptable bounds and that there are no HTTP errors.
 
 > [!WARNING]
-> Mirroring traffic uses your [endpoint bandwidth quota](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints) (default 5 MBPS). Your endpoint bandwidth will be throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see [Monitor managed online endpoints](how-to-monitor-online-endpoints.md#bandwidth-throttling).
+> Mirroring traffic uses your [endpoint bandwidth quota](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints) (default 5 MBPS). Your endpoint bandwidth will be throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see [Monitor managed online endpoints](how-to-monitor-online-endpoints.md#metrics-at-endpoint-scope).
 
 The following command mirrors 10% of the traffic to the `green` deployment:

4 binary image files changed (screenshots): 59.8 KB, 30.5 KB, 89.3 KB, 152 KB

0 commit comments