Commit f2b82a0

Merge pull request #259896 from jiaochenlu/update-arc
update amlarc doc
2 parents ca372b6 + 065eaec commit f2b82a0

2 files changed: +28 -21 lines changed

articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md

Lines changed: 26 additions & 21 deletions
@@ -14,15 +14,15 @@ ms.custom: build-spring-2022, cliv2, sdkv2, event-tier1-build-2022
1414

1515
# Troubleshoot Kubernetes Compute
1616

17-
In this article, you'll learn how to troubleshoot common workload (including training jobs and endpoints) errors on the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
17+
In this article, you learn how to troubleshoot common workload (including training jobs and endpoints) errors on the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
1818

1919
## Inference guide
2020

2121
The common Kubernetes endpoint errors on Kubernetes compute are categorized into two scopes: **compute scope** and **cluster scope**. The compute scope errors are related to the compute target, such as the compute target is not found, or the compute target is not accessible. The cluster scope errors are related to the underlying Kubernetes cluster, such as the cluster itself is not reachable, or the cluster is not found.
2222

2323
### Kubernetes compute errors
2424

25-
Below is a list of error types in **compute scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can trouble shoot by following the guidelines:
25+
The following are common error types in **compute scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference. You can troubleshoot them by following the guidelines:
2626

2727

2828
* [ERROR: GenericComputeError](#error-genericcomputeerror)
@@ -33,7 +33,7 @@ Below is a list of error types in **compute scope** that you might encounter whe
3333

3434

3535
#### ERROR: GenericComputeError
36-
The error message is as below:
36+
The error message is as follows:
3737

3838
```bash
3939
Failed to get compute information.
@@ -44,7 +44,7 @@ This error should occur when system failed to get the compute information from t
4444
* Check the Kubernetes cluster health.
4545
* You can view the cluster health check report for any issues, for example, if the cluster is not reachable.
4646
* You can go to your workspace portal to check the compute status.
47-
* Check if the instance types is information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
47+
* Check if the instance type information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
4848
* Try to detach and reattach the compute to the workspace if applicable.
4949

5050
> [!NOTE]
@@ -83,7 +83,7 @@ The error message is as follows:
8383
```bash
8484
The compute information is invalid.
8585
```
86-
There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster.
86+
There is a compute target validation process when deploying models to your Kubernetes cluster. This error occurs when the compute information fails validation. For example, the compute target is not found, or the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.
8787

8888
You can check the following items to troubleshoot the issue:
8989
* Check whether the compute target you used is correct and existing in your workspace.
@@ -125,7 +125,7 @@ For AKS clusters:
125125
* Check if the AKS cluster is shut down.
126126
* If the cluster isn't running, you need to start the cluster first.
127127
* Check if the AKS cluster has enabled selected network by using authorized IP ranges.
128-
* If the AKS cluster has enabled authorized IP ranges, please make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. More information you can see this [document](how-to-deploy-kubernetes-extension.md#limitations).
128+
    * If the AKS cluster has enabled authorized IP ranges, make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. For more information, see this [document](how-to-deploy-kubernetes-extension.md#limitations).
129129

130130

131131
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
@@ -170,7 +170,7 @@ You can check the following items to troubleshoot the issue:
170170

171171
#### ERROR: RefreshExtensionIdentityNotSet
172172

173-
This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to re-install the extension to fix it.
173+
This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to reinstall the extension to fix it.
174174

175175
> Note that this error applies only to managed clusters.
176176
@@ -179,41 +179,46 @@ This error occurs when the extension is installed but the extension identity is
179179

180180

181181
### How to check sslCertPemFile and sslKeyPemFile is correct?
182-
Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
182+
To surface any known errors, use the following commands to run a baseline check for your cert and key. Expect the second command to return "RSA key ok" without prompting you for a password.
183183

184184
```bash
185185
openssl x509 -in cert.pem -noout -text
186186
openssl rsa -in key.pem -noout -check
187187
```
188188

189-
Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
189+
Run the following commands to verify that sslCertPemFile and sslKeyPemFile match:
190190

191191
```bash
192192
openssl x509 -in cert.pem -noout -modulus | md5sum
193193
openssl rsa -in key.pem -noout -modulus | md5sum
194194
```
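
The two digests are equal only when the certificate and the key share the same RSA modulus. As a minimal sketch, assuming an RSA key pair and an unencrypted key file, the comparison can be wrapped in a small helper. The function name `cert_key_match` is hypothetical and is not part of any Azure Machine Learning tooling; it compares the moduli directly instead of their MD5 digests:

```bash
# Hypothetical helper: succeeds only if the certificate and the RSA private key
# share the same modulus. Assumes an RSA key pair and an unencrypted key file.
# Returns 0 on match, 1 on mismatch, 2 if either file cannot be read by openssl.
cert_key_match() {
  local cert_file="$1" key_file="$2" cert_mod key_mod
  # Capture each modulus separately so an openssl failure is not masked by a pipe.
  cert_mod=$(openssl x509 -in "$cert_file" -noout -modulus 2>/dev/null) || return 2
  key_mod=$(openssl rsa -in "$key_file" -noout -modulus 2>/dev/null) || return 2
  [ -n "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}
```

For example, `cert_key_match cert.pem key.pem && echo "cert and key match"` prints the message only when the pair belongs together.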
195195

196+
sslCertPemFile is the public certificate. It should include the full certificate chain, in this order: the server certificate, the intermediate CA certificate, and the root CA certificate:
197+
* The server certificate: the certificate that the server presents to the client during the TLS handshake. It contains the server’s public key, domain name, and other information. The server certificate is signed by an intermediate certificate authority (CA) that vouches for the server’s identity.
198+
* The intermediate CA certificate: the certificate that proves the intermediate CA’s authority to sign the server certificate. It contains the intermediate CA’s public key, name, and other information. The intermediate CA certificate is signed by a root CA that vouches for the intermediate CA’s identity.
199+
* The root CA certificate: the certificate that proves the root CA’s authority to sign the intermediate CA certificate. It contains the root CA’s public key, name, and other information. The root CA certificate is self-signed and trusted by the client.
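
One way to inspect the order of the chain is to print the subject and issuer of every certificate in the PEM file. The following is a sketch only: the function name `list_chain` is hypothetical, and it assumes the bundle is a PEM file that `openssl` can read. In a correctly ordered bundle, the issuer of each certificate matches the subject of the certificate that follows it, and the last (root) certificate has the same subject and issuer.

```bash
# Hypothetical helper: print the subject and issuer of every certificate in a
# PEM bundle so that the sequence of the chain can be inspected by eye.
list_chain() {
  openssl crl2pkcs7 -nocrl -certfile "$1" | openssl pkcs7 -print_certs -noout
}
```

For example, `list_chain cert.pem` prints one `subject=` line and one `issuer=` line per certificate.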
200+
196201

197202
## Training guide
198203

199-
When the training job is running, you can check the job status in the workspace portal. When you encounter some abnormal job status, such as the job retried multiple times, or the job has been stuck in initializing state, or even the job has eventually failed, you can follow the guide below to troubleshoot the issue.
204+
When the training job is running, you can check the job status in the workspace portal. If you encounter an abnormal job status, such as a job that retried multiple times, is stuck in the initializing state, or eventually failed, follow this guide to troubleshoot the issue.
200205

201206
### Job retry debugging
202207

203-
If the training job pod running in the cluster was terminated due to the node running to node OOM (out of memory), the job will be **automatically retried** to another available node.
208+
If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is **automatically retried** on another available node.
204209

205210
To further debug the root cause of the job retry, you can go to the workspace portal to check the job retry log.
206211

207-
* Each retry log will be recorded in a new log folder with the format of "retry-<retry number\>"(such as: retry-001).
212+
* Each retry log is recorded in a new log folder with the format of "retry-<retry number\>" (such as retry-001).
208213

209-
Then you can get the retry job-node mapping information as mentioned above, to figure out which node the retry-job has been running on.
214+
Then you can get the retry job-node mapping information to figure out which node the retry job has been running on.
210215

211216
:::image type="content" source="media/how-to-troubleshoot-kubernetes-compute/job-retry-log.png" alt-text="Screenshot of adding a new extension to the Azure Arc-enabled Kubernetes cluster from the Azure portal.":::
212217

213218
You can get job-node mapping information from the
214219
**amlarc_cr_bootstrap.log** under system_logs folder.
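
If you download the log, the node name can also be pulled out on the command line. The following is a minimal sketch: the function name `node_from_bootstrap_log` is hypothetical, and it assumes the log records the node on a line containing `Run on node: <host name>`, as in the sample in this article.

```bash
# Hypothetical helper: print the node name from the last "Run on node:" entry
# of a downloaded amlarc_cr_bootstrap.log. Assumes host names consist only of
# letters, digits, dots, underscores, and hyphens.
node_from_bootstrap_log() {
  grep -o 'Run on node: [A-Za-z0-9._-]*' "$1" | tail -n 1 | sed 's/^Run on node: //'
}
```

For example, `node_from_bootstrap_log system_logs/amlarc_cr_bootstrap.log` prints only the host name, which you can then compare across retry folders.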
215220

216-
The host name of the node which the job pod is running on will be indicated in this log, for example:
221+
The host name of the node on which the job pod is running is indicated in this log, for example:
217222

218223
```bash
219224
++ echo 'Run on node: ask-agentpool-17631869-vmss0000"
@@ -257,7 +262,7 @@ If the error message is:
257262
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
258263
```
259264

260-
Please check whether you have `enableTraining=True` set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
265+
Check whether you set `enableTraining=True` when installing the Azure Machine Learning extension. For more details, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
261266

262267
### Job failed. 400
263268

@@ -274,13 +279,13 @@ If you need to access Azure Container Registry (ACR) for Docker image, and to ac
274279

275280
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.
276281

277-
In the above training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. So without this identity, the training job will fail and report missing account key or sas token. Take accessing storage account for example, if you don't specify a managed identity to your Kubernetes compute, the job fails with the following error message:
282+
In the above training scenario, the Kubernetes compute uses this **computing identity** as a credential for communication between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. For example, when accessing a storage account, if you don't specify a managed identity for your Kubernetes compute, the job fails with the following error message:
278283
279284
```bash
280285
Unable to mount data store workspaceblobstore. Give either an account key or SAS token
281286
```
282287
283-
This is because machine learning workspace default storage account without any credentials is not accessible for training jobs in Kubernetes compute.
288+
The cause is that the machine learning workspace's default storage account is not accessible to training jobs in Kubernetes compute without credentials.
284289
285290
To mitigate this issue, you can assign Managed Identity to the compute in compute attach step, or you can assign Managed Identity to the compute after it has been attached. More details could be found at [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
286291
@@ -292,13 +297,13 @@ If you need to access the AzureBlob for data upload or download in your training
292297
Unable to upload project files to working directory in AzureBlob because the authorization failed.
293298
```
294299
295-
This is because the authorization failed when the job tries to upload the project files to the AzureBlob. You can check the following items to troubleshoot the issue:
300+
The cause is that authorization failed when the job tried to upload the project files to the AzureBlob. You can check the following items to troubleshoot the issue:
296301
* Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list.
297302
* Make sure the workspace has a system assigned managed identity.
298303
299304
## Private link issue
300305
301-
We could use the method below to check private link setup by logging into one pod in the Kubernetes cluster and then check related network settings.
306+
You can check the private link setup by logging into a pod in the Kubernetes cluster and then checking the related network settings.
302307
303308
* Find workspace ID in Azure portal or get this ID by running `az ml workspace show` in the command line.
304309
* Show all azureml-fe pods run by `kubectl get po -n azureml -l azuremlappname=azureml-fe`.
@@ -311,13 +316,13 @@ If you set up private link from VNet to workspace correctly, then the internal I
311316
curl https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/post -X POST -x {proxy_address} -d {} -v -k
312317
```
313318

314-
If the proxy and workspace with private link is configured correctly, you can see it's trying to connect to an internal IP. This will return a response with http 401, which is expected when you don't provide token.
319+
When the proxy and workspace are correctly set up with a private link, you should observe an attempt to connect to an internal IP. A response with an HTTP 401 status code is expected in this scenario if a token is not provided.
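
To make the outcome of the probe easier to read, the HTTP status can be captured with curl's `-w '%{http_code}'` option and mapped to a short verdict. This is only a sketch: the function name `interpret_probe_status` is hypothetical, and the `url` and `proxy` variables in the usage comment are placeholders for the values used in the curl command above.

```bash
# Hypothetical helper: map the HTTP status of the probe to a short verdict.
# curl reports 000 when no HTTP response was received at all.
interpret_probe_status() {
  case "$1" in
    401) echo "reachable: 401 is expected without a token" ;;
    000) echo "no HTTP response received" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage sketch (url and proxy are placeholders that you supply):
# status=$(curl -s -o /dev/null -w '%{http_code}' -X POST -x "$proxy" -d '{}' -k "$url")
# interpret_probe_status "$status"
```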
315320

316321
## Other known issues
317322

318323
### Kubernetes compute update does not take effect
319324

320-
At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace will not take effect.
325+
At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace does not take effect.
321326

322327
### Workspace or resource group name end with '-'
323328

articles/machine-learning/reference-kubernetes.md

Lines changed: 2 additions & 0 deletions
@@ -397,6 +397,8 @@ More information about how to use ARM template can be found from [ARM template d
397397

398398
| Date | Version |Version description |
399399
|---|---|---|
400+
|Nov 21, 2023 | 1.1.39| Fixed vulnerabilities. Refined error message. Increased stability for relayserver API. |
401+
|Nov 1, 2023 | 1.1.37| Update data plane envoy version. |
400402
|Oct 11, 2023 | 1.1.35| Fix vulnerable image. Bug fixes. |
401403
|Aug 25, 2023 | 1.1.34| Fix vulnerable image. Return more detailed identity error. Bug fixes. |
402404
|July 18, 2023 | 1.1.29| Add new identity operator errors. Bug fixes. |
