articles/machine-learning/how-to-deploy-kubernetes-extension.md
Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ In this article, you can learn:
46
46
- Azure Machine Learning does not guarantee support for all preview stage features in AKS. For example, [Azure AD pod identity](../aks/use-azure-ad-pod-identity.md) is not supported.
47
47
- If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
48
48
- We currently don't support attaching your AKS cluster across subscription, which means that your AKS cluster must be in the same subscription as your workspace.
49
-
- The workaround to meet your cross-subscription needs is to first connect AKS to Azure-ARC and then attach this ARC-Kubernetes resource.
49
+
- The workaround to meet your cross-subscription requirement is to first connect AKS to Azure-ARC and then attach this ARC-Kubernetes resource.
articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md
Lines changed: 4 additions & 4 deletions
@@ -99,7 +99,7 @@ The compute information is invalid.
99
99
There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster.
100
100
101
101
You can check the following items to troubleshoot the issue:
102
-
* Check whether the compute target you used is correct and exsiting in your workspace.
102
+
* Check whether the compute target you used is correct and existing in your workspace.
103
103
* Try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
@@ -112,7 +112,7 @@ The compute kubeconfig is invalid.
112
112
113
113
This error should occur when the system failed to find any configuration to connect to cluster, such as:
114
114
* For Arc-Kubernetes cluster, there is no Azure Relay configuration can be found.
115
-
* For AKS cluster, there is no AKS configuraiton can be found.
115
+
* For AKS cluster, there is no AKS configuration can be found.
116
116
117
117
To rebuild the configuration of compute connection in your cluster, you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
118
118
@@ -173,11 +173,11 @@ Cannot found Kubernetes cluster.
173
173
This error should occur when the system cannot find the AKS/Arc-Kubernetes cluster.
174
174
175
175
You can check the following items to troubleshoot the issue:
176
-
* First, check the cluster resource ID in the Azure Portal to verify whether Kubernetes cluster resource still exist and is running normally.
176
+
* First, check the cluster resource ID in the Azure Portal to verify whether Kubernetes cluster resource still exists and is running normally.
177
177
* If the cluster exists and is running, then you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
178
178
179
179
> [!TIP]
180
-
> More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](#how-to-troubleshoot-online-endpoints.md).
180
+
> More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
articles/machine-learning/how-to-troubleshoot-online-endpoints.md
Lines changed: 6 additions & 6 deletions
@@ -485,32 +485,32 @@ Below is a list of reasons you might run into this error when creating/updating
485
485
486
486
### ERROR: EndpointNotFound
487
487
488
-
The reason you might run into this error when creating/updating a Kubernetes online deployments is because the system can't find the endpoint resource for the deployment in the cluster. You should create the deployment in a exist endpoint or create this endpoint first in your cluster.
488
+
The reason you might run into this error when creating/updating Kubernetes online deployments is because the system can't find the endpoint resource for the deployment in the cluster. You should create the deployment in an exist endpoint or create this endpoint first in your cluster.
489
489
490
490
### ERROR: ValidateScoringFailed
491
491
492
-
The reason you might run into this error when creating/updating a Kubernetes online deployments is because the scoring request URL validation failed when processing the model deploying.
492
+
The reason you might run into this error when creating/updating Kubernetes online deployments is because the scoring request URL validation failed when processing the model deploying.
493
493
494
494
In this case, you can first check the endpoint URL and then try to re-deploy the deployment.
495
495
496
496
### ERROR: InvalidDeploymentSpec
497
497
498
-
The reason you might run into this error when creating/updating a Kubernetes online deployments is because the deployment spec is invalid.
498
+
The reason you might run into this error when creating/updating Kubernetes online deployments is because the deployment spec is invalid.
499
499
500
500
In this case, you can check the error message.
501
501
* Make sure the `instance count` is valid.
502
502
* If you have enabled auto scaling, make sure the `minimum instance count` and `maximum instance count` are both valid.
503
503
504
504
### ERROR: ImagePullLoopBackOff
505
505
506
-
The reason you might run into this error when creating/updating a Kubernetes online deployments is because the images can't be downloaded from the container registry, resulting in the images pull failure. <message>
506
+
The reason you might run into this error when creating/updating Kubernetes online deployments is because the images can't be downloaded from the container registry, resulting in the images pull failure.
507
507
508
508
In this case, you can check the cluster network policy and the workspace container registry if cluster can pull image from the container registry.
509
509
510
510
### ERROR: KubernetesCrashLoopBackOff
511
511
512
512
Below is a list of reasons you might run into this error when creating/updating the Kubernetes online endpoints/deployments:
513
-
* One or more pod(s) stuck in CrashLoopBackoff status, you can check if the deployment log exist, and check if there are error messgaes in the log.
513
+
* One or more pod(s) stuck in CrashLoopBackoff status, you can check if the deployment log exists, and check if there are error messages in the log.
514
514
* There is an error in `score.py` and the container crashed when init your score code, please following [ERROR: ResourceNotReady](#error-resourcenotready) part.
515
515
* Your scoring process needs more memory that your deployment config limit is insufficient, you can try to update the deployment with a larger memory limit.
516
516
@@ -523,7 +523,7 @@ Below is a list of reasons you might run into this error when creating/updating
523
523
To mitigate this error, refer to the following steps:
524
524
* Check the `node selector` definition of the `instance type` you used, and `node label` configuration of your cluster nodes.
525
525
* Check `instance type` and the node SKU size for AKS cluster or the node resource for Arc-Kubernetes cluster.
526
-
* If the cluster is under-resourced, you can reduce the instance type resource requirement or use the anohter instance type with smaller resource required.
526
+
* If the cluster is under-resourced, you can reduce the instance type resource requirement or use the another instance type with smaller resource required.
527
527
* If the cluster has no more resource to meet the requirement of the deployment, delete some deployment to release resources.
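The resource checks in the list above can be sketched as a small capacity comparison. This is only an illustration, not part of the changed documentation or any Azure Machine Learning API: the `Resources` class, the `fits` and `nodes_that_fit` helpers, and the sample numbers are all hypothetical. In practice the free capacity per node would have to be read from the cluster itself (for example, from the output of `kubectl describe node`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resources:
    """Hypothetical container for a CPU/memory amount (not an Azure ML type)."""
    cpu_cores: float
    memory_gib: float


def fits(request: Resources, free: Resources) -> bool:
    """Return True if the requested resources fit into the free capacity."""
    return (
        request.cpu_cores <= free.cpu_cores
        and request.memory_gib <= free.memory_gib
    )


def nodes_that_fit(request: Resources, free_per_node: List[Resources]) -> List[int]:
    """Return the indexes of the nodes that can host one instance of the request."""
    return [i for i, free in enumerate(free_per_node) if fits(request, free)]


# Assumed sample values: what an instance type requests, and what each node has free.
instance_type_request = Resources(cpu_cores=2, memory_gib=8)
free_capacity = [
    Resources(cpu_cores=1, memory_gib=16),  # not enough CPU
    Resources(cpu_cores=4, memory_gib=8),   # fits
]

candidates = nodes_that_fit(instance_type_request, free_capacity)
if not candidates:
    print("Cluster is under-resourced for this instance type.")
else:
    print(f"Nodes that can host the deployment: {candidates}")
```

If the list of candidate nodes comes back empty, the options are the ones the bullets describe: choose an instance type with a smaller resource requirement, or delete some deployments to release resources.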
articles/machine-learning/reference-kubernetes.md
Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Some logs about AzureML workloads in the cluster, such as status, metrics, life
147
147
|aml-operator | Manage the lifecycle of training jobs. |The logs contain AML training job pod status in the cluster.|
148
148
|azureml-fe-v2| The front-end component that routes incoming inference requests to deployed services. |Access logs at request level, including request Id, start time, response code, error details and durations for request latency. Trace logs for service metadata changes, service running healthy status, etc. for debugging purpose.|
149
149
| gateway | The gateway is used to communicate and send data back and forth. | Trace logs on requests from AML services to the clusters.|
150
-
|healthcheck |--| The logs contain azureml namespace resource (AML extension) status to diagnostic what make the extension not functional. |
150
+
|healthcheck |--| The logs contain azureml namespace resource (AML extension) status to diagnose what make the extension not functional. |
151
151
|inference-operator-controller-manager| Manage the lifecycle of inference endpoints. |The logs contain AML inference endpoint and deployment pod status in the cluster.|
152
152
| metrics-controller-manager | Manage the configuration for Prometheus.|Trace logs for status of uploading training job and inference deployment metrics on CPU utilization and memory utilization.|
153
153
| relayserver | relayserver is only needed in arc-connected cluster and will not be installed in AKS cluster.| Relayserver works with Azure Relay to communicate with the cloud services. The logs contain request level info from Azure relay. |