Commit fce59e0

Update k8s compute TSG and log info
1 parent f61488b commit fce59e0

4 files changed: 12 additions and 12 deletions

articles/machine-learning/how-to-deploy-kubernetes-extension.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ In this article, you can learn:
  - Azure Machine Learning does not guarantee support for all preview stage features in AKS. For example, [Azure AD pod identity](../aks/use-azure-ad-pod-identity.md) is not supported.
  - If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
  - We currently don't support attaching your AKS cluster across subscription, which means that your AKS cluster must be in the same subscription as your workspace.
- - The workaround to meet your cross-subscription needs is to first connect AKS to Azure-ARC and then attach this ARC-Kubernetes resource.
+ - The workaround to meet your cross-subscription requirement is to first connect AKS to Azure-ARC and then attach this ARC-Kubernetes resource.

  ## Review AzureML extension configuration settings

articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md

Lines changed: 4 additions & 4 deletions
@@ -99,7 +99,7 @@ The compute information is invalid.
  There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster.

  You can check the following items to troubleshoot the issue:
- * Check whether the compute target you used is correct and exsiting in your workspace.
+ * Check whether the compute target you used is correct and existing in your workspace.
  * Try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).

  #### ERROR: InvalidComputeNoKubernetesConfiguration
@@ -112,7 +112,7 @@ The compute kubeconfig is invalid.

  This error should occur when the system failed to find any configuration to connect to cluster, such as:
  * For Arc-Kubernetes cluster, there is no Azure Relay configuration can be found.
- * For AKS cluster, there is no AKS configuraiton can be found.
+ * For AKS cluster, there is no AKS configuration can be found.

  To rebuild the configuration of compute connection in your cluster, you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).

@@ -173,11 +173,11 @@ Cannot found Kubernetes cluster.
  This error should occur when the system cannot find the AKS/Arc-Kubernetes cluster.

  You can check the following items to troubleshoot the issue:
- * First, check the cluster resource ID in the Azure Portal to verify whether Kubernetes cluster resource still exist and is running normally.
+ * First, check the cluster resource ID in the Azure Portal to verify whether Kubernetes cluster resource still exists and is running normally.
  * If the cluster exists and is running, then you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).

  > [!TIP]
- > More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](#how-to-troubleshoot-online-endpoints.md).
+ > More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).


  ## Training guide

articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 6 additions & 6 deletions
@@ -485,32 +485,32 @@ Below is a list of reasons you might run into this error when creating/updating

  ### ERROR: EndpointNotFound

- The reason you might run into this error when creating/updating a Kubernetes online deployments is because the system can't find the endpoint resource for the deployment in the cluster. You should create the deployment in a exist endpoint or create this endpoint first in your cluster.
+ The reason you might run into this error when creating/updating Kubernetes online deployments is because the system can't find the endpoint resource for the deployment in the cluster. You should create the deployment in an exist endpoint or create this endpoint first in your cluster.

  ### ERROR: ValidateScoringFailed

- The reason you might run into this error when creating/updating a Kubernetes online deployments is because the scoring request URL validation failed when processing the model deploying.
+ The reason you might run into this error when creating/updating Kubernetes online deployments is because the scoring request URL validation failed when processing the model deploying.

  In this case, you can first check the endpoint URL and then try to re-deploy the deployment.

  ### ERROR: InvalidDeploymentSpec

- The reason you might run into this error when creating/updating a Kubernetes online deployments is because the deployment spec is invalid.
+ The reason you might run into this error when creating/updating Kubernetes online deployments is because the deployment spec is invalid.

  In this case, you can check the error message.
  * Make sure the `instance count` is valid.
  * If you have enabled auto scaling, make sure the `minimum instance count` and `maximum instance count` are both valid.

  ### ERROR: ImagePullLoopBackOff

- The reason you might run into this error when creating/updating a Kubernetes online deployments is because the images can't be downloaded from the container registry, resulting in the images pull failure. <message>
+ The reason you might run into this error when creating/updating Kubernetes online deployments is because the images can't be downloaded from the container registry, resulting in the images pull failure.

  In this case, you can check the cluster network policy and the workspace container registry if cluster can pull image from the container registry.

  ### ERROR: KubernetesCrashLoopBackOff

  Below is a list of reasons you might run into this error when creating/updating the Kubernetes online endpoints/deployments:
- * One or more pod(s) stuck in CrashLoopBackoff status, you can check if the deployment log exist, and check if there are error messgaes in the log.
+ * One or more pod(s) stuck in CrashLoopBackoff status, you can check if the deployment log exists, and check if there are error messages in the log.
  * There is an error in `score.py` and the container crashed when init your score code, please following [ERROR: ResourceNotReady](#error-resourcenotready) part.
  * Your scoring process needs more memory that your deployment config limit is insufficient, you can try to update the deployment with a larger memory limit.

@@ -523,7 +523,7 @@ Below is a list of reasons you might run into this error when creating/updating
  To mitigate this error, refer to the following steps:
  * Check the `node selector` definition of the `instance type` you used, and `node label` configuration of your cluster nodes.
  * Check `instance type` and the node SKU size for AKS cluster or the node resource for Arc-Kubernetes cluster.
- * If the cluster is under-resourced, you can reduce the instance type resource requirement or use the anohter instance type with smaller resource required.
+ * If the cluster is under-resourced, you can reduce the instance type resource requirement or use the another instance type with smaller resource required.
  * If the cluster has no more resource to meet the requirement of the deployment, delete some deployment to release resources.

articles/machine-learning/reference-kubernetes.md

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Some logs about AzureML workloads in the cluster, such as status, metrics, life
  |aml-operator | Manage the lifecycle of training jobs. |The logs contain AML training job pod status in the cluster.|
  |azureml-fe-v2| The front-end component that routes incoming inference requests to deployed services. |Access logs at request level, including request Id, start time, response code, error details and durations for request latency. Trace logs for service metadata changes, service running healthy status, etc. for debugging purpose.|
  | gateway | The gateway is used to communicate and send data back and forth. | Trace logs on requests from AML services to the clusters.|
- |healthcheck |--| The logs contain azureml namespace resource (AML extension) status to diagnostic what make the extension not functional. |
+ |healthcheck |--| The logs contain azureml namespace resource (AML extension) status to diagnose what make the extension not functional. |
  |inference-operator-controller-manager| Manage the lifecycle of inference endpoints. |The logs contain AML inference endpoint and deployment pod status in the cluster.|
  | metrics-controller-manager | Manage the configuration for Prometheus.|Trace logs for status of uploading training job and inference deployment metrics on CPU utilization and memory utilization.|
  | relayserver | relayserver is only needed in arc-connected cluster and will not be installed in AKS cluster.| Relayserver works with Azure Relay to communicate with the cloud services. The logs contain request level info from Azure relay. |
