Commit 235de8b

Author: sophie zhao
add additional instructions for troubleshooting OOM kill issue
1 parent 8cfbf74 commit 235de8b

1 file changed: +16 -4 lines changed

articles/machine-learning/how-to-troubleshoot-kubernetes-extension.md

Lines changed: 16 additions & 4 deletions
@@ -79,13 +79,25 @@ When you request support, we recommend that you run the following command and se
 kubectl logs healthcheck -n azureml
 ```
 ## Extension-operator pod in azure-arc/kube-system namespace is crashing due to OOMKill
-This issue happens if the extension's helm chart size is large and there are multiple Helm releases on the cluster. Here is a sample script to help clean up the helm history on the cluster:
+This issue occurs when the extension's Helm chart size is large and there are multiple Helm releases on the cluster.
+To check the Helm history of the Azure ML extension, use the following commands:
+```
+# Check if there is a release of the Azure ML extension Helm chart installed on the cluster
+# Note: The default namespace for the extension is usually 'azureml'. If you specified a different namespace during installation, replace 'azureml' with your namespace.
+helm list -n azureml
+
+# Get helm history
+# Note: <release-name> can be retrieved from the output of the previous command
+helm history -n azureml <release-name>
+```
+There is a Helm history limit of 10 revisions, but this limit applies only to revisions in a non-transient state.
+If you see multiple revisions in a pending-rollback or pending-upgrade state in the Helm history output, run the script below to clean up the Helm history on the cluster:
 ```
 #!/bin/bash
 
 # Set release name and namespace
-RELEASE_NAME=$1
-NAMESPACE=$2
+RELEASE_NAME=$1 # release_name is the name of the azure ml extension helm release
+NAMESPACE=$2 # namespace is the azure ml extension's namespace. Default value is azureml
 
 # Validate input
 if [[ -z "$RELEASE_NAME" || -z "$NAMESPACE" ]]; then
@@ -124,7 +136,7 @@ How to run the script:
 ```
 chmod +x delete_stuck_helm_secrets.sh
 
-./delete_stuck_helm_secrets.sh my-release my-namespace
+./delete_stuck_helm_secrets.sh <release_name> <extension_namespace>
 ```
 
 ### Error Code of HealthCheck
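
The check this commit describes (looking for revisions stuck in a pending-rollback or pending-upgrade state in the `helm history` output) could be sketched as a small helper. This is a minimal sketch and not part of the commit: the function name `count_pending_revisions` is hypothetical, and it assumes the default table output of `helm history`, where the status appears as its own whitespace-separated field.

```shell
#!/bin/bash

# Hypothetical helper (not part of the commit): counts revisions stuck in a
# transient state. Reads `helm history` table output on stdin and prints the
# number of rows whose status field is pending-upgrade or pending-rollback.
count_pending_revisions() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "pending-upgrade" || $i == "pending-rollback") { n++; break }
    }
  }
  END { print n + 0 }'
}

# Example usage (requires helm and access to the cluster):
#   helm history -n azureml <release-name> | count_pending_revisions
```

A count greater than zero suggests the cleanup script in the changed file may be worth running. The helper reads from stdin so it can be tried on saved output without cluster access.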
