Description
We are currently running HAMI version 2.3.13 in a production Kubernetes cluster (v1.30.2) and have hit a critical issue: the NVIDIA 5090 GPU is not recognized and cannot be used with this version. To resolve this compatibility issue, we plan to upgrade HAMI to 2.5.0 or later, since versions 2.5.0 and above are confirmed to support the NVIDIA 5090 GPU.
However, the official upgrade documentation only describes running helm uninstall followed by helm install, which completely stops and redeploys the entire HAMI service. That would cause unacceptable downtime for our 24/7 online business: we rely on HAMI for GPU-accelerated workloads in production, and any downtime directly impacts core business operations.
Expected Solution
We are looking for a smooth/rolling upgrade method for HAMI (e.g., in-place update, DaemonSet rolling update, or Helm upgrade with zero downtime) that:
- Avoids full uninstall/install of the Helm release (to prevent complete service outage)
- Minimizes or eliminates downtime for online GPU-dependent services
- Ensures compatibility with NVIDIA 5090 GPU after upgrading to 2.5.0+
- Preserves existing configurations (e.g., GPU scheduling rules, node labels, runtime settings, resource quotas) as much as possible
- Works stably on Kubernetes v1.30.2
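To make the request concrete, the kind of in-place upgrade we have in mind would look roughly like the sketch below. It has not been tested against HAMI. The release name (hami), chart reference (hami-charts/hami), namespace (kube-system), and DaemonSet name (hami-device-plugin) are assumptions and may differ per installation. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: every name below is an assumption, adjust to your installation.
RELEASE="${RELEASE:-hami}"                  # assumed Helm release name
CHART="${CHART:-hami-charts/hami}"          # assumed chart reference
NAMESPACE="${NAMESPACE:-kube-system}"       # assumed install namespace
TARGET_VERSION="${TARGET_VERSION:-2.5.0}"   # target chart version
DAEMONSET="${DAEMONSET:-hami-device-plugin}" # assumed DaemonSet name
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

upgrade_in_place() {
  # Refresh the local chart index so the target version is visible.
  run helm repo update
  # Upgrade the existing release instead of uninstalling it;
  # --reuse-values keeps the values of the current release.
  run helm upgrade "$RELEASE" "$CHART" --namespace "$NAMESPACE" \
    --version "$TARGET_VERSION" --reuse-values --wait --timeout 10m
  # Wait until the DaemonSet pods have been replaced node by node.
  run kubectl --namespace "$NAMESPACE" rollout status \
    "daemonset/$DAEMONSET" --timeout=10m
}

upgrade_in_place
```

Note that --reuse-values carries over the old release's values and does not pick up defaults newly introduced by the chart, so whether this is safe across the 2.3.13 to 2.5.0 jump is exactly what we would like confirmed.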
Environment
- HAMI Current Version: 2.3.13
- HAMI Target Version: 2.5.0+
- Kubernetes Version: v1.30.2
- GPU Model: NVIDIA 5090