Description: We are operating in a Kubernetes environment. We monitor our devices by periodically executing the tt-smi -s command (e.g., via liveness probe, sidecar, or cronjob) to collect metrics and heartbeats.
However, we have observed a critical stability issue: when tt-smi -s is executed while a Pod is running tt-topology or ttnn workloads, the system either hangs or ends up in a Device Fault.
Steps to Reproduce:
- Deploy a Pod running a workload using ttnn or tt-topology in a Kubernetes cluster.
- While the workload is active, execute tt-smi -s periodically (e.g., inside the same container, a sidecar, or via kubectl exec).
- Observe that the process hangs or the device enters a fault state.
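For reference, the periodic polling in step 2 can be approximated with a small script like the one below. This is only a sketch of the kind of loop we run, not our exact probe: the helper names (`poll_once`, `poll_loop`), the interval, and the timeout are illustrative assumptions, and the only real command involved is tt-smi -s. The timeout is there so a hung invocation is recorded as a hang instead of blocking the loop forever. Note that if the process is stuck in an uninterruptible kernel state, killing it after the timeout may itself not return.

```python
import subprocess
import time
from typing import List, Sequence


def poll_once(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd once and classify the result.

    Returns 'ok' (exit code 0), 'error' (non-zero exit code),
    'hang' (no exit within timeout_s) or 'missing' (binary not found).
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "error"


def poll_loop(
    cmd: Sequence[str],
    interval_s: float,
    timeout_s: float,
    iterations: int,
) -> List[str]:
    """Call poll_once repeatedly; stop early on the first hang."""
    outcomes: List[str] = []
    for _ in range(iterations):
        outcome = poll_once(cmd, timeout_s)
        print(f"{time.strftime('%H:%M:%S')} {' '.join(cmd)} -> {outcome}")
        outcomes.append(outcome)
        if outcome == "hang":
            break
        time.sleep(interval_s)
    return outcomes
```

To mirror the reproduction, one would call `poll_loop(["tt-smi", "-s"], interval_s=10.0, timeout_s=30.0, iterations=60)` inside the workload container or a sidecar while the ttnn or tt-topology workload is active. The 10 s interval and 30 s timeout are placeholder values, not measurements from our cluster.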
Expected Behavior: tt-smi commands should be able to run safely for monitoring purposes within a Kubernetes environment without interfering with active ttnn/tt-topology workloads.
Questions & Requests: We need a solution to continue monitoring the device status in our cluster without causing instability.
- Alternative Monitoring Methods: Is there any other recommended way to collect metrics and heartbeats safely while tt-topology or ttnn is in use?
- Fix Timeline: If this requires a fix in the library, driver, or device plugin, could you please provide an estimated timeline (ETA)?
Environment:
- Orchestrator: Kubernetes
- OS / Kernel: Ubuntu 22.04
- Machine: LoudBox