tt-smi -s causes System Hang/Device Fault in Kubernetes Environment (concurrent with tt-topology/ttnn) #137

@hongsu

Description: We are operating in a Kubernetes environment. We monitor our devices by periodically executing the tt-smi -s command (e.g., via liveness probe, sidecar, or cronjob) to collect metrics and heartbeats.

However, we have observed a critical stability issue: when tt-smi -s is executed while a Pod is running tt-topology or ttnn workloads, either the system hangs or the device enters a fault state.

Steps to Reproduce:

  1. Deploy a Pod running a workload using ttnn or tt-topology in a Kubernetes cluster.
  2. While the workload is active, execute tt-smi -s periodically (e.g., inside the same container, a sidecar, or via kubectl exec).
  3. Observe that the process hangs or the device enters a fault state.
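The polling in step 2 can be sketched as a small script that runs the command on a fixed interval and records the first run that fails or does not return. This is only an illustration of how we drive the probe, not our actual sidecar code: the function names, the interval, and the timeout values are hypothetical. One caveat is that a process blocked inside the driver may not respond to being killed, in which case the timeout itself may not return.

```python
import subprocess
import time
from typing import List


def run_probe(cmd: List[str], timeout_s: float) -> str:
    """Run the command once and classify the result.

    Returns "ok" (exit code 0), "error" (non-zero exit code),
    "timeout" (no exit within timeout_s), or "missing" (binary not found).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # output is discarded in this sketch
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "error"


def poll(cmd: List[str], interval_s: float, timeout_s: float,
         iterations: int) -> List[str]:
    """Run the probe up to `iterations` times, stopping at the first non-ok run."""
    outcomes: List[str] = []
    for i in range(iterations):
        outcome = run_probe(cmd, timeout_s)
        outcomes.append(outcome)
        print(f"probe {i}: {outcome}")
        if outcome != "ok":
            break  # stop so the failing iteration is easy to find in the log
        time.sleep(interval_s)
    return outcomes
```

For example, `poll(["tt-smi", "-s"], interval_s=10.0, timeout_s=30.0, iterations=100)` run alongside an active ttnn workload would log which iteration first timed out or returned an error.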

Expected Behavior: tt-smi commands should be able to run safely for monitoring purposes within a Kubernetes environment without interfering with active ttnn/tt-topology workloads.

Questions & Requests: We need a solution to continue monitoring the device status in our cluster without causing instability.

  1. Alternative Monitoring Methods: Is there any other recommended way to collect metrics and heartbeats safely while tt-topology or ttnn is in use?
  2. Fix Timeline: If this requires a fix in the library, driver, or device plugin, could you please provide an estimated timeline (ETA)?

Environment:

  • Orchestrator: Kubernetes
  • OS / Kernel: Ubuntu 22.04
  • Machine: LoudBox
