ci instability: failover nvb: force-delete worker pod 0 failed #921

@jgehrcke

https://github.com/NVIDIA/k8s-dra-driver-gpu/actions/runs/22759033700/job/66010763180#step:3:260

# 2026-03-06T10:20:50.480Z [  11.5s] sleep, pre-injection jitter: 1.47484 s
# 2026-03-06T10:20:51.959Z [  13.0s] inject fault type 1: force-delete worker pod 0
# + kubectl delete pod test-failover-job-worker-0 --grace-period=0 --force
# Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
# pod "test-failover-job-worker-0" force deleted from default namespace
# + set +x
# 2026-03-06T10:25:38.629Z [ 299.6s] global deadline reached (300 seconds), collect debug data -- and leave control loop
...
# nvidia-dra-driver-gpu   computedomain-daemon-c41e373a-dce3-4e9f-b86a-eb4110b0abc7-5vm97   1/1     Running     0          4m59s   192.168.35.146   gb-nvl-156-compute17   <none>           <none>
# nvidia-dra-driver-gpu   computedomain-daemon-c41e373a-dce3-4e9f-b86a-eb4110b0abc7-tzbqn   1/1     Running     0          4m40s   192.168.34.120   gb-nvl-156-compute18   <none>           <none>

Reading this log output, I think the ComputeDomain (CD) daemon log follower may be broken as of today, probably because of a rename that we recently performed.
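One way to check that suspicion by hand is to list the daemon pods by name prefix and confirm that their logs can still be fetched. The sketch below is only an illustration: `filter_pods_by_prefix` is a hypothetical helper, not something from the test suite, and the namespace `nvidia-dra-driver-gpu` and prefix `computedomain-daemon-` are taken from the pod listing above. If the rename changed the pod name prefix, the filter would return nothing, which would match the symptom.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the repository's test suite.
# Reads `kubectl get pods -o name` style lines (pod/<name>) on stdin and
# prints the pod names that start with the given prefix.
filter_pods_by_prefix() {
  local prefix="$1"
  local line name
  while IFS= read -r line; do
    name="${line#pod/}"          # strip the leading "pod/" if present
    case "$name" in
      "$prefix"*) printf '%s\n' "$name" ;;
    esac
  done
}

# Example usage against a live cluster (namespace and prefix as seen in the
# pod listing in this issue; adjust if the rename changed them):
#   kubectl get pods -n nvidia-dra-driver-gpu -o name \
#     | filter_pods_by_prefix computedomain-daemon- \
#     | while IFS= read -r pod; do
#         kubectl logs -n nvidia-dra-driver-gpu "$pod" --tail=5
#       done
```

An empty result from the filter, while `kubectl get pods` still shows running daemon pods, would suggest the follower is matching on a stale name.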

Metadata

Labels: ci-instability (non-deterministic CI / build failure)
