Commit b45bd2d
Changed node selection (#1881)
This PR adds a reusable withHealthyNode(...) wrapper and a set of health checks that verify a node before any heavy work runs. If a node fails the checks (e.g., Docker daemon down, missing GPU devices), the node is blacklisted and the pipeline automatically retries on the next candidate until a healthy executor is found or the retry budget is exhausted.
1 parent 1358852
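The retry behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this commit: `with_healthy_node`, `NoHealthyNodeError`, the `blacklist` set, and the `max_attempts` parameter are all hypothetical names chosen for the example, and it assumes each health check is a function that takes a node name and returns `True` when the node passes.

```python
from typing import Callable, Iterable, List, Set, TypeVar

T = TypeVar("T")

# Assumption: a health check takes a node name and returns True if the node is usable.
HealthCheck = Callable[[str], bool]


class NoHealthyNodeError(RuntimeError):
    """Raised when no candidate passes the checks within the retry budget."""


def with_healthy_node(
    candidates: Iterable[str],
    checks: List[HealthCheck],
    work: Callable[[str], T],
    blacklist: Set[str],
    max_attempts: int = 3,
) -> T:
    """Run `work` on the first candidate that passes every health check.

    Nodes that fail a check are added to `blacklist` and skipped afterwards.
    Only nodes that are actually probed count against `max_attempts`.
    """
    attempts = 0
    for node in candidates:
        if node in blacklist:
            continue  # already known to be unhealthy, do not probe again
        if attempts >= max_attempts:
            break  # retry budget exhausted
        attempts += 1
        if all(check(node) for check in checks):
            # Healthy node found: run the heavy work here.
            # Exceptions raised by `work` itself are not retried in this sketch.
            return work(node)
        blacklist.add(node)
    raise NoHealthyNodeError(
        f"no healthy node found after probing {attempts} candidate(s)"
    )
```

In this sketch the blacklist is passed in by the caller so it can be shared across several invocations; whether the actual wrapper keeps its blacklist per build or globally is not stated in the commit message.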
1 file changed, +730 -915 lines changed