Skip to content

Commit 59e7b3e

Browse files
committed
feat(operator): implement per-node priority ordering
Implements per-node priority ordering where each node progresses through skyhooks independently, preventing slow/blocked nodes from blocking healthy nodes. Includes comprehensive e2e test documentation, bug fixes, and validation. Core Changes: - Per-node priority ordering: nodes progress independently through priorities - Nodes no longer wait for all other nodes to complete a priority level - Added IsNodeReadyForSkyhook() to check per-node priority readiness - Fixed CollectNodeStatus() to show in_progress if ANY node is progressing - Updated waiting/blocked logic to be per-node, not global - Changed runtime-required to be per node, not by skyhook. Tests: - Added critical assertions proving concurrent execution at different priorities - Updated runtime-required test for per-node taint removal - Added 7 unit tests for CollectNodeStatus per-node behavior Documentation: - Added comprehensive README.md for all chainsaw e2e tests - Updated ordering_of_skyhooks.md with per-node behavior details - Updated runtime_required.md for per-node taint removal - Documented test scenarios, key features, and file descriptions Infrastructure: - Regenerated mocks for SkyhookNode with new methods - Fixed license formatter to skip deleted files This enables per-node parallelism while maintaining strict ordering guarantees per node, preventing cluster-wide deadlocks from bad nodes.
1 parent 1962f15 commit 59e7b3e

File tree

88 files changed

+4344
-454
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

88 files changed

+4344
-454
lines changed

docs/ordering_of_skyhooks.md

Lines changed: 21 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,30 @@
11
# Ordering of Skyhooks
22
## What
3-
With v0.8.0 Skyhooks now always get applied in a repeatable and specific order. This also means that all Skyhooks will now be sequential, though packages within a Skyhook can be parallel. Each custom resource now supports a `priority` field which is a non-zero positive integer. Skyhooks will be processed in order starting from 1, any Skyhooks with the same `priority` will be processed by sorting them by their `metadata.name` field.
3+
Skyhooks are applied in a repeatable and specific order based on their `priority` field. Each custom resource supports a `priority` field which is a non-zero positive integer. Skyhooks will be processed in order starting from 1, any Skyhooks with the same `priority` will be processed by sorting them by their `metadata.name` field.
44

55
**NOTE**: Any Skyhook which does NOT provide a `priority` field will be assigned a priority value of 200.
66

7-
Two additional flow control features have been added with this and can be set in the annotations of each skyhook:
7+
## Per-Node Ordering
8+
9+
**Important**: Priority ordering is enforced **per-node**, not globally across all nodes. This means:
10+
- Node A can proceed to Skyhook 2 as soon as Skyhook 1 completes on Node A
11+
- Node A does NOT wait for Node B to complete Skyhook 1
12+
- If Node B is stuck on Skyhook 1, Node A can still progress through all its skyhooks
13+
14+
This per-node behavior prevents deadlocks where a few stuck/bad nodes would block all other healthy nodes from progressing through their skyhook sequence.
15+
16+
### Example
17+
With two nodes (A, B) and two skyhooks (priority 1 and priority 2):
18+
- Node A completes Skyhook 1 → Node A immediately starts Skyhook 2
19+
- Node B is still processing Skyhook 1 → Node B shows "waiting" status on Skyhook 2
20+
- Node A completes Skyhook 2 → Node A is fully complete
21+
- Node B eventually completes Skyhook 1 → Node B starts Skyhook 2
22+
23+
## Flow Control Annotations
24+
25+
Two flow control features can be set in the annotations of each skyhook:
826
* `skyhook.nvidia.com/disable`: bool. When `true` it will skip this Skyhook from processing and continue with any other ones further down the priority order.
9-
* `skyhook.nvidia.com/pause`: bool. When `true` it will NOT process this Skyhook and it WILL NOT continue to process any Skyhook's after this one. This will effectively stop all application of Skyhooks starting with this one. NOTE: This ability used to be on the Skyhook spec itself as the `pause` field and has been moved here to be consistent with `disable` and to avoid incrementing the generation of a Skyhook Custom Resource instance when changing it.
27+
* `skyhook.nvidia.com/pause`: bool. When `true` it will NOT process this Skyhook and it WILL NOT continue to process any Skyhook's after this one on that node. This will effectively stop all application of Skyhooks starting with this one. NOTE: This ability used to be on the Skyhook spec itself as the `pause` field and has been moved here to be consistent with `disable` and to avoid incrementing the generation of a Skyhook Custom Resource instance when changing it.
1028

1129
## Why
1230
This solves a few problems:

docs/runtime_required.md

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -18,12 +18,18 @@ before the nodes that it targets are considered available for general use.
1818
1. It will NOT add the taint to any nodes targeted by a SCR with `runtimeRequired: true`
1919

2020
# Details
21-
## When is a node considered ready
22-
When all of the following is true per node:
23-
1. All SCRs with `runtimeRequired: true` are complete (ie complete on all nodes)
21+
## When is the runtime-required taint removed from a node
22+
The taint is removed from a node when all SCRs with `runtimeRequired: true` that target that node are complete **on that specific node**.
2423

25-
## What happens happens when a node is considered ready
26-
1. The runtime required taint is removed from the node if it exists.
24+
**Important**: Taint removal is per-node, not per-skyhook. This means:
25+
- Node A's taint is removed when all runtime-required skyhooks complete on Node A
26+
- Node A does NOT wait for Node B to complete those same skyhooks
27+
- If Node B is stuck or failing, Node A can still have its taint removed and become available
28+
29+
This per-node behavior prevents deadlocks where a few bad nodes would block all other healthy nodes from becoming available.
30+
31+
## What happens when the taint is removed
32+
1. The node becomes available for general workload scheduling (pods without the runtime-required toleration can now be scheduled on it).
2733

2834
# Why would you use runtime required
2935
This is useful when you want to gate other work behind the successful completion of some set of Skyhook Packages. This can be for security reasons or for scheduling.
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
# CLI Lifecycle Test
2+
3+
## Purpose
4+
5+
Validates all CLI lifecycle commands for controlling Skyhook processing state.
6+
7+
## Test Scenario
8+
9+
1. Reset state from previous runs
10+
2. Create a skyhook and wait for it to start processing
11+
3. Test pause command:
12+
- Run `skyhook pause <name>`
13+
- Assert the skyhook shows paused status
14+
4. Test resume command:
15+
- Run `skyhook resume <name>`
16+
- Assert the skyhook resumes processing
17+
5. Test disable command:
18+
- Run `skyhook disable <name>`
19+
- Assert the skyhook is disabled
20+
6. Test enable command:
21+
- Run `skyhook enable <name>`
22+
- Assert the skyhook is enabled and processing
23+
7. Test pause and disable together:
24+
- Apply both pause and disable to the skyhook
25+
- Assert both flags are set
26+
- Run `skyhook resume` - should only remove pause, not disable
27+
- Assert the skyhook is still disabled
28+
- Run `skyhook enable` - should remove disable
29+
- Assert the skyhook is fully enabled
30+
31+
## Key Features Tested
32+
33+
- `skyhook pause` - Pauses a Skyhook from processing
34+
- `skyhook resume` - Resumes a paused Skyhook
35+
- `skyhook disable` - Disables a Skyhook completely
36+
- `skyhook enable` - Enables a disabled Skyhook
37+
- Independence of pause and disable flags
38+
- Resume only affects pause, not disable
39+
40+
## Files
41+
42+
- `chainsaw-test.yaml` - Main test configuration
43+
- `skyhook.yaml` - Test skyhook
44+
- `assert-paused.yaml` - Paused state assertion
45+
- `assert-resumed.yaml` - Resumed state assertion
46+
- `assert-disabled.yaml` - Disabled state assertion
47+
- `assert-enabled.yaml` - Enabled state assertion
48+
- `assert-paused-and-disabled.yaml` - Both flags set assertion
49+
- `assert-still-disabled.yaml` - Resume doesn't remove disable assertion

k8s-tests/chainsaw/cli/lifecycle/chainsaw-test.yaml

Lines changed: 1 addition & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
33
#
44
#
@@ -20,12 +20,6 @@ kind: Test
2020
metadata:
2121
name: cli-lifecycle
2222
spec:
23-
description: |
24-
Test all CLI lifecycle commands:
25-
- pause: pauses a Skyhook from processing
26-
- resume: resumes a paused Skyhook
27-
- disable: disables a Skyhook completely
28-
- enable: enables a disabled Skyhook
2923
timeouts:
3024
assert: 60s
3125
exec: 30s
@@ -127,4 +121,3 @@ spec:
127121
- script:
128122
content: |
129123
kubectl delete skyhook cli-lifecycle-test 2>/dev/null || true
130-
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
# CLI Node Test
2+
3+
## Purpose
4+
5+
Validates all CLI node commands for managing node state within a Skyhook.
6+
7+
## Test Scenario
8+
9+
1. Reset state from previous runs
10+
2. Create a skyhook and wait for it to complete
11+
3. Test node list command:
12+
- Run `skyhook node list <skyhook>`
13+
- Verify output shows targeted nodes
14+
4. Test node status command:
15+
- Run `skyhook node status [node]`
16+
- Verify output shows Skyhook activity on nodes
17+
5. Test node ignore command:
18+
- Run `skyhook node ignore <skyhook> <node>`
19+
- Assert the node is excluded from processing
20+
6. Test node unignore command:
21+
- Run `skyhook node unignore <skyhook> <node>`
22+
- Assert the node is included back in processing
23+
7. Test node reset command:
24+
- Run `skyhook node reset <skyhook> <node>`
25+
- Assert the package state is reset on the node
26+
27+
## Key Features Tested
28+
29+
- `skyhook node list` - Shows nodes targeted by a Skyhook
30+
- `skyhook node status` - Shows Skyhook activity on nodes
31+
- `skyhook node ignore` - Excludes a node from processing
32+
- `skyhook node unignore` - Includes a node back in processing
33+
- `skyhook node reset` - Resets package state on a node
34+
35+
## Files
36+
37+
- `chainsaw-test.yaml` - Main test configuration
38+
- `skyhook.yaml` - Test skyhook
39+
- `assert-node-ignored.yaml` - Ignored state assertion
40+
- `assert-node-unignored.yaml` - Unignored state assertion
41+
- `assert-node-reset.yaml` - Reset state assertion

k8s-tests/chainsaw/cli/node/chainsaw-test.yaml

Lines changed: 1 addition & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
33
#
44
#
@@ -20,12 +20,6 @@ kind: Test
2020
metadata:
2121
name: cli-node
2222
spec:
23-
description: |
24-
Test all CLI node commands:
25-
- node list: shows nodes targeted by a Skyhook
26-
- node status: shows Skyhook activity on nodes
27-
- node ignore/unignore: excludes/includes nodes from processing
28-
- node reset: resets package state on a node
2923
timeouts:
3024
assert: 120s
3125
exec: 90s
@@ -116,4 +110,3 @@ spec:
116110
for NODE in $(kubectl get nodes -l skyhook.nvidia.com/test-node=skyhooke2e -o jsonpath='{.items[*].metadata.name}'); do
117111
kubectl label node "$NODE" skyhook.nvidia.com/ignore- 2>/dev/null || true
118112
done
119-
Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
# CLI Package Test
2+
3+
## Purpose
4+
5+
Validates all CLI package commands for inspecting and managing package state.
6+
7+
## Test Scenario
8+
9+
1. Reset state from previous runs
10+
2. Create a skyhook and wait for it to complete
11+
3. Test package status command:
12+
- Run `skyhook package status <skyhook> <package>`
13+
- Verify output shows package status across nodes
14+
4. Test package logs command:
15+
- Run `skyhook package logs <skyhook> <package>`
16+
- Verify logs are retrieved from package pods
17+
5. Test package rerun command:
18+
- Run `skyhook package rerun <skyhook> <package>`
19+
- Assert the package is re-run on the node
20+
21+
## Key Features Tested
22+
23+
- `skyhook package status` - Shows package status across nodes
24+
- `skyhook package logs` - Retrieves logs from package pods
25+
- `skyhook package rerun` - Forces a package to re-run
26+
27+
## Files
28+
29+
- `chainsaw-test.yaml` - Main test configuration
30+
- `skyhook.yaml` - Test skyhook
31+
- `assert-complete.yaml` - Initial completion assertion
32+
- `assert-package-rerun.yaml` - Rerun state assertion

k8s-tests/chainsaw/cli/package/chainsaw-test.yaml

Lines changed: 1 addition & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
33
#
44
#
@@ -20,11 +20,6 @@ kind: Test
2020
metadata:
2121
name: cli-package
2222
spec:
23-
description: |
24-
Test all CLI package commands:
25-
- package status: shows package status across nodes
26-
- package logs: retrieves logs from package pods
27-
- package rerun: forces a package to re-run
2823
timeouts:
2924
assert: 120s
3025
exec: 90s
@@ -96,4 +91,3 @@ spec:
9691
- script:
9792
content: |
9893
../skyhook-cli reset cli-package-test --confirm 2>/dev/null || true
99-

k8s-tests/chainsaw/cli/readme.md

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# CLI Tests
2+
3+
This directory contains end-to-end tests for the `kubectl-skyhook` CLI plugin. These tests validate that all CLI commands work correctly against a real Kubernetes cluster.
4+
5+
## Prerequisites
6+
7+
The CLI tests require:
8+
1. A running Kind cluster with the skyhook operator installed
9+
2. Nodes labeled with `skyhook.nvidia.com/test-node=skyhooke2e`
10+
3. The `skyhook` CLI binary built with coverage enabled
11+
12+
## Tests
13+
14+
| Test | Description |
15+
|------|-------------|
16+
| [lifecycle](./lifecycle/) | Pause, resume, disable, and enable commands |
17+
| [node](./node/) | Node list, status, ignore, unignore, and reset commands |
18+
| [package](./package/) | Package status, logs, and rerun commands |
19+
| [reset](./reset/) | Skyhook reset command |
20+
21+
## CLI Commands Tested
22+
23+
### Lifecycle Commands
24+
- `skyhook pause <skyhook>` - Pauses a Skyhook from processing
25+
- `skyhook resume <skyhook>` - Resumes a paused Skyhook
26+
- `skyhook disable <skyhook>` - Disables a Skyhook completely
27+
- `skyhook enable <skyhook>` - Enables a disabled Skyhook
28+
29+
### Node Commands
30+
- `skyhook node list <skyhook>` - Shows nodes targeted by a Skyhook
31+
- `skyhook node status [node]` - Shows Skyhook activity on nodes
32+
- `skyhook node ignore <skyhook> <node>` - Excludes a node from processing
33+
- `skyhook node unignore <skyhook> <node>` - Includes a node back in processing
34+
- `skyhook node reset <skyhook> <node>` - Resets package state on a node
35+
36+
### Package Commands
37+
- `skyhook package status <skyhook> <package>` - Shows package status across nodes
38+
- `skyhook package logs <skyhook> <package>` - Retrieves logs from package pods
39+
- `skyhook package rerun <skyhook> <package>` - Forces a package to re-run
40+
41+
### Reset Command
42+
- `skyhook reset <skyhook>` - Resets all nodes for a Skyhook
43+
44+
## Running the Tests
45+
46+
```bash
47+
# Run all CLI tests
48+
make cli-e2e-tests
49+
50+
# Run a specific test
51+
cd k8s-tests/chainsaw/cli
52+
chainsaw test --test-dir lifecycle
53+
```
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
# CLI Reset Test
2+
3+
## Purpose
4+
5+
Validates the skyhook reset command for resetting all nodes in a Skyhook.
6+
7+
## Test Scenario
8+
9+
1. Create a skyhook and wait for it to complete
10+
2. Disable the skyhook to prevent re-processing after reset
11+
3. Test reset command:
12+
- Run `skyhook reset <name>`
13+
- Assert all nodes are reset to initial state
14+
4. Verify node annotations are cleared
15+
16+
## Key Features Tested
17+
18+
- `skyhook reset` - Resets all nodes for a Skyhook
19+
- Node state cleanup
20+
- Annotation removal
21+
22+
## Files
23+
24+
- `chainsaw-test.yaml` - Main test configuration
25+
- `skyhook.yaml` - Test skyhook
26+
- `assert-skyhook-complete.yaml` - Initial completion assertion
27+
- `assert-skyhook-disabled.yaml` - Disabled state assertion
28+
- `assert-nodes-reset.yaml` - Reset state assertion

0 commit comments

Comments
 (0)