Commit 4879e0f

feat(deployment-policy): add batch state reset with auto-reset, CLI, and config
Add deployment policy batch state reset to ensure rollouts start fresh instead of continuing with scaled-up batch sizes. Includes:
- Auto-reset on rollout completion and spec version change
- Manual CLI command: `deployment-policy reset` (alias: `dp reset`)
- `resetBatchStateOnCompletion` field on DeploymentPolicy and Skyhook
- Config precedence: Skyhook override > DeploymentPolicy > default
- `--skip-batch-reset` flag on existing `reset` command
- E2E tests for auto-reset, config precedence, and CLI reset
- Documentation for deployment policy and CLI
1 parent 94c30c8 commit 4879e0f
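
The config precedence named in the commit message (Skyhook override > DeploymentPolicy > default) can be sketched roughly as below. This is only an illustration: `resolveResetOnCompletion` and its pointer-based "unset" convention are hypothetical and are not taken from the operator's code. The fallback is passed in as a parameter rather than hard-coded.

```go
package main

import "fmt"

// resolveResetOnCompletion is a hypothetical helper, not the operator's API.
// A nil pointer means the field was not set at that level.
func resolveResetOnCompletion(skyhookOverride, policySetting *bool, fallback bool) bool {
	if skyhookOverride != nil {
		return *skyhookOverride // 1. per-Skyhook override wins
	}
	if policySetting != nil {
		return *policySetting // 2. DeploymentPolicy setting
	}
	return fallback // 3. default
}

func main() {
	yes, no := true, false
	fmt.Println(resolveResetOnCompletion(&no, &yes, true)) // Skyhook override wins
	fmt.Println(resolveResetOnCompletion(nil, &no, true))  // policy setting wins
	fmt.Println(resolveResetOnCompletion(nil, nil, true))  // fallback applies
}
```

Modelling "unset" as a nil pointer is one common way to tell an explicit `false` apart from an omitted field; whether the commit does this is an assumption.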

45 files changed, 3208 additions and 32 deletions

chart/templates/cleanup-skyhooks-job.yaml

Lines changed: 31 additions & 0 deletions
@@ -19,6 +19,27 @@ spec:
       tolerations:
         {{- toYaml . | nindent 6 }}
       {{- end }}
+      {{- with .Values.controllerManager.selectors }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- if .Values.controllerManager.nodeAffinity.matchExpressions }}
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                {{- range .Values.controllerManager.nodeAffinity.matchExpressions }}
+                - key: {{ .key }}
+                  operator: {{ .operator }}
+                  {{- if .values }}
+                  values:
+                    {{- range .values }}
+                    - {{ . }}
+                    {{- end }}
+                  {{- end }}
+                {{- end }}
+      {{- end }}
       securityContext:
         runAsNonRoot: true
         runAsUser: 10001
@@ -59,13 +80,23 @@ spec:
             - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
               name: kube-api-access
               readOnly: true
+          # Guard against limitRange being disabled (set to null/false) with fallback defaults
           resources:
+            {{- if .Values.limitRange }}
             limits:
               cpu: {{ .Values.limitRange.default.cpu }}
               memory: {{ .Values.limitRange.default.memory }}
             requests:
               cpu: {{ .Values.limitRange.defaultRequest.cpu }}
               memory: {{ .Values.limitRange.defaultRequest.memory }}
+            {{- else }}
+            limits:
+              cpu: 500m
+              memory: 512Mi
+            requests:
+              cpu: 250m
+              memory: 256Mi
+            {{- end }}
           command:
             - /bin/sh
             - -c

chart/templates/cleanup-webhook-job.yaml

Lines changed: 35 additions & 0 deletions
@@ -14,6 +14,31 @@ spec:
       restartPolicy: Never
       automountServiceAccountToken: false
       serviceAccountName: {{ include "chart.fullname" . }}-controller-manager
+      {{- with .Values.controllerManager.tolerations }}
+      tolerations:
+        {{- toYaml . | nindent 6 }}
+      {{- end }}
+      {{- with .Values.controllerManager.selectors }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- if .Values.controllerManager.nodeAffinity.matchExpressions }}
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                {{- range .Values.controllerManager.nodeAffinity.matchExpressions }}
+                - key: {{ .key }}
+                  operator: {{ .operator }}
+                  {{- if .values }}
+                  values:
+                    {{- range .values }}
+                    - {{ . }}
+                    {{- end }}
+                  {{- end }}
+                {{- end }}
+      {{- end }}
       securityContext:
         runAsNonRoot: true
         runAsUser: 10001
@@ -54,13 +79,23 @@ spec:
             - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
               name: kube-api-access
               readOnly: true
+          # Guard against limitRange being disabled (set to null/false) with fallback defaults
           resources:
+            {{- if .Values.limitRange }}
             limits:
               cpu: {{ .Values.limitRange.default.cpu }}
               memory: {{ .Values.limitRange.default.memory }}
             requests:
               cpu: {{ .Values.limitRange.defaultRequest.cpu }}
               memory: {{ .Values.limitRange.defaultRequest.memory }}
+            {{- else }}
+            limits:
+              cpu: 500m
+              memory: 512Mi
+            requests:
+              cpu: 250m
+              memory: 256Mi
+            {{- end }}
           command:
             - /bin/sh
             - -c

chart/templates/deploymentpolicy-crd.yaml

Lines changed: 5 additions & 0 deletions
@@ -267,6 +267,11 @@ spec:
                 required:
                 - strategy
                 type: object
+              resetBatchStateOnCompletion:
+                default: false
+                description: ResetBatchStateOnCompletion controls whether batch state
+                  is reset when rollout completes or spec changes
+                type: boolean
             required:
             - default
             type: object
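
The field description above says batch state is reset when a rollout completes or the spec changes. A minimal sketch of that decision follows, assuming the effective `resetBatchStateOnCompletion` value has already been resolved; `shouldResetBatchState` and its parameters are hypothetical names chosen for illustration, not identifiers from the operator.

```go
package main

import "fmt"

// shouldResetBatchState is a hypothetical helper: it reports whether an
// auto-reset would fire, given the effective setting and the two trigger events.
func shouldResetBatchState(enabled, rolloutCompleted, specVersionChanged bool) bool {
	// Either trigger is sufficient, but only when auto-reset is enabled.
	return enabled && (rolloutCompleted || specVersionChanged)
}

func main() {
	fmt.Println(shouldResetBatchState(true, true, false))  // enabled, rollout completed
	fmt.Println(shouldResetBatchState(true, false, true))  // enabled, spec version changed
	fmt.Println(shouldResetBatchState(false, true, true))  // disabled: triggers are ignored
	fmt.Println(shouldResetBatchState(true, false, false)) // enabled, nothing happened
}
```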

chart/templates/skyhook-crd.yaml

Lines changed: 10 additions & 0 deletions
@@ -115,6 +115,16 @@ spec:
                 description: DeploymentPolicy is the name of a DeploymentPolicy for
                   rollout settings
                 type: string
+              deploymentPolicyOptions:
+                description: DeploymentPolicyOptions allows per-Skyhook overrides
+                  of DeploymentPolicy settings
+                properties:
+                  resetBatchStateOnCompletion:
+                    default: false
+                    description: ResetBatchStateOnCompletion overrides the DeploymentPolicy
+                      setting for this Skyhook
+                    type: boolean
+                type: object
               interruptionBudget:
                 description: InterruptionBudget configures how many nodes that match
                   node selectors that allowed to be interrupted at once.

docs/cli.md

Lines changed: 78 additions & 0 deletions
@@ -25,6 +25,7 @@ The CLI requires **operator version v0.8.0 or later** for full functionality of
 | `package rerun` | ✅ Full | ✅ Full |
 | `package logs` | ✅ Full | ✅ Full |
 | `reset` | ✅ Full | ✅ Full |
+| `deployment-policy reset` | ❌ Not supported | ✅ Full |
 | `pause` | ❌ Not supported | ✅ Full |
 | `resume` | ❌ Not supported | ✅ Full |
 | `disable` | ❌ Not supported | ✅ Full |
@@ -126,6 +127,62 @@ kubectl skyhook disable my-skyhook --confirm
 kubectl skyhook enable my-skyhook
 ```

+### Reset Command
+
+Reset all package state for a Skyhook, causing re-execution from the beginning.
+
+```bash
+# Reset all nodes for a Skyhook (also resets batch state by default)
+kubectl skyhook reset gpu-init --confirm
+
+# Preview changes without applying (dry-run)
+kubectl skyhook reset gpu-init --dry-run
+
+# Reset nodes only, preserve deployment policy batch state
+kubectl skyhook reset gpu-init --skip-batch-reset --confirm
+```
+
+| Flag | Description |
+|------|-------------|
+| `--confirm, -y` | Skip confirmation prompt |
+| `--skip-batch-reset` | Skip resetting deployment policy batch state |
+
+> **Note:** By default, `reset` also resets the deployment policy batch state so the next rollout starts from batch 1. Use `--skip-batch-reset` to preserve the existing batch state.
+
+### Deployment Policy Commands
+
+Manage deployment policy batch state.
+
+> **Note:** Requires operator v0.8.0+.
+
+```bash
+# Reset batch state for a Skyhook (starts rollout from batch 1)
+kubectl skyhook deployment-policy reset gpu-init --confirm
+
+# Preview what would be reset (dry-run)
+kubectl skyhook deployment-policy reset gpu-init --dry-run
+
+# Using the short alias
+kubectl skyhook dp reset gpu-init --confirm
+```
+
+The `deployment-policy reset` command resets the batch processing state for all compartments in the specified Skyhook, including:
+- Current batch number (reset to 1)
+- Consecutive failure count
+- Completed and failed node counts
+- Stop flag
+
+| Flag | Description |
+|------|-------------|
+| `--confirm, -y` | Skip confirmation prompt |
+
+**When to use**:
+- After a rollout completes and you want to start a new rollout fresh
+- When batch processing is stuck and needs to be reset
+- Before re-running a rollout with the same deployment policy
+
+See [Deployment Policy documentation](deployment_policy.md) for details on auto-reset configuration.
+
 ### Node Commands

 Manage Skyhook nodes across the cluster.
@@ -208,10 +265,12 @@ kubectl skyhook --help
 # Command group help
 kubectl skyhook node --help
 kubectl skyhook package --help
+kubectl skyhook deployment-policy --help

 # Specific command help
 kubectl skyhook node reset --help
 kubectl skyhook package rerun --help
+kubectl skyhook deployment-policy reset --help
 ```

 ## Common Usage Patterns
217276
## Common Usage Patterns
@@ -252,6 +311,18 @@ kubectl skyhook node status
 kubectl skyhook node status --skyhook my-skyhook -o json
 ```

+### Resetting a Rollout
+```bash
+# 1. Full reset: nodes + batch state (starts everything fresh)
+kubectl skyhook reset my-skyhook --confirm
+
+# 2. Or reset only batch state (keep node state, restart batch progression)
+kubectl skyhook deployment-policy reset my-skyhook --confirm
+
+# 3. Or reset only nodes (keep batch progression)
+kubectl skyhook reset my-skyhook --skip-batch-reset --confirm
+```
+
 ### Emergency Stop

 > **Note:** Requires operator v0.8.0+. For older operators, use `kubectl edit skyhook my-skyhook` and set `spec.pause: true`.
@@ -289,7 +360,11 @@ kubectl skyhook node list --skyhook my-skyhook -o yaml
 operator/cmd/cli/app/ # CLI commands
 ├── cli.go # Root command (NewSkyhookCommand)
 ├── version.go # Version command
+├── reset.go # Reset command (nodes + batch state)
 ├── lifecycle.go # Pause, resume, disable, enable commands
+├── deploymentpolicy/ # Deployment policy subcommands
+│   ├── deploymentpolicy.go # Parent command
+│   └── deploymentpolicy_reset.go # Batch state reset
 ├── node/ # Node subcommands
 │   ├── node.go # Parent command
 │   ├── node_list.go
@@ -314,10 +389,13 @@ main()
 └── cli.Execute()
     └── NewSkyhookCommand(ctx)
         ├── NewVersionCmd(ctx)
+        ├── NewResetCmd(ctx)
         ├── NewPauseCmd(ctx)
         ├── NewResumeCmd(ctx)
         ├── NewDisableCmd(ctx)
         ├── NewEnableCmd(ctx)
+        ├── deploymentpolicy.NewDeploymentPolicyCmd(ctx)
+        │   └── NewResetCmd(ctx)
         ├── node.NewNodeCmd(ctx)
         │   ├── NewListCmd(ctx)
         │   ├── NewStatusCmd(ctx)
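
The docs/cli.md changes above list what `deployment-policy reset` clears for every compartment: current batch number (back to 1), consecutive failure count, completed and failed node counts, and the stop flag. The sketch below illustrates that reset on an in-memory struct. `BatchState`, its field names, and `resetAll` are hypothetical stand-ins and do not claim to match the operator's real types.

```go
package main

import "fmt"

// BatchState is a hypothetical stand-in for per-compartment batch state.
type BatchState struct {
	CurrentBatch        int
	ConsecutiveFailures int
	CompletedNodes      int
	FailedNodes         int
	ShouldStop          bool
}

// Reset clears all counters and the stop flag, and restarts at batch 1.
func (b *BatchState) Reset() {
	*b = BatchState{CurrentBatch: 1}
}

// resetAll resets the batch state of every compartment in the map.
func resetAll(compartments map[string]*BatchState) {
	for _, state := range compartments {
		state.Reset()
	}
}

func main() {
	compartments := map[string]*BatchState{
		"default": {CurrentBatch: 4, ConsecutiveFailures: 2, CompletedNodes: 7, FailedNodes: 1, ShouldStop: true},
	}
	resetAll(compartments)
	fmt.Printf("%+v\n", *compartments["default"])
}
```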

docs/deployment_policy.md

Lines changed: 93 additions & 0 deletions
@@ -23,6 +23,8 @@ kind: DeploymentPolicy
 metadata:
   name: my-policy
 spec:
+  # Reset batch state automatically when rollout completes or spec version changes
+  resetBatchStateOnCompletion: true # default: true
   # Default applies to nodes that don't match any compartment
   default:
     budget:
@@ -240,6 +242,94 @@ compartments:

 ---

+## Batch State Reset
+
+When using progressive rollout strategies (linear, exponential), the operator tracks batch processing state per compartment — current batch number, consecutive failures, completed/failed node counts, etc. This state persists across reconciliations so the rollout can scale up progressively.
+
+However, when a rollout **completes** or a **spec version changes**, you typically want the next rollout to start fresh from batch 1 rather than continuing with scaled-up batch sizes. Batch state reset handles this automatically.
+
+### Auto-Reset Triggers
+
+Batch state is automatically reset when **either** of these events occurs (if configured):
+
+1. **Rollout completion** — When a Skyhook's status transitions to `Complete`
+2. **Spec version change** — When a package version changes in the Skyhook spec
+
+After reset, the next reconciliation starts from batch 1 with all counters cleared.
+
+### Configuration
+
+Auto-reset is controlled by two fields with a precedence hierarchy:
+
+| Field | Location | Description |
+|-------|----------|-------------|
+| `spec.resetBatchStateOnCompletion` | DeploymentPolicy | Default setting for all Skyhooks using this policy |
+| `spec.deploymentPolicyOptions.resetBatchStateOnCompletion` | Skyhook | Per-Skyhook override (takes precedence) |
+
+**Precedence order** (highest to lowest):
+1. Skyhook's `deploymentPolicyOptions.resetBatchStateOnCompletion`
+2. DeploymentPolicy's `resetBatchStateOnCompletion`
+3. Default: `true` (safe by default for new resources)
+
+### Examples
+
+**Enable auto-reset (default behavior for new policies)**:
+```yaml
+apiVersion: skyhook.nvidia.com/v1alpha1
+kind: DeploymentPolicy
+metadata:
+  name: my-policy
+spec:
+  resetBatchStateOnCompletion: true # Enabled by default
+  default:
+    budget:
+      percent: 25
+```
+
+**Disable auto-reset for a specific Skyhook** (override the policy):
+```yaml
+apiVersion: skyhook.nvidia.com/v1alpha1
+kind: Skyhook
+metadata:
+  name: my-skyhook
+spec:
+  deploymentPolicy: my-policy
+  deploymentPolicyOptions:
+    resetBatchStateOnCompletion: false # Override: keep batch state across rollouts
+```
+
+**Disable auto-reset at the policy level**:
+```yaml
+apiVersion: skyhook.nvidia.com/v1alpha1
+kind: DeploymentPolicy
+metadata:
+  name: preserve-state-policy
+spec:
+  resetBatchStateOnCompletion: false # All Skyhooks using this policy keep batch state
+```
+
+### Manual Reset
+
+You can also reset batch state manually using the CLI:
+
+```bash
+# Reset batch state for a specific Skyhook
+kubectl skyhook deployment-policy reset my-skyhook --confirm
+
+# Preview what would be reset (dry-run)
+kubectl skyhook deployment-policy reset my-skyhook --dry-run
+
+# The 'reset' command also resets batch state by default
+kubectl skyhook reset my-skyhook --confirm
+
+# To reset nodes only without resetting batch state
+kubectl skyhook reset my-skyhook --skip-batch-reset --confirm
+```
+
+See [CLI documentation](cli.md) for full command details.
+
+---
+
 ## Using with Skyhooks

 Reference a policy by name:
@@ -251,6 +341,8 @@ metadata:
   name: my-skyhook
 spec:
   deploymentPolicy: my-policy # References DeploymentPolicy
+  deploymentPolicyOptions: # Optional per-Skyhook overrides
+    resetBatchStateOnCompletion: true
   nodeSelectors:
     matchLabels:
       workload: gpu
@@ -262,6 +354,7 @@ spec:
 - DeploymentPolicy is **cluster-scoped** (not namespaced)
 - Each node is assigned to a compartment based on selectors
 - Nodes not matching any compartment use the `default` settings
+- `deploymentPolicyOptions` allows per-Skyhook overrides of policy settings

 ---
