Commits (26)
- `86d6daf` wip (Ronkahn21, Feb 24, 2026)
- `71d1767` refactor: pythonic polish and bug fixes for e2e cluster script (Ronkahn21, Feb 24, 2026)
- `c4952e5` refactor: switch e2e cluster script to Typer with default-on model (Ronkahn21, Feb 25, 2026)
- `278a0d0` fix: add Typer bool workaround and fix typo in e2e-cluster-manager (Ronkahn21, Feb 25, 2026)
- `6c790c4` refactor: split e2e-cluster-manager into e2e_manager package (Ronkahn21, Feb 25, 2026)
- `55ed02d` revert all changes (Ronkahn21, Feb 25, 2026)
- `2d2ed0b` chore: restore Makefile from upstream main (Ronkahn21, Feb 25, 2026)
- `40b8152` chore: restore deploy-addons.sh from upstream main (Ronkahn21, Feb 26, 2026)
- `77990f8` refactor: rename e2e_manager to infra_manager (Ronkahn21, Feb 26, 2026)
- `9525a67` refactor: add GroveInstallOptions, simplify and expand E2E/workflow s… (Ronkahn21, Feb 26, 2026)
- `f9c0cc4` cli-integ-test: temporary commit (Ronkahn21, Feb 27, 2026)
- `f645017` cli-integ-test: temporary commit (Ronkahn21, Feb 27, 2026)
- `b7de6dd` remove: clean up deprecated scale testing scripts (Ronkahn21, Feb 27, 2026)
- `d441d07` cli-integ-test: temporary commit (Ronkahn21, Feb 27, 2026)
- `1d16a74` cli-integ-test: temporary commit (Ronkahn21, Feb 27, 2026)
- `b1f8efe` refactor: rename cli.py to infra-manager.py, remove legacy scripts (Ronkahn21, Feb 27, 2026)
- `676cbaa` remove: delete test-cli-integration.sh and its Makefile target (Ronkahn21, Feb 27, 2026)
- `f31efdb` docs: update env vars section in hack README (Ronkahn21, Feb 27, 2026)
- `32ad2b0` feat: add configurable pprof bind address (Ronkahn21, Mar 1, 2026)
- `1a80570` fix: helm prereq, rename private fn, dedup Docker (Ronkahn21, Mar 1, 2026)
- `bace76e` fix: add KWOK and topology safety guards (Ronkahn21, Mar 1, 2026)
- `0f01e27` chore: misc code quality fixes (Ronkahn21, Mar 1, 2026)
- `fd473f9` remove: delete all subcommands from infra_manager (Ronkahn21, Mar 1, 2026)
- `f9c912a` feat: add `delete` subcommand to infra_manager (Ronkahn21, Mar 1, 2026)
- `0fc467e` docs: add `pprofBindAddress` field to operator API reference (Ronkahn21, Mar 2, 2026)
- `640ba0a` chore: revert some changes (Ronkahn21, Mar 2, 2026)
2 changes: 1 addition & 1 deletion .github/actions/e2e-setup/action.yaml
@@ -67,7 +67,7 @@ runs:
shell: bash
run: |
echo "Installing Python dependencies..."
-pip3 install --break-system-packages -r operator/hack/e2e-cluster/requirements.txt
+pip3 install --break-system-packages -r operator/hack/requirements.txt

echo "Verifying Python dependencies..."
python3 -c "import docker; import sh; import typer; import pydantic; import rich; print('All dependencies installed successfully')"
15 changes: 13 additions & 2 deletions operator/Makefile
@@ -115,12 +115,12 @@ run-e2e:
# Pass E2E_CREATE_FLAGS to add CLI flags (e.g. --skip-kai --skip-topology --skip-prepull)
.PHONY: e2e-cluster-up
e2e-cluster-up:
-	@$(MODULE_HACK_DIR)/e2e-cluster/create-e2e-cluster.py $(E2E_CREATE_FLAGS)
+	@$(MODULE_HACK_DIR)/infra-manager.py setup e2e $(E2E_CREATE_FLAGS)

# Delete the k3d e2e test cluster
.PHONY: e2e-cluster-down
e2e-cluster-down:
-	@$(MODULE_HACK_DIR)/e2e-cluster/create-e2e-cluster.py --delete
+	@$(MODULE_HACK_DIR)/infra-manager.py delete k3d-cluster

# Full e2e test run: create k3d cluster, run tests, then cleanup
# Usage: make run-e2e-full [TEST_PATTERN=<pattern>]
@@ -135,6 +135,17 @@ run-e2e-full: e2e-cluster-up
@echo "> Tests passed, cleaning up cluster..."
@$(MAKE) e2e-cluster-down

+# Create a k3d cluster for scale testing (with Grove, Kai, KWOK, Pyroscope)
+# Pass SCALE_CREATE_FLAGS to add CLI flags (e.g. --kwok-nodes 500 --pcs-syncs 10)
+.PHONY: scale-cluster-up
+scale-cluster-up:
+	@$(MODULE_HACK_DIR)/infra-manager.py setup scale $(SCALE_CREATE_FLAGS)
+
+# Delete the k3d scale test cluster
+.PHONY: scale-cluster-down
+scale-cluster-down:
+	@$(MODULE_HACK_DIR)/infra-manager.py delete k3d-cluster

# Run autoMNNVL e2e tests (all 4 configurations: supported/unsupported x enabled/disabled)
# Creates a lightweight k3d cluster (2 workers, no Kai/topology), runs all configurations
# sequentially via config-cluster.py, then cleans up. Uses the same e2e-cluster-up target
21 changes: 21 additions & 0 deletions operator/charts/templates/deployment.yaml
@@ -14,6 +14,15 @@ spec:
metadata:
labels:
{{- include "operator.deployment.labels" . | nindent 8 }}
+      {{- if and .Values.config.debugging .Values.config.debugging.enableProfiling }}
+      annotations:
+        profiles.grafana.com/cpu.scrape: "true"
+        profiles.grafana.com/cpu.port: "2753"
+        profiles.grafana.com/memory.scrape: "true"
+        profiles.grafana.com/memory.port: "2753"
+        profiles.grafana.com/goroutine.scrape: "true"
+        profiles.grafana.com/goroutine.port: "2753"
+      {{- end }}
spec:
restartPolicy: Always
{{- if .Values.priorityClass.enabled }}
@@ -34,6 +43,18 @@ spec:
imagePullPolicy: {{ .Values.image.pullPolicy }}
args:
- --config=/etc/grove-operator/config/config.yaml
+          ports:
+            - name: metrics
+              containerPort: {{ required ".Values.config.server.metrics.port" .Values.config.server.metrics.port }}
+              protocol: TCP
+            - name: webhooks
+              containerPort: {{ required ".Values.config.server.webhooks.port" .Values.config.server.webhooks.port }}
+              protocol: TCP
+            {{- if and .Values.config.debugging .Values.config.debugging.enableProfiling }}
+            - name: pprof
+              containerPort: 2753
+              protocol: TCP
+            {{- end }}
{{- if .Values.config.server.healthProbes.enable }}
livenessProbe:
httpGet:
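For context, both the scrape annotations and the `pprof` container port in the hunks above are gated on the same values flag. A sketch of the values override that would turn them on — the `config.debugging.enableProfiling` path is what the template reads; this fragment is illustrative, not the chart's documented schema:

```yaml
# Hypothetical values.yaml override; the deployment template checks
# .Values.config.debugging.enableProfiling before emitting the Grafana
# scrape annotations and the pprof port (2753).
config:
  debugging:
    enableProfiling: true
```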
10 changes: 8 additions & 2 deletions operator/charts/templates/service.yaml
@@ -4,7 +4,7 @@ metadata:
name: {{ required ".Values.service.name is required" .Values.service.name }}
namespace: {{ .Release.Namespace }}
labels:
-    {{- include "operator.serviceaccount.labels" . | nindent 4 }}
+    {{- include "operator.service.labels" . | nindent 4 }}
spec:
type: {{ .Values.service.type }}
selector:
@@ -17,4 +17,10 @@ spec:
- name: webhooks
protocol: TCP
port: {{ required ".Values.config.server.webhooks.port" .Values.config.server.webhooks.port }}
-      targetPort: {{ required ".Values.config.server.webhooks.port" .Values.config.server.webhooks.port }}
+      targetPort: {{ required ".Values.config.server.webhooks.port" .Values.config.server.webhooks.port }}
+  {{- if and .Values.config.debugging .Values.config.debugging.enableProfiling }}
+    - name: pprof
+      protocol: TCP
+      port: 2753
+      targetPort: 2753
+  {{- end }}
6 changes: 3 additions & 3 deletions operator/e2e/setup/k8s_clusters.go
@@ -19,7 +19,7 @@
// The cluster must be created beforehand with Grove operator, Kai scheduler, and required
// test infrastructure already deployed. For local development with k3d, you can use:
//
-//	./operator/hack/e2e-cluster/create-e2e-cluster.py
+//	./operator/hack/infra-manager.py setup e2e
//
// This package only handles connecting to existing clusters - it does not create clusters.
package setup
@@ -56,7 +56,7 @@ const (
// 1. KUBECONFIG environment variable (if set)
// 2. Default kubeconfig at ~/.kube/config
//
-// For local development with k3d, run './operator/hack/e2e-cluster/create-e2e-cluster.py' first
+// For local development with k3d, run './operator/hack/infra-manager.py setup e2e' first
// to create a cluster and configure kubectl.
func getRestConfig() (*rest.Config, error) {
// Try KUBECONFIG environment variable first
@@ -72,7 +72,7 @@ func getRestConfig() (*rest.Config, error) {
// Try to load from kubeconfig file
if kubeconfigPath == "" {
return nil, fmt.Errorf("failed to get kubernetes config: no KUBECONFIG found and ~/.kube/config not accessible." +
"For local development, run './operator/hack/e2e-cluster/create-e2e-cluster.py' first")
"For local development, run './operator/hack/infra-manager.py setup e2e' first")
}

if _, err := os.Stat(kubeconfigPath); err != nil {
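The lookup order this file documents (the `KUBECONFIG` environment variable first, then `~/.kube/config`) can be sketched in Python for illustration; the helper name is hypothetical:

```python
import os

def resolve_kubeconfig():
    """Return a kubeconfig path using the same precedence as getRestConfig:
    the KUBECONFIG environment variable first, then ~/.kube/config."""
    path = os.environ.get("KUBECONFIG", "")
    if path:
        return path
    default = os.path.join(os.path.expanduser("~"), ".kube", "config")
    if os.path.isfile(default):
        return default
    # Mirrors the Go error path: no usable config was found.
    return None
```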
18 changes: 13 additions & 5 deletions operator/e2e/setup/shared_cluster.go
@@ -51,7 +51,7 @@ const (

// Environment variables for cluster configuration.
// The cluster must be created beforehand with Grove operator and Kai scheduler deployed.
-	// For local development with k3d, use: ./operator/hack/e2e-cluster/create-e2e-cluster.py
+	// For local development with k3d, use: ./operator/hack/infra-manager.py setup e2e

// EnvRegistryPort specifies the container registry port for test images (optional)
EnvRegistryPort = "E2E_REGISTRY_PORT"
@@ -94,6 +94,7 @@ type SharedClusterManager struct {
logger *utils.Logger
isSetup bool
workerNodes []string
+	kwokNodes     []string
registryPort string
cleanupFailed bool // Set to true if CleanupWorkloads fails, causing subsequent tests to fail
cleanupError string // The error message from the failed cleanup
@@ -123,7 +124,7 @@ func SharedCluster(logger *utils.Logger) *SharedClusterManager {
//
// For local development with k3d:
//
-//	./operator/hack/e2e-cluster/create-e2e-cluster.py
+//	./operator/hack/infra-manager.py setup e2e
//
// Optional environment variables:
// - E2E_REGISTRY_PORT (default: 5001, for pushing test images to local registry)
@@ -187,10 +188,16 @@ func (scm *SharedClusterManager) connectToCluster(ctx context.Context, testImage
}

scm.workerNodes = make([]string, 0)
+	scm.kwokNodes = make([]string, 0)
for _, node := range nodes.Items {
-		if _, isServer := node.Labels["node-role.kubernetes.io/control-plane"]; !isServer {
-			scm.workerNodes = append(scm.workerNodes, node.Name)
+		if _, isServer := node.Labels["node-role.kubernetes.io/control-plane"]; isServer {
+			continue
+		}
+		if node.Annotations["kwok.x-k8s.io/node"] == "fake" {
+			scm.kwokNodes = append(scm.kwokNodes, node.Name)
+			continue
+		}
+		scm.workerNodes = append(scm.workerNodes, node.Name)
}

// Start node monitoring to handle unhealthy k3d nodes during test execution.
@@ -204,7 +211,8 @@
scm.logger.Info("ℹ️ Test run complete - cluster preserved for inspection or reuse")
}

-	scm.logger.Infof("✅ Connected to cluster with %d worker nodes", len(scm.workerNodes))
+	scm.logger.Infof("✅ Connected to cluster with %d worker nodes (%d KWOK simulated)",
+		len(scm.workerNodes), len(scm.kwokNodes))
scm.isSetup = true
return nil
}
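The node bookkeeping added in this hunk can be sketched standalone in Python — a simplified mirror of the Go loop, with plain dicts standing in for the Kubernetes node objects:

```python
def classify_nodes(nodes):
    """Split nodes into worker and KWOK-simulated lists.

    `nodes` is a list of dicts with "name", "labels", and "annotations",
    mirroring the fields the Go code reads from the Kubernetes API.
    """
    workers, kwok = [], []
    for node in nodes:
        # Control-plane nodes are neither workers nor KWOK nodes.
        if "node-role.kubernetes.io/control-plane" in node["labels"]:
            continue
        # KWOK marks simulated nodes with this annotation.
        if node["annotations"].get("kwok.x-k8s.io/node") == "fake":
            kwok.append(node["name"])
            continue
        workers.append(node["name"])
    return workers, kwok
```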
135 changes: 79 additions & 56 deletions operator/hack/README.md
@@ -2,88 +2,111 @@

This directory contains utility scripts for Grove operator development and testing.

-## Python Scripts
+## Directory Structure

-### create-e2e-cluster.py
+```
+hack/
+├── infra-manager.py      # Primary CLI for cluster infrastructure management
+├── config-cluster.py     # Declarative cluster configuration (fake GPU, MNNVL)
+├── requirements.txt      # Python dependencies
+├── infra_manager/        # Python package with modular cluster management
+│   ├── __init__.py
+│   ├── cluster.py        # k3d cluster operations
+│   ├── components.py     # Kai, Grove, Pyroscope installation
+│   ├── config.py         # Configuration models
+│   ├── constants.py      # Constants and dependency loading
+│   ├── kwok.py           # KWOK simulated node management
+│   ├── orchestrator.py   # Workflow orchestration
+│   ├── utils.py          # Shared utilities
+│   ├── dependencies.yaml # Centralized dependency versions
+│   └── pyroscope-values.yaml # Pyroscope Helm values
+├── e2e-autoMNNVL/        # Auto-MNNVL E2E test runners
+├── kind/                 # Kind cluster configuration
+├── build-operator.sh     # Build operator image
+├── build-initc.sh        # Build init container image
+├── docker-build.sh       # Docker build helper
+├── deploy.sh             # Deploy operator
+├── deploy-addons.sh      # Deploy addon components
+├── prepare-charts.sh     # Prepare Helm charts
+├── kind-up.sh            # Create Kind cluster
+└── kind-down.sh          # Delete Kind cluster
+```

-Creates and configures a k3d cluster for E2E testing with **parallel image pre-pulling** for faster cluster startup.
+## Python Scripts

-**Features:**
-- ✨ **Parallel image pre-pulling** - Pulls 7 Kai Scheduler images in parallel (~45s instead of 3.5min)
-- 🎨 **Beautiful terminal output** - Progress bars and colored status messages
-- ⚡ **Fast cluster creation** - Images are pre-loaded into local registry
-- 🔧 **Configurable** - All settings via environment variables
-- 🛡️ **Type-safe** - Pydantic models with validation
+### infra-manager.py (Primary)
+
+Unified CLI for Grove infrastructure management. Delegates to the `infra_manager` package.

**Installation:**

```bash
-# Install Python dependencies (one-time setup)
-pip3 install -r requirements.txt
-
-# Or using a virtual environment (recommended)
-python3 -m venv venv
-source venv/bin/activate  # On Windows: venv\Scripts\activate
-pip install -r requirements.txt
+pip3 install -r hack/requirements.txt
```

**Usage:**

```bash
-# Create a cluster with all components (includes image pre-pulling)
-./hack/e2e-cluster/create-e2e-cluster.py
+# Full e2e setup
+./hack/infra-manager.py setup e2e

# View all options
-./hack/e2e-cluster/create-e2e-cluster.py --help
+./hack/infra-manager.py --help

# Delete the cluster
-./hack/e2e-cluster/create-e2e-cluster.py --delete
+./hack/infra-manager.py delete k3d-cluster

# Skip specific components
-./hack/e2e-cluster/create-e2e-cluster.py --skip-grove
-./hack/e2e-cluster/create-e2e-cluster.py --skip-kai
+./hack/infra-manager.py setup e2e --skip-grove
+./hack/infra-manager.py setup e2e --skip-kai --skip-prepull
+
+# Scale test setup with KWOK simulated nodes
+./hack/infra-manager.py setup scale --kwok-nodes 1000

-# Skip image pre-pulling (faster script start, but slower cluster startup)
-./hack/e2e-cluster/create-e2e-cluster.py --skip-prepull
+# Install individual components
+./hack/infra-manager.py install grove --profiling
+./hack/infra-manager.py install pyroscope
```

-**Environment Variables:**
+### config-cluster.py

-All configuration can be overridden via environment variables:
+Declarative configuration for an existing E2E cluster. Supports fake GPU operator
+and auto-MNNVL toggle.

```bash
-export E2E_CLUSTER_NAME=my-cluster
-export E2E_WORKER_NODES=50
-export E2E_KAI_VERSION=v0.14.0
-./hack/e2e-cluster/create-e2e-cluster.py
+./hack/config-cluster.py --fake-gpu=yes --auto-mnnvl=enabled
```

-Available variables:
-- `E2E_CLUSTER_NAME` - Cluster name (default: shared-e2e-test-cluster)
-- `E2E_REGISTRY_PORT` - Registry port (default: 5001)
-- `E2E_API_PORT` - Kubernetes API port (default: 6560)
-- `E2E_LB_PORT` - Load balancer port mapping (default: 8090:80)
-- `E2E_WORKER_NODES` - Number of worker nodes (default: 30)
-- `E2E_WORKER_MEMORY` - Worker node memory (default: 150m)
-- `E2E_K3S_IMAGE` - K3s image (default: rancher/k3s:v1.33.5-k3s1)
-- `E2E_KAI_VERSION` - Kai Scheduler version (default: v0.13.0-rc1)
-- `E2E_MAX_RETRIES` - Max cluster creation retries (default: 3)
-- `E2E_SKAFFOLD_PROFILE` - Skaffold profile for Grove (default: topology-test)
-
-**Image Pre-Pulling:**
-
-The script pre-pulls the following Kai Scheduler images in parallel before installation:
-- `ghcr.io/nvidia/kai-scheduler/admission`
-- `ghcr.io/nvidia/kai-scheduler/binder`
-- `ghcr.io/nvidia/kai-scheduler/operator`
-- `ghcr.io/nvidia/kai-scheduler/podgroupcontroller`
-- `ghcr.io/nvidia/kai-scheduler/podgrouper`
-- `ghcr.io/nvidia/kai-scheduler/queuecontroller`
-- `ghcr.io/nvidia/kai-scheduler/scheduler`
-
-**Performance:**
-- Without pre-pull: ~3-7 minutes for images to pull during pod startup
-- With pre-pull: ~45 seconds for parallel pre-pull, then instant pod startup
+### Environment Variables
+
+All configuration can be overridden via `E2E_*` environment variables (used by `infra-manager.py`):
+
+**Cluster (K3dConfig):**
+
+- `E2E_CLUSTER_NAME` - Cluster name (default: `shared-e2e-test-cluster`)
+- `E2E_REGISTRY_PORT` - Registry port (default: `5001`)
+- `E2E_API_PORT` - Kubernetes API port (default: `6560`)
+- `E2E_LB_PORT` - Load balancer port mapping (default: `8090:80`)
+- `E2E_WORKER_NODES` - Number of worker nodes (default: `30`)
+- `E2E_WORKER_MEMORY` - Memory per worker node (default: `150m`)
+- `E2E_K3S_IMAGE` - K3s container image (default: `rancher/k3s:v1.33.5-k3s1`)
+- `E2E_MAX_RETRIES` - Max retries for cluster operations (default: `3`)
+
+**Components (ComponentConfig):**
+
+- `E2E_KAI_VERSION` - Kai Scheduler version (default: from `dependencies.yaml`)
+- `E2E_SKAFFOLD_PROFILE` - Skaffold profile for Grove (default: `topology-test`)
+- `E2E_GROVE_NAMESPACE` - Grove operator namespace (default: `grove-system`)
+- `E2E_REGISTRY` - Container registry override (default: none)
+
+**KWOK / Observability (KwokConfig):**
+
+- `E2E_KWOK_NODES` - Number of KWOK simulated nodes (default: none)
+- `E2E_KWOK_BATCH_SIZE` - Batch size for KWOK node creation (default: `150`)
+- `E2E_KWOK_NODE_CPU` - CPU capacity per KWOK node (default: `64`)
+- `E2E_KWOK_NODE_MEMORY` - Memory capacity per KWOK node (default: `512Gi`)
+- `E2E_KWOK_MAX_PODS` - Max pods per KWOK node (default: `110`)
+- `E2E_PYROSCOPE_NS` - Pyroscope namespace (default: `pyroscope`)

## Shell Scripts
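The `E2E_*` variables documented in the README above are consumed by configuration models. As a rough illustration of the pattern — the real models live in `infra_manager/config.py` and use Pydantic; this dataclass sketch covers only three assumed fields:

```python
import os
from dataclasses import dataclass

@dataclass
class K3dConfig:
    """Illustrative subset of cluster settings, filled from E2E_* env vars."""
    cluster_name: str = "shared-e2e-test-cluster"
    registry_port: int = 5001
    worker_nodes: int = 30

    @classmethod
    def from_env(cls):
        # Each field falls back to its default when the variable is unset.
        return cls(
            cluster_name=os.environ.get("E2E_CLUSTER_NAME", cls.cluster_name),
            registry_port=int(os.environ.get("E2E_REGISTRY_PORT", cls.registry_port)),
            worker_nodes=int(os.environ.get("E2E_WORKER_NODES", cls.worker_nodes)),
        )
```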
operator/hack/config-cluster.py
@@ -18,7 +18,7 @@
"""config-cluster.py - Declarative configuration for an existing E2E cluster.

This script applies or removes configuration on top of a cluster that was
-already created by create-e2e-cluster.py. It is **idempotent**: running it
+already created by infra-manager.py. It is **idempotent**: running it
twice with the same flags is a no-op.

Supported configuration axes:
@@ -367,7 +367,7 @@ def main() -> None:

# Change to the operator directory (helm chart path is relative)
script_dir = os.path.dirname(os.path.abspath(__file__))
-    operator_dir = os.path.join(script_dir, "..", "..")
+    operator_dir = os.path.join(script_dir, "..")
os.chdir(operator_dir)

# 1. Fake GPU operator
6 changes: 3 additions & 3 deletions operator/hack/e2e-autoMNNVL/README.md
@@ -49,8 +49,8 @@ make run-e2e-mnnvl-full
|--------|-------------|
| `run_autoMNNVL_e2e_all.py` | Run all 4 configurations sequentially (expects existing cluster) |
| `run_autoMNNVL_e2e.py` | Run a single configuration (configure + test, expects existing cluster) |
-| `../e2e-cluster/create-e2e-cluster.py` | Create the k3d cluster and deploy Grove operator |
-| `../e2e-cluster/config-cluster.py` | Declaratively configure fake GPU + MNNVL on an existing cluster |
+| `../infra-manager.py` | Unified CLI for cluster infrastructure management |
+| `../config-cluster.py` | Declaratively configure fake GPU + MNNVL on an existing cluster |

## Usage Examples

@@ -73,7 +73,7 @@ python3 ./hack/e2e-autoMNNVL/run_autoMNNVL_e2e_all.py
python3 ./hack/e2e-autoMNNVL/run_autoMNNVL_e2e.py --fake-gpu=yes --auto-mnnvl=enabled

# 5. Configure an existing cluster directly (without running tests)
-python3 ./hack/e2e-cluster/config-cluster.py --fake-gpu=yes --auto-mnnvl=enabled
+python3 ./hack/config-cluster.py --fake-gpu=yes --auto-mnnvl=enabled

# 6. Delete the cluster
make e2e-cluster-down