Open
Conversation
- Use rollout status instead of pod wait to handle helm upgrade races - Make KWOK delete remove controller and increase timeout for large node counts - Tolerate partial failures in KWOK controller install (FlowSchema) - Extract constants, use tenacity retries, frozenset display fields - Remove redundant open mode, consistent list building - Replace sys.exit() in functions with RuntimeError Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace argparse opt-in flags with Typer opt-out (--skip-*) model. Split monolithic ClusterConfig into K3dConfig, ComponentConfig, and KwokConfig. Add frozen ActionFlags dataclass and resolve_config() as single merge point for CLI > env > default precedence. Update Makefile targets to match new default-on semantics. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Extract 1,583-line monolith into 8 focused modules: - constants.py, config.py, utils.py, cluster.py, components.py, kwok.py, orchestrator.py, __init__.py Entry point remains a thin CLI wrapper (~160 lines). Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
d98731a to
2d2ed0b
Compare
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Rename e2e-scoped scripts to generic infra names since they manage general cluster infrastructure (k3d, Kai, Grove, KWOK, Pyroscope) useful beyond just E2E testing. - e2e-cluster-manager.py -> infra-manager.py - create-e2e-cluster.py -> create-cluster.py - e2e_manager/ -> infra_manager/ - Update all imports, Makefile, Go refs, READMEs Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…etup Introduce `GroveInstallOptions` for deployment tuning, replacing legacy `ActionFlags` in new APIs. Split orchestrator logic into modular workflows: `run_e2e_setup` and `run_scale_setup`. Add `test-cli-integration.sh` for unmocked CLI validation. Update Makefile and CLI aliases for full backward compatibility. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove obsolete scale testing scripts and monitoring configurations, including JSON converters, dashboards, and related deployment utilities. These scripts are no longer relevant to the updated workflow and infrastructure setup. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Delete create-cluster.py (legacy monolith) and the old infra-manager.py shim, both fully superseded by the CLI + infra_manager package. Rename cli.py to infra-manager.py and update all references. Remove dead code (ActionFlags, validate_flags, resolve_config, display_config, resolve_bool_flag, legacy orchestrator.run) that only the old shim used. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove the `test-cli-integration.sh` script and its `Makefile` target (`test-cli-integration`). This script is no longer required due to updates in the infrastructure setup workflow. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Move environment variables from config-cluster.py to a top-level section and document all 18 E2E_* vars grouped by config class. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a unified local/CI cluster infrastructure management workflow for Grove E2E + scale testing by adding a new infra-manager.py Typer CLI backed by a modular infra_manager Python package. It also wires in scale-testing capabilities (KWOK + Pyroscope) and enables exposing pprof when profiling is turned on.
Changes:
- Add
operator/hack/infra-manager.pyCLI and theoperator/hack/infra_manager/package (k3d lifecycle, component installs, KWOK, orchestration, dependency config). - Add KWOK + Pyroscope support for scale testing and expose pprof endpoints/ports when profiling is enabled.
- Update docs/Makefile/CI wiring and remove the legacy
create-e2e-cluster.pyscript.
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/internal/controller/manager_test.go | Updates profiling test expectations for the new default pprof bind address. |
| operator/internal/controller/manager.go | Changes default pprof bind address to listen on all interfaces when enabled. |
| operator/hack/requirements.txt | Updates Python dependency versions and adds tenacity; switches typer to a range. |
| operator/hack/infra_manager/utils.py | Adds shared utilities (kubectl runner, registry helpers, helm overrides, prereq checks). |
| operator/hack/infra_manager/pyroscope-values.yaml | Adds default Pyroscope resource overrides for scale runs. |
| operator/hack/infra_manager/orchestrator.py | Implements end-to-end and scale setup workflows (parallelized steps). |
| operator/hack/infra_manager/kwok.py | Adds KWOK controller install and batch creation/deletion of simulated nodes. |
| operator/hack/infra_manager/dependencies.yaml | Extends dependency catalog with KWOK and Pyroscope versions and images. |
| operator/hack/infra_manager/constants.py | Centralizes paths, defaults, dependency loading, and shared constants. |
| operator/hack/infra_manager/config.py | Adds pydantic settings models for k3d/components/KWOK and install options. |
| operator/hack/infra_manager/components.py | Adds installers for Kai/Grove/Pyroscope and related readiness helpers. |
| operator/hack/infra_manager/commands/uninstall_cmd.py | Adds uninstall subcommands for Kai and Grove. |
| operator/hack/infra_manager/commands/setup_cmd.py | Adds composite setup e2e and setup scale workflows. |
| operator/hack/infra_manager/commands/install_cmd.py | Adds install subcommands for Kai/Grove/Pyroscope. |
| operator/hack/infra_manager/commands/delete_cmd.py | Adds delete subcommands for k3d cluster and KWOK nodes. |
| operator/hack/infra_manager/commands/create_cmd.py | Adds create subcommands for k3d cluster and KWOK nodes. |
| operator/hack/infra_manager/commands/init.py | Initializes the CLI commands package. |
| operator/hack/infra_manager/cluster.py | Implements k3d cluster ops, parallel image prepull, and topology labeling. |
| operator/hack/infra_manager/init.py | Adds shared console/logging utilities, including thread-local buffered output. |
| operator/hack/infra-manager.py | Introduces the top-level Typer CLI wiring subcommands together. |
| operator/hack/e2e-cluster/create-e2e-cluster.py | Removes the legacy script replaced by infra-manager. |
| operator/hack/e2e-autoMNNVL/run_autoMNNVL_e2e_all.py | Updates docs/paths to reference infra-manager and the new config script location. |
| operator/hack/e2e-autoMNNVL/run_autoMNNVL_e2e.py | Updates docs/paths to reference infra-manager and the new config script location. |
| operator/hack/e2e-autoMNNVL/README.md | Updates documentation to point to infra-manager and config-cluster.py. |
| operator/hack/config-cluster.py | Updates docs and fixes operator working-directory resolution. |
| operator/hack/README.md | Updates hack tooling docs to describe infra-manager and new env var model. |
| operator/e2e/setup/shared_cluster.go | Tracks KWOK nodes separately and reports real vs simulated node counts. |
| operator/e2e/setup/k8s_clusters.go | Updates developer guidance to use infra-manager for local k3d setup. |
| operator/charts/templates/service.yaml | Adds optional Service port for pprof when profiling is enabled. |
| operator/charts/templates/deployment.yaml | Adds optional pprof scrape annotations and container ports when profiling is enabled. |
| operator/Makefile | Switches E2E cluster targets to infra-manager; adds scale-cluster up/down targets. |
| .github/actions/e2e-setup/action.yaml | Updates CI dependency install path to operator/hack/requirements.txt. |
Comments suppressed due to low confidence (1)
operator/hack/requirements.txt:30
requirements.txtclaims a minimum Python version of 3.8+, but the newinfra_managerpackage uses PEP 604 union annotations likestr | Nonethroughout. Those annotations are not reliably supported on Python 3.8/3.9 when libraries (e.g. Pydantic) evaluate type hints, so this can break at runtime for users following the stated minimum version. Either bump the documented minimum Python version (and CI) to 3.10+ or refactor annotations totyping.Optional[...]/typing.Union[...]to preserve 3.8 compatibility.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add PprofBindAddress field to DebuggingConfiguration with a secure default of 127.0.0.1:2753 (loopback-only). Wire through defaults, deepcopy, manager, and Helm templates. Derive pprof port from the bind address in Helm via split instead of hardcoding 2753. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add helm to prerequisites when grove is installed. Rename _apply_kai_queues to apply_kai_queues (public API). Simplify prepull_images to delegate to prepull_image_groups. Wire pprof bind address override in scale setup helm overrides. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add early-return for total<=0 in create_nodes. Call _wait_kwok_nodes_ready at end of create_nodes. Guard empty worker_nodes in apply_topology_labels. Exclude KWOK nodes from topology labeling. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add trailing newline to pyroscope-values.yaml. Add __all__ exports to commands/__init__.py. Update minimum Python version from 3.8+ to 3.10+ in requirements.txt. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove all subcommands (create, delete, install, uninstall) from `infra_manager/commands`, as they are no longer required. Update all impacted references and documentation accordingly. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Introduce a new `delete` subcommand under infra_manager for handling infrastructure resource deletion. Added `k3d-cluster` and `kwok-nodes` delete operations. Updated CLI, imports, and documentation accordingly. Removed unused `_node_sort_key` helper. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
/kind feature
What type of PR is this?
Feature — Phase 1 of unified scale testing infrastructure
What this PR does / why we need it:
Introduces
infra-manager, a unified CLI tool (operator/hack/infra-manager.py) that replaces the scattered collection of ad-hoc cluster provisioning scripts with a single, modular Python package for managing Grove's E2E and scale testing infrastructure.Key changes:
infra-manager.py): Typer-based CLI with subcommands:create,delete,install,uninstall,setupfor full cluster lifecycle managementinfra_manager/): Clean separation —cluster.py(EKS ops),components.py(addon installation),kwok.py(KWOK virtual node simulation),orchestrator.py(workflow coordination),config.py/constants.py(configuration)dependencies.yamlwith version pinning and auto-install of required tools (eksctl, kubectl, helm, kwok, etc.)e2e-autoMNNVL/,e2e-cluster/,config-cluster.py) that are superseded by infra-managerhack/tests/test_cli_integration.sh)Which issue(s) this PR fixes:
Part of #402
Special notes for your reviewer:
This is Phase 1 — infrastructure tooling only. Scale test scenarios (PodGangSet scaling, KWOK-based load tests, etc.) will come in subsequent PRs building on this foundation.
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: