Skip to content

Scale/phase1 cluster infra#455

Open
Ronkahn21 wants to merge 26 commits intoai-dynamo:mainfrom
Ronkahn21:scale/phase1-cluster-infra
Open

Scale/phase1 cluster infra#455
Ronkahn21 wants to merge 26 commits intoai-dynamo:mainfrom
Ronkahn21:scale/phase1-cluster-infra

Conversation

@Ronkahn21
Copy link
Contributor

@Ronkahn21 Ronkahn21 commented Feb 25, 2026

/kind feature

What type of PR is this?

Feature — Phase 1 of unified scale testing infrastructure

What this PR does / why we need it:

Introduces infra-manager, a unified CLI tool (operator/hack/infra-manager.py) that replaces the scattered collection of ad-hoc cluster provisioning scripts with a single, modular Python package for managing Grove's E2E and scale testing infrastructure.

Key changes:

  • CLI tool (infra-manager.py): Typer-based CLI with subcommands: create, delete, install, uninstall, setup for full cluster lifecycle management
  • Modular package (infra_manager/): Clean separation — cluster.py (EKS ops), components.py (addon installation), kwok.py (KWOK virtual node simulation), orchestrator.py (workflow coordination), config.py / constants.py (configuration)
  • KWOK integration: Simulate large-scale clusters with virtual nodes for scale testing without real GPU hardware
  • Observability stack: Automated Pyroscope installation for profiling during scale tests
  • Dependency management: Declarative dependencies.yaml with version pinning and auto-install of required tools (eksctl, kubectl, helm, kwok, etc.)
  • Legacy cleanup: Removed deprecated scripts (e2e-autoMNNVL/, e2e-cluster/, config-cluster.py) that are superseded by infra-manager
  • Integration tests: Shell-based CLI integration tests (hack/tests/test_cli_integration.sh)

Which issue(s) this PR fixes:

Part of #402

Special notes for your reviewer:

This is Phase 1 — infrastructure tooling only. Scale test scenarios (PodGangSet scaling, KWOK-based load tests, etc.) will come in subsequent PRs building on this foundation.

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

See operator/hack/README.md for full usage documentation.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Use rollout status instead of pod wait to handle helm upgrade races
- Make KWOK delete remove controller and increase timeout for large node counts
- Tolerate partial failures in KWOK controller install (FlowSchema)
- Extract constants, use tenacity retries, frozenset display fields
- Remove redundant open mode, consistent list building
- Replace sys.exit() in functions with RuntimeError

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace argparse opt-in flags with Typer opt-out (--skip-*) model.
Split monolithic ClusterConfig into K3dConfig, ComponentConfig, and
KwokConfig. Add frozen ActionFlags dataclass and resolve_config()
as single merge point for CLI > env > default precedence. Update
Makefile targets to match new default-on semantics.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Extract 1,583-line monolith into 8 focused modules:
- constants.py, config.py, utils.py, cluster.py,
  components.py, kwok.py, orchestrator.py, __init__.py
Entry point remains a thin CLI wrapper (~160 lines).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the scale/phase1-cluster-infra branch from d98731a to 2d2ed0b Compare February 25, 2026 22:34
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Rename e2e-scoped scripts to generic infra names since they
manage general cluster infrastructure (k3d, Kai, Grove, KWOK,
Pyroscope) useful beyond just E2E testing.

- e2e-cluster-manager.py -> infra-manager.py
- create-e2e-cluster.py -> create-cluster.py
- e2e_manager/ -> infra_manager/
- Update all imports, Makefile, Go refs, READMEs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…etup

Introduce `GroveInstallOptions` for deployment tuning, replacing legacy `ActionFlags` in new APIs. Split orchestrator logic into modular workflows: `run_e2e_setup` and `run_scale_setup`. Add `test-cli-integration.sh` for unmocked CLI validation. Update Makefile and CLI aliases for full backward compatibility.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove obsolete scale testing scripts and monitoring configurations, including JSON converters, dashboards, and related deployment utilities. These scripts are no longer relevant to the updated workflow and infrastructure setup.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Delete create-cluster.py (legacy monolith) and the old infra-manager.py
shim, both fully superseded by the CLI + infra_manager package. Rename
cli.py to infra-manager.py and update all references. Remove dead code
(ActionFlags, validate_flags, resolve_config, display_config,
resolve_bool_flag, legacy orchestrator.run) that only the old shim used.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove the `test-cli-integration.sh` script and its `Makefile` target (`test-cli-integration`). This script is no longer required due to updates in the infrastructure setup workflow.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 marked this pull request as ready for review February 27, 2026 22:42
Move environment variables from config-cluster.py to a top-level
section and document all 18 E2E_* vars grouped by config class.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a unified local/CI cluster infrastructure management workflow for Grove E2E + scale testing by adding a new infra-manager.py Typer CLI backed by a modular infra_manager Python package. It also wires in scale-testing capabilities (KWOK + Pyroscope) and enables exposing pprof when profiling is turned on.

Changes:

  • Add operator/hack/infra-manager.py CLI and the operator/hack/infra_manager/ package (k3d lifecycle, component installs, KWOK, orchestration, dependency config).
  • Add KWOK + Pyroscope support for scale testing and expose pprof endpoints/ports when profiling is enabled.
  • Update docs/Makefile/CI wiring and remove the legacy create-e2e-cluster.py script.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
operator/internal/controller/manager_test.go Updates profiling test expectations for the new default pprof bind address.
operator/internal/controller/manager.go Changes default pprof bind address to listen on all interfaces when enabled.
operator/hack/requirements.txt Updates Python dependency versions and adds tenacity; switches typer to a range.
operator/hack/infra_manager/utils.py Adds shared utilities (kubectl runner, registry helpers, helm overrides, prereq checks).
operator/hack/infra_manager/pyroscope-values.yaml Adds default Pyroscope resource overrides for scale runs.
operator/hack/infra_manager/orchestrator.py Implements end-to-end and scale setup workflows (parallelized steps).
operator/hack/infra_manager/kwok.py Adds KWOK controller install and batch creation/deletion of simulated nodes.
operator/hack/infra_manager/dependencies.yaml Extends dependency catalog with KWOK and Pyroscope versions and images.
operator/hack/infra_manager/constants.py Centralizes paths, defaults, dependency loading, and shared constants.
operator/hack/infra_manager/config.py Adds pydantic settings models for k3d/components/KWOK and install options.
operator/hack/infra_manager/components.py Adds installers for Kai/Grove/Pyroscope and related readiness helpers.
operator/hack/infra_manager/commands/uninstall_cmd.py Adds uninstall subcommands for Kai and Grove.
operator/hack/infra_manager/commands/setup_cmd.py Adds composite setup e2e and setup scale workflows.
operator/hack/infra_manager/commands/install_cmd.py Adds install subcommands for Kai/Grove/Pyroscope.
operator/hack/infra_manager/commands/delete_cmd.py Adds delete subcommands for k3d cluster and KWOK nodes.
operator/hack/infra_manager/commands/create_cmd.py Adds create subcommands for k3d cluster and KWOK nodes.
operator/hack/infra_manager/commands/init.py Initializes the CLI commands package.
operator/hack/infra_manager/cluster.py Implements k3d cluster ops, parallel image prepull, and topology labeling.
operator/hack/infra_manager/init.py Adds shared console/logging utilities, including thread-local buffered output.
operator/hack/infra-manager.py Introduces the top-level Typer CLI wiring subcommands together.
operator/hack/e2e-cluster/create-e2e-cluster.py Removes the legacy script replaced by infra-manager.
operator/hack/e2e-autoMNNVL/run_autoMNNVL_e2e_all.py Updates docs/paths to reference infra-manager and the new config script location.
operator/hack/e2e-autoMNNVL/run_autoMNNVL_e2e.py Updates docs/paths to reference infra-manager and the new config script location.
operator/hack/e2e-autoMNNVL/README.md Updates documentation to point to infra-manager and config-cluster.py.
operator/hack/config-cluster.py Updates docs and fixes operator working-directory resolution.
operator/hack/README.md Updates hack tooling docs to describe infra-manager and new env var model.
operator/e2e/setup/shared_cluster.go Tracks KWOK nodes separately and reports real vs simulated node counts.
operator/e2e/setup/k8s_clusters.go Updates developer guidance to use infra-manager for local k3d setup.
operator/charts/templates/service.yaml Adds optional Service port for pprof when profiling is enabled.
operator/charts/templates/deployment.yaml Adds optional pprof scrape annotations and container ports when profiling is enabled.
operator/Makefile Switches E2E cluster targets to infra-manager; adds scale-cluster up/down targets.
.github/actions/e2e-setup/action.yaml Updates CI dependency install path to operator/hack/requirements.txt.
Comments suppressed due to low confidence (1)

operator/hack/requirements.txt:30

  • requirements.txt claims a minimum Python version of 3.8+, but the new infra_manager package uses PEP 604 union annotations like str | None throughout. Those annotations are not reliably supported on Python 3.8/3.9 when libraries (e.g. Pydantic) evaluate type hints, so this can break at runtime for users following the stated minimum version. Either bump the documented minimum Python version (and CI) to 3.10+ or refactor annotations to typing.Optional[...] / typing.Union[...] to preserve 3.8 compatibility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Add PprofBindAddress field to DebuggingConfiguration with a secure
default of 127.0.0.1:2753 (loopback-only). Wire through defaults,
deepcopy, manager, and Helm templates. Derive pprof port from the
bind address in Helm via split instead of hardcoding 2753.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add helm to prerequisites when grove is installed. Rename
_apply_kai_queues to apply_kai_queues (public API). Simplify
prepull_images to delegate to prepull_image_groups. Wire
pprof bind address override in scale setup helm overrides.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add early-return for total<=0 in create_nodes. Call
_wait_kwok_nodes_ready at end of create_nodes. Guard empty
worker_nodes in apply_topology_labels. Exclude KWOK nodes
from topology labeling.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add trailing newline to pyroscope-values.yaml. Add __all__
exports to commands/__init__.py. Update minimum Python version
from 3.8+ to 3.10+ in requirements.txt.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove all subcommands (create, delete, install, uninstall) from `infra_manager/commands`, as they are no longer required. Update all impacted references and documentation accordingly.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Introduce a new `delete` subcommand under infra_manager for handling infrastructure resource deletion. Added `k3d-cluster` and `kwok-nodes` delete operations. Updated CLI, imports, and documentation accordingly. Removed unused `_node_sort_key` helper.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants