@athreesh athreesh commented Jan 16, 2026

Summary

This PR implements the kubectl-grove CLI with comprehensive commands for managing Grove AI inference workloads on Kubernetes. It includes a complete migration of the CLI to a standalone module and replaces the Bubbletea TUI with the Arborist (tview-based) TUI.

Note: This PR is in Draft mode for prototyping the Grove CLI. See split strategy below if needed for merge.

What's Included

CLI Commands

| Command                   | Description                                                        |
|---------------------------|--------------------------------------------------------------------|
| kubectl grove status      | Show PodCliqueSet status with clique/gang information              |
| kubectl grove topology    | Visualize pod placement across topology domains (with watch mode)  |
| kubectl grove health      | Gang health dashboard with threshold monitoring                    |
| kubectl grove metrics     | Scrape and display inference metrics from pods                     |
| kubectl grove tui         | Interactive Arborist TUI for hierarchical resource browsing        |
| kubectl grove diagnostics | Collect diagnostics for troubleshooting                            |

Key Features

  • Standalone cli-plugin module: Migrated from operator/cmd/kubectl-grove/ to cli-plugin/
  • Arborist TUI: tview-based hierarchical resource viewer replacing Bubbletea
  • Watch Mode: Real-time updates for topology and health commands
  • Shell Completion: bash, zsh, and fish support
  • Namespace Resolution: Uses kubeconfig current namespace as default
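
To illustrate the namespace resolution bullet, here is a minimal Go sketch of one plausible precedence order. The function name `resolveNamespace` and the exact order (explicit `-n` flag, then the kubeconfig current-context namespace, then `default`) are assumptions for illustration; the PR text only states that the kubeconfig namespace is used as the default.

```go
package main

import "fmt"

// resolveNamespace picks the namespace a command should operate on.
// Hypothetical helper: the precedence below is an assumption, not taken
// from the PR's code.
//   1. an explicit -n/--namespace flag value
//   2. the namespace of the current kubeconfig context
//   3. the literal "default"
func resolveNamespace(flagValue, kubeconfigNamespace string) string {
	if flagValue != "" {
		return flagValue
	}
	if kubeconfigNamespace != "" {
		return kubeconfigNamespace
	}
	return "default"
}

func main() {
	// No flag given: fall back to the kubeconfig context namespace.
	fmt.Println(resolveNamespace("", "production"))
	// Flag given: it wins over the kubeconfig value.
	fmt.Println(resolveNamespace("staging", "production"))
	// Neither set: use "default".
	fmt.Println(resolveNamespace("", ""))
}
```

In a real plugin the second argument would typically come from client-go, for example the `Namespace()` method of a `clientcmd.ClientConfig`; that wiring is left out here to keep the sketch dependency-free.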

PR Size Breakdown (~62K lines)

| Category               | Lines | % of PR | Notes                                   |
|------------------------|-------|---------|-----------------------------------------|
| Test files (*_test.go) | ~29K  | 47%     | Comprehensive test coverage             |
| cli-plugin/ (total)    | ~15K  | 24%     | New CLI module (includes ~7K tests)     |
| operator/ (non-test)   | ~12K  | 19%     | Operator code changes                   |
| docs/                  | ~2.8K | 5%      | User guide, design docs, API reference  |
| CRD YAML               | ~1.7K | 3%      | Generated CRD manifests                 |
| go.mod/go.sum          | ~1.5K | 2%      | Dependency lockfiles                    |

Key insight: Almost half (~29K) is test code. Actual implementation is ~30K lines.


Split Strategy (if needed for merge)

This PR can be split into 2-3 smaller PRs:

PR 1: cli-plugin migration (~15K lines)

  • New standalone CLI module at cli-plugin/
  • All commands: status, topology, health, metrics, generate, plan, compare
  • Arborist TUI
  • AIConfigurator integration

PR 2: Operator changes (~12K lines)

  • API changes and webhook validation
  • Topology constraints
  • ClusterTopology CRD

PR 3: Tests & Docs (~32K lines)

  • All *_test.go files
  • Documentation updates (user guide, design docs)
  • Can be reviewed/merged last

Review Tips

  • Skip generated files: CRDs, go.sum, zz_generated.*
  • 74 commits total: Can review by commit for context
  • Tests validate behavior: Review implementation first, tests last

Example Usage

# Show status for a PodCliqueSet
kubectl grove status my-inference -n production

# Watch topology changes in real-time
kubectl grove topology my-inference --watch

# Generate manifests for a disaggregated deployment
kubectl grove generate \
  --model QWEN3_32B \
  --system h200_sxm \
  --total-gpus 8 \
  --backend sglang \
  --isl 2048 --osl 512 \
  --ttft 200 --tpot 50 \
  --save-dir ./manifests

# Launch interactive TUI
kubectl grove tui -n default

# Enable shell completion
source <(kubectl grove completion bash)

Test plan

  • Unit tests for all commands pass (cd cli-plugin && go test ./...)
  • Integration tested on Kind cluster with sample PodCliqueSets
  • Verified all commands work with real Grove CRDs
  • Watch mode tested for topology and health commands
  • Arborist TUI navigation tested

Related Issues

Closes #329 (kubectl grove status)
Closes #330 (kubectl grove topology)
Closes #331 (kubectl grove generate)
Closes #332 (kubectl grove plan)
Closes #333 (kubectl grove health)

🤖 Generated with Claude Code

gflarity and others added 14 commits January 14, 2026 13:33
Comprehensive design document for transforming Arborist from a basic
diagnostics tool into a full-featured CLI for Grove operations.

Key features planned:
- P0: arborist status (match RBG), arborist generate (AIC integration)
- P1: arborist topology (visualization), arborist health (gang monitoring)
- P2: arborist compare (plan vs actual), arborist metrics (Prometheus)
- P2+: arborist tui (interactive terminal UI)

The strategy is to leapfrog RBG by building observability features they
don't have, leveraging Grove's unique data (PlacementScore, ClusterTopology,
TerminationDelay countdown).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Major changes:
- Rename operator/cmd/arborist/ → operator/cmd/kubectl-grove/
- Delete empty cli-plugin/ directory (was placeholder only)
- Update all references from arborist to kubectl-grove
- Update requirements doc with PM decisions:
  - CLI naming: kubectl grove (kubectl plugin)
  - P0 priority: Parallel (status + topology together)
  - Plan storage: ConfigMap with grove.io/aic-plan label
  - TUI priority: Phase 2 (higher than originally planned)
  - Metrics: Direct pod scraping (no Prometheus dependency)

The kubectl-grove plugin will be Grove's answer to kubectl rbg,
with differentiating features like topology visualization and
PlacementScore display that RBG doesn't have.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ate, plan)

Add comprehensive kubectl plugin functionality for Grove:

- status: Show PodCliqueSet status with clique and gang information
- health: Gang health dashboard with threshold monitoring
- topology: Visualize pod placement across topology domains
- generate: Generate Grove manifests using AIConfigurator logic
- plan: Store, show, and diff deployment plans

Features:
- Watch mode for real-time updates (topology, health)
- ASCII visualization for topology tree
- Plan storage in ConfigMaps for GitOps workflows
- Comprehensive test coverage

Closes ai-dynamo#329, ai-dynamo#330, ai-dynamo#331, ai-dynamo#332, ai-dynamo#333

Co-Authored-By: Claude Opus 4.5 <[email protected]>
New commands:
- tui: Interactive terminal UI with Bubble Tea framework
  - 4 tab-switchable views: Hierarchy, Topology, Health, Help
  - Vim-style navigation (j/k, g/G, Enter to expand)
  - Real-time updates via K8s watch API

- metrics: Direct pod metrics scraping
  - Auto-detect inference engine (SGLang, vLLM, TRT-LLM)
  - Prometheus format parsing
  - Watch mode with trend indicators
  - JSON output support

- compare: Plan vs actual comparison
  - Compare configuration (replicas, GPUs, TP size)
  - Topology/placement score analysis
  - Auto-generate diagnosis and recommendations

Closes ai-dynamo#334, ai-dynamo#335, ai-dynamo#336

Co-Authored-By: Claude Opus 4.5 <[email protected]>
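
The "Prometheus format parsing" item in the metrics command can be sketched roughly as follows. This is a simplified, standard-library-only parser for the text exposition format, not the PR's implementation: it skips comment lines, keeps the label set as raw text, and ignores the optional timestamp. The type `Sample`, the functions `parseLine` and `parseMetrics`, and the metric names in `main` are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Sample is one parsed metric line (hypothetical type for this sketch).
type Sample struct {
	Name   string  // metric name without labels
	Labels string  // raw text between { and }, left unparsed
	Value  float64 // sample value; any trailing timestamp is ignored
}

// parseLine parses a single non-comment line: name[{labels}] value [timestamp].
func parseLine(line string) (Sample, error) {
	var s Sample
	rest := ""
	if open := strings.IndexByte(line, '{'); open >= 0 {
		end := strings.LastIndexByte(line, '}')
		if end < open {
			return s, fmt.Errorf("unterminated label set in %q", line)
		}
		s.Name, s.Labels, rest = line[:open], line[open+1:end], line[end+1:]
	} else {
		sp := strings.IndexAny(line, " \t")
		if sp < 0 {
			return s, fmt.Errorf("missing value in %q", line)
		}
		s.Name, rest = line[:sp], line[sp:]
	}
	fields := strings.Fields(rest)
	if len(fields) == 0 {
		return s, fmt.Errorf("missing value in %q", line)
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return s, fmt.Errorf("bad value in %q: %w", line, err)
	}
	s.Value = v
	return s, nil
}

// parseMetrics parses a whole scrape body, skipping blanks and # comments.
func parseMetrics(text string) ([]Sample, error) {
	var out []Sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		s, err := parseLine(line)
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, sc.Err()
}

func main() {
	body := "# TYPE example_requests_running gauge\n" +
		"example_requests_running{engine=\"demo\"} 3\n" +
		"example_queue_depth 7.5\n"
	samples, err := parseMetrics(body)
	if err != nil {
		panic(err)
	}
	for _, s := range samples {
		fmt.Printf("%s {%s} = %g\n", s.Name, s.Labels, s.Value)
	}
}
```

A production scraper would more likely rely on an existing exposition-format parser than hand-rolled string handling; the sketch only shows the shape of the data the command has to extract.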
- Add ANSI color support to topology visualization:
  - Green/Yellow/Red for pod status (Running/Pending/Failed)
  - Cyan [P] / Magenta [D] role badges for prefill/decode
  - Color-coded GPU utilization bars with Unicode blocks
  - Placement score coloring based on quality
  - Colored warnings with icons
  - Respects NO_COLOR env var and TTY detection

- Add k9s plugin configuration (cmd/kubectl-grove/k9s/):
  - plugins.yaml: 14 shortcuts for Grove commands
  - aliases.yaml: CRD shortcuts (:pcs, :pc, :pg, :ct)
  - README.md: Installation and usage guide

Co-Authored-By: Claude Opus 4.5 <[email protected]>
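
The "Respects NO_COLOR env var and TTY detection" item can be sketched as below. This is an assumption-laden illustration rather than the PR's code: `colorEnabled`, `isTerminal`, and `colorize` are hypothetical names, and the character-device check is a common standard-library heuristic, not a full terminal test.

```go
package main

import (
	"fmt"
	"os"
)

// colorEnabled decides whether ANSI colors should be emitted.
// Per the NO_COLOR convention, any non-empty value disables color.
func colorEnabled(noColor string, isTTY bool) bool {
	if noColor != "" {
		return false
	}
	return isTTY
}

// isTerminal reports whether f looks like a terminal. Heuristic only:
// it checks for a character device, which also matches e.g. /dev/null.
func isTerminal(f *os.File) bool {
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// colorize wraps s in an ANSI SGR sequence when enabled is true.
// code is the SGR color code: "32" green, "33" yellow, "31" red.
func colorize(s, code string, enabled bool) string {
	if !enabled {
		return s
	}
	return "\x1b[" + code + "m" + s + "\x1b[0m"
}

func main() {
	enabled := colorEnabled(os.Getenv("NO_COLOR"), isTerminal(os.Stdout))
	// Status colors follow the mapping named in the commit message.
	fmt.Println(colorize("Running", "32", enabled))
	fmt.Println(colorize("Pending", "33", enabled))
	fmt.Println(colorize("Failed", "31", enabled))
}
```

Keeping the decision in a pure function (`colorEnabled`) makes it easy to unit test without a real terminal.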
unmarshall (Collaborator) commented Jan 17, 2026

@athreesh, while this PR is still a draft, could you please use the existing Go module cli-plugin for any kubectl plugin instead of including it in the operator Go module? I would prefer to keep the operator module separate, with a clear semantic purpose. We do not follow a mono-repo concept; instead we use Go modules to keep concerns and dependencies clearly separated.

athreesh and others added 4 commits January 17, 2026 16:50
- Create new cli-plugin/ module as standalone kubectl-grove CLI
- Replace Bubbletea TUI with tview-based Arborist TUI
- Add topology visualization to TUI (press 't' to view)
- Add hierarchical navigation: Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod
- Copy diagnostics package to cli-plugin (can't import internal from another module)
- Fix test label selectors (grove.io/podcliqueset -> app.kubernetes.io/part-of)
- Fix GPU bar tests to use Unicode characters

New Arborist TUI features:
- Split-pane UI with resources table and events panel
- Drill-down navigation with Enter, back with Escape
- Tab to switch between resources/events panes
- Auto-refresh every 2 seconds
- Color-coded status (green=Running, yellow=Pending, red=Failed)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
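
The drill-down navigation described above (Enter to descend, Escape to go back, across Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod) can be modeled as a small stack. The sketch below is a hypothetical, UI-free model: `navStack` and its methods are illustrative names, and nothing here reflects the actual tview wiring in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// levels mirrors the hierarchy named in the commit message.
var levels = []string{"Forest", "PodCliqueSet", "PodGang", "PodClique", "Pod"}

// navStack records the resource selected at each level below Forest.
type navStack struct {
	selected []string
}

// Level returns the hierarchy level currently shown.
func (n *navStack) Level() string { return levels[len(n.selected)] }

// Enter drills into the named child (the Enter key). It returns false
// when already at the deepest level.
func (n *navStack) Enter(name string) bool {
	if len(n.selected) >= len(levels)-1 {
		return false
	}
	n.selected = append(n.selected, name)
	return true
}

// Back moves up one level (the Escape key). It returns false at the root.
func (n *navStack) Back() bool {
	if len(n.selected) == 0 {
		return false
	}
	n.selected = n.selected[:len(n.selected)-1]
	return true
}

// Breadcrumb renders the current path, e.g. for a title bar.
func (n *navStack) Breadcrumb() string {
	parts := []string{levels[0]}
	for i, name := range n.selected {
		parts = append(parts, levels[i+1]+"/"+name)
	}
	return strings.Join(parts, " > ")
}

func main() {
	nav := &navStack{}
	nav.Enter("my-inference") // hypothetical PodCliqueSet name
	nav.Enter("gang-0")       // hypothetical PodGang name
	fmt.Println(nav.Breadcrumb())
	nav.Back()
	fmt.Println(nav.Breadcrumb())
}
```

Separating the navigation state from the rendering code keeps key handlers thin: each handler only calls `Enter` or `Back` and then redraws the table for `Level()`.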
## AIConfigurator Integration (Rewritten)
- Properly execute aiconfigurator CLI as subprocess
- Parse generator_config.yaml output (not stdout)
- Transform to PodCliqueSet manifests
- Use JSON struct tags for sigs.k8s.io/yaml compatibility

## Bug Fixes
- Fix metrics port-forward: Use actual SPDY port-forwarding instead of
  falling back to direct pod IP (which doesn't work outside cluster)
- Fix silent failures in arborist_client.go: Log warnings instead of
  silently continuing on conversion errors
- Fix hardcoded image version: Use AIConfigurator-provided image as
  fallback, with :latest as final default

## New Features
- Namespace resolution from kubeconfig context
- Shell completion support (bash, zsh, fish)
- Topology watch mode (-w flag now works)

## Documentation
- Add docs/user-guide/cli.md comprehensive CLI reference
- Add docs/designs/cli-update-commands.md for rolling updates design

Co-Authored-By: Claude Opus 4.5 <[email protected]>
This PR is in Draft mode for prototyping Grove CLI.

## Line Breakdown (~62K total)

| Category              | Lines  | Notes                                    |
|-----------------------|--------|------------------------------------------|
| Test files (*_test.go)| ~29K   | ~47% of PR - comprehensive test coverage |
| cli-plugin/ (total)   | ~15K   | New CLI module (includes ~7K tests)      |
| operator/ (non-test)  | ~12K   | Operator code changes                    |
| docs/                 | ~2.8K  | User guide, design docs, API reference   |
| CRD YAML              | ~1.7K  | Generated CRD manifests                  |
| go.mod/go.sum         | ~1.5K  | Dependency lockfiles (machine-generated) |

Key insight: Almost half (~29K) is test code. Actual implementation
is ~30K lines across operator and cli-plugin.

## Split Strategy (if needed for merge)

Option to split into 2-3 smaller PRs:

1. **PR 1: cli-plugin migration** (~15K lines)
   - New standalone CLI module
   - Moved from operator/cmd/kubectl-grove/
   - Includes Arborist TUI, commands, AIC integration

2. **PR 2: Operator changes** (~12K lines)
   - API changes, webhook validation
   - Topology constraints
   - ClusterTopology CRD

3. **PR 3: Tests & Docs** (~32K lines)
   - All *_test.go files
   - Documentation updates
   - Can be reviewed/merged last

## Review Tips

- Skip generated files: CRDs, go.sum, zz_generated.*
- 74 commits total - can review by commit
- Tests validate behavior - review implementation first

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Design document proposing kubectl-grove, a kubectl plugin for managing
Grove AI inference workloads. Key features:

- Arborist TUI for hierarchical resource navigation
- Topology visualization with GPU allocation and fragmentation warnings
- Status, health, and diagnostics commands
- Lifecycle management (rollout, scale, update, restart, apply)

Co-Authored-By: Claude Opus 4.5 <[email protected]>