kubectl-grove v2: Comprehensive CLI for Grove Cluster Management #338
Comprehensive design document for transforming Arborist from a basic diagnostics tool into a full-featured CLI for Grove operations.

Key features planned:
- P0: arborist status (match RBG), arborist generate (AIC integration)
- P1: arborist topology (visualization), arborist health (gang monitoring)
- P2: arborist compare (plan vs actual), arborist metrics (Prometheus)
- P2+: arborist tui (interactive terminal UI)

The strategy is to leapfrog RBG by building observability features it doesn't have, leveraging Grove's unique data (PlacementScore, ClusterTopology, TerminationDelay countdown).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Major changes:
- Rename operator/cmd/arborist/ → operator/cmd/kubectl-grove/
- Delete empty cli-plugin/ directory (was placeholder only)
- Update all references from arborist to kubectl-grove
- Update requirements doc with PM decisions:
  - CLI naming: kubectl grove (kubectl plugin)
  - P0 priority: Parallel (status + topology together)
  - Plan storage: ConfigMap with grove.io/aic-plan label
  - TUI priority: Phase 2 (higher than originally planned)
  - Metrics: Direct pod scraping (no Prometheus dependency)

The kubectl-grove plugin will be Grove's answer to kubectl rbg, with differentiating features like topology visualization and PlacementScore display that RBG doesn't have.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ate, plan)

Add comprehensive kubectl plugin functionality for Grove:
- status: Show PodCliqueSet status with clique and gang information
- health: Gang health dashboard with threshold monitoring
- topology: Visualize pod placement across topology domains
- generate: Generate Grove manifests using AIConfigurator logic
- plan: Store, show, and diff deployment plans

Features:
- Watch mode for real-time updates (topology, health)
- ASCII visualization for topology tree
- Plan storage in ConfigMaps for GitOps workflows
- Comprehensive test coverage

Closes ai-dynamo#329, ai-dynamo#330, ai-dynamo#331, ai-dynamo#332, ai-dynamo#333

Co-Authored-By: Claude Opus 4.5 <[email protected]>
New commands:
- tui: Interactive terminal UI with Bubble Tea framework
  - 4 tab-switchable views: Hierarchy, Topology, Health, Help
  - Vim-style navigation (j/k, g/G, Enter to expand)
  - Real-time updates via K8s watch API
- metrics: Direct pod metrics scraping
  - Auto-detect inference engine (SGLang, vLLM, TRT-LLM)
  - Prometheus format parsing
  - Watch mode with trend indicators
  - JSON output support
- compare: Plan vs actual comparison
  - Compare configuration (replicas, GPUs, TP size)
  - Topology/placement score analysis
  - Auto-generate diagnosis and recommendations

Closes ai-dynamo#334, ai-dynamo#335, ai-dynamo#336

Co-Authored-By: Claude Opus 4.5 <[email protected]>
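To make the "Prometheus format parsing" step of the metrics command concrete, here is a minimal sketch of parsing the Prometheus text exposition format with only the Go standard library. It is not the PR's actual implementation: the `Sample` type, the `parseMetrics` function, and the metric names in the usage note are hypothetical, and the parser deliberately ignores HELP/TYPE metadata, timestamps, and label-value unescaping.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Sample is one metric line. Hypothetical type, for illustration only.
type Sample struct {
	Name   string
	Labels string // raw label block without braces; "" when absent
	Value  float64
}

// parseMetrics reads Prometheus text exposition lines of the form
//   name{label="v"} value [timestamp]
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
func parseMetrics(text string) ([]Sample, error) {
	var out []Sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, labels, rest := line, "", ""
		if i := strings.IndexByte(line, '{'); i >= 0 {
			// Metric names cannot contain '{', so the first '{' opens the labels.
			j := strings.LastIndexByte(line, '}')
			if j < i {
				return nil, fmt.Errorf("unterminated label block: %q", line)
			}
			name, labels, rest = line[:i], line[i+1:j], line[j+1:]
		} else if k := strings.IndexAny(line, " \t"); k >= 0 {
			name, rest = line[:k], line[k:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return nil, fmt.Errorf("missing value: %q", line)
		}
		// ParseFloat also accepts NaN, +Inf and -Inf, which the format allows.
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out = append(out, Sample{Name: name, Labels: labels, Value: v})
	}
	return out, sc.Err()
}
```

Calling `parseMetrics` on a scraped body such as `example_requests_running{model="m"} 3` yields one `Sample` with the name, the raw label block, and the value 3. A production scraper would more likely rely on an existing Prometheus parsing library than on hand-rolled string handling.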
- Add ANSI color support to topology visualization:
  - Green/Yellow/Red for pod status (Running/Pending/Failed)
  - Cyan [P] / Magenta [D] role badges for prefill/decode
  - Color-coded GPU utilization bars with Unicode blocks
  - Placement score coloring based on quality
  - Colored warnings with icons
  - Respects NO_COLOR env var and TTY detection
- Add k9s plugin configuration (cmd/kubectl-grove/k9s/):
  - plugins.yaml: 14 shortcuts for Grove commands
  - aliases.yaml: CRD shortcuts (:pcs, :pc, :pg, :ct)
  - README.md: Installation and usage guide

Co-Authored-By: Claude Opus 4.5 <[email protected]>
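The color rules in this commit (status colors, NO_COLOR, TTY detection) can be sketched as follows. This is an illustrative sketch rather than the code in the PR: the function names `colorEnabled`, `stdoutIsTTY`, and `colorizeStatus` are hypothetical, and the character-device check is a standard-library heuristic that a real CLI might replace with a dedicated terminal-detection package.

```go
package main

import "os"

// Standard ANSI SGR escape sequences.
const (
	ansiReset  = "\x1b[0m"
	ansiRed    = "\x1b[31m"
	ansiGreen  = "\x1b[32m"
	ansiYellow = "\x1b[33m"
)

// colorEnabled decides whether to emit colors. By the NO_COLOR convention,
// any non-empty value of the variable disables color; output that is not a
// terminal (pipes, files) also gets plain text.
func colorEnabled(noColor string, isTTY bool) bool {
	return noColor == "" && isTTY
}

// stdoutIsTTY is a heuristic: it reports whether stdout is a character
// device. It is also true for devices such as /dev/null.
func stdoutIsTTY() bool {
	fi, err := os.Stdout.Stat()
	if err != nil {
		return false
	}
	return fi.Mode()&os.ModeCharDevice != 0
}

// colorizeStatus applies the Running/Pending/Failed mapping from the commit
// message; unknown statuses are returned unchanged.
func colorizeStatus(status string, enabled bool) string {
	if !enabled {
		return status
	}
	var color string
	switch status {
	case "Running":
		color = ansiGreen
	case "Pending":
		color = ansiYellow
	case "Failed":
		color = ansiRed
	default:
		return status
	}
	return color + status + ansiReset
}
```

A caller would typically compute `enabled := colorEnabled(os.Getenv("NO_COLOR"), stdoutIsTTY())` once at startup and pass it to every rendering function, which keeps the rendering code testable without a real terminal.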
@athreesh while this PR is in draft, can you please use an existing Go module?
- Create new cli-plugin/ module as standalone kubectl-grove CLI
- Replace Bubbletea TUI with tview-based Arborist TUI
- Add topology visualization to TUI (press 't' to view)
- Add hierarchical navigation: Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod
- Copy diagnostics package to cli-plugin (can't import internal from another module)
- Fix test label selectors (grove.io/podcliqueset -> app.kubernetes.io/part-of)
- Fix GPU bar tests to use Unicode characters

New Arborist TUI features:
- Split-pane UI with resources table and events panel
- Drill-down navigation with Enter, back with Escape
- Tab to switch between resources/events panes
- Auto-refresh every 2 seconds
- Color-coded status (green=Running, yellow=Pending, red=Failed)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
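The drill-down behavior described in this commit (Enter to descend, Escape to go back through Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod) maps naturally onto a small stack. The sketch below assumes that model; the `Navigator` type and its methods are hypothetical and independent of tview, and only the level names come from the commit message.

```go
package main

import "strings"

// levels is the hierarchy named in the commit message, root first.
var levels = []string{"Forest", "PodCliqueSet", "PodGang", "PodClique", "Pod"}

// Navigator tracks the items selected on the way down the hierarchy.
// The zero value starts at the Forest view.
type Navigator struct {
	selected []string
}

// Level returns the kind of resource the current view is scoped to.
func (n *Navigator) Level() string {
	return levels[len(n.selected)]
}

// Enter drills into the named item (the Enter key). It returns false at the
// Pod level, where there is nothing deeper to show.
func (n *Navigator) Enter(name string) bool {
	if len(n.selected) == len(levels)-1 {
		return false
	}
	n.selected = append(n.selected, name)
	return true
}

// Back pops one level (the Escape key). It returns false at the root.
func (n *Navigator) Back() bool {
	if len(n.selected) == 0 {
		return false
	}
	n.selected = n.selected[:len(n.selected)-1]
	return true
}

// Breadcrumb renders the path from the root to the current selection.
func (n *Navigator) Breadcrumb() string {
	return strings.Join(append([]string{levels[0]}, n.selected...), " > ")
}
```

After `Enter("pcs-a")` and `Enter("gang-0")` on a fresh `Navigator`, `Level()` is `PodGang` and `Breadcrumb()` is `Forest > pcs-a > gang-0`. Keeping this state separate from the UI widgets lets key handlers stay thin and makes the navigation rules testable on their own.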
## AIConfigurator Integration (Rewritten)
- Properly execute aiconfigurator CLI as subprocess
- Parse generator_config.yaml output (not stdout)
- Transform to PodCliqueSet manifests
- Use JSON struct tags for sigs.k8s.io/yaml compatibility

## Bug Fixes
- Fix metrics port-forward: Use actual SPDY port-forwarding instead of falling back to direct pod IP (which doesn't work outside cluster)
- Fix silent failures in arborist_client.go: Log warnings instead of silently continuing on conversion errors
- Fix hardcoded image version: Use AIConfigurator-provided image as fallback, with :latest as final default

## New Features
- Namespace resolution from kubeconfig context
- Shell completion support (bash, zsh, fish)
- Topology watch mode (-w flag now works)

## Documentation
- Add docs/user-guide/cli.md comprehensive CLI reference
- Add docs/designs/cli-update-commands.md for rolling updates design

Co-Authored-By: Claude Opus 4.5 <[email protected]>
This PR is in Draft mode for prototyping Grove CLI.

## Line Breakdown (~62K total)

| Category               | Lines | Notes                                    |
|------------------------|-------|------------------------------------------|
| Test files (*_test.go) | ~29K  | ~47% of PR - comprehensive test coverage |
| cli-plugin/ (total)    | ~15K  | New CLI module (includes ~7K tests)      |
| operator/ (non-test)   | ~12K  | Operator code changes                    |
| docs/                  | ~2.8K | User guide, design docs, API reference   |
| CRD YAML               | ~1.7K | Generated CRD manifests                  |
| go.mod/go.sum          | ~1.5K | Dependency lockfiles (machine-generated) |

Key insight: Almost half (~29K) is test code. Actual implementation is ~30K lines across operator and cli-plugin.

## Split Strategy (if needed for merge)

Option to split into 2-3 smaller PRs:

1. **PR 1: cli-plugin migration** (~15K lines)
   - New standalone CLI module
   - Moved from operator/cmd/kubectl-grove/
   - Includes Arborist TUI, commands, AIC integration
2. **PR 2: Operator changes** (~12K lines)
   - API changes, webhook validation
   - Topology constraints
   - ClusterTopology CRD
3. **PR 3: Tests & Docs** (~32K lines)
   - All *_test.go files
   - Documentation updates
   - Can be reviewed/merged last

## Review Tips
- Skip generated files: CRDs, go.sum, zz_generated.*
- 74 commits total - can review by commit
- Tests validate behavior - review implementation first

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Design document proposing kubectl-grove, a kubectl plugin for managing Grove AI inference workloads.

Key features:
- Arborist TUI for hierarchical resource navigation
- Topology visualization with GPU allocation and fragmentation warnings
- Status, health, and diagnostics commands
- Lifecycle management (rollout, scale, update, restart, apply)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
This PR implements the kubectl-grove CLI with comprehensive commands for managing Grove AI inference workloads on Kubernetes. It includes a complete migration of the CLI to a standalone module and replaces the Bubbletea TUI with the Arborist (tview-based) TUI.
What's Included
CLI Commands
- kubectl grove status
- kubectl grove topology
- kubectl grove health
- kubectl grove metrics
- kubectl grove tui
- kubectl grove diagnostics
Key Features
- Migration from operator/cmd/kubectl-grove/ to cli-plugin/
- Watch mode for the topology and health commands
PR Size Breakdown (~62K lines)
- Test files (*_test.go): ~29K lines
Key insight: Almost half (~29K) is test code. Actual implementation is ~30K lines.
Split Strategy (if needed for merge)
This PR can be split into 2-3 smaller PRs:
PR 1: cli-plugin migration (~15K lines)
- New standalone CLI module in cli-plugin/
PR 2: Operator changes (~12K lines)
PR 3: Tests & Docs (~32K lines)
- All *_test.go files
Review Tips
- Skip generated files such as zz_generated.*
Example Usage
Test plan
- Run the CLI tests (cd cli-plugin && go test ./...)
Related Issues
Closes #329 (kubectl grove status)
Closes #330 (kubectl grove topology)
Closes #331 (kubectl grove generate)
Closes #332 (kubectl grove plan)
Closes #333 (kubectl grove health)
🤖 Generated with Claude Code