
Conversation

@kaim-eng kaim-eng commented Jan 8, 2026

Overview:

This PR implements power-aware autoscaling with GPU power enforcement. The planner monitors workload metrics via Prometheus, calculates required replicas within power budget constraints, and annotates pods with power limits. A Power Agent DaemonSet enforces limits at the hardware level via NVML.
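Conceptually, the planner's power-budget step reduces to clamping the desired replica count so that aggregate GPU power stays within budget. A minimal sketch, assuming a simple linear per-replica power model (the function and parameter names here are illustrative, not the actual API in planner_core.py):

```python
# Hypothetical sketch of the budget clamp described above. The real logic
# lives in components/src/dynamo/planner/utils/planner_core.py; these names
# and the linear power model are illustrative only.

def clamp_replicas_to_power_budget(
    desired_replicas: int,
    per_replica_power_w: float,
    total_power_budget_w: float,
) -> int:
    """Largest replica count whose aggregate GPU power fits the budget."""
    if per_replica_power_w <= 0:
        return desired_replicas  # no meaningful per-replica cost to enforce
    max_replicas = int(total_power_budget_w // per_replica_power_w)
    return max(0, min(desired_replicas, max_replicas))
```

For example, with a 2000 W budget and 400 W per replica, a request for 10 replicas would be clamped to 5.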

Details:

Where should the reviewer start?

examples/deployments/powerplanner/README.md

To run the full e2e test, assuming the minikube executables are under ${DEV_REPO}/bin_bin:

export HF_TOKEN=YOUR_TOKEN
export DEV_REPO=YOUR_REPO_DIR
export MINIKUBE_HOME=${DEV_REPO}/minikube_home
cd ${DEV_REPO}/examples/deployments/powerplanner
bash full_clean_test.bash

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPU power-aware autoscaling with configurable power budgets
    • Added per-pod GPU power limit enforcement
    • Added GPU power usage monitoring via Prometheus integration
  • Documentation

    • Added comprehensive deployment and configuration guides
    • Added monitoring and verification utilities for power-aware deployments
  • Chores

    • Added Kubernetes deployment manifests and automated setup scripts


@kaim-eng kaim-eng requested review from a team as code owners January 8, 2026 17:39

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jan 8, 2026

Walkthrough

Adds a comprehensive power-aware autoscaling feature for GPU clusters, introducing a new Power Agent daemon that enforces GPU power limits via NVML, extends the SLA Planner with power budget enforcement logic, integrates Prometheus power queries, and provides complete deployment automation and verification scripts for Kubernetes environments.

Changes

Cohort / File(s) Summary
Power Agent Daemon Component
components/power_agent/Dockerfile, components/power_agent/power_agent.py, components/power_agent/requirements.txt, components/power_agent/README.md
New NodePowerAgent daemon implementation (249 LOC) that monitors pod annotations, maps GPU processes to pod UIDs via /proc/cgroup, and applies NVML power limits. Docker container setup using NVIDIA CUDA base image with Python3 and dependencies (kubernetes, nvidia-ml-py). Comprehensive documentation covering architecture, deployment, and operation flow.
Planner Power-Awareness Configuration
components/src/dynamo/planner/defaults.py, components/src/dynamo/planner/utils/planner_argparse.py
Four new power-related parameters added to SLAPlannerDefaults: enable_power_awareness flag, total_gpu_power_limit, prefill_engine_gpu_power_limit, decode_engine_gpu_power_limit. Corresponding CLI arguments registered in argparse for user configuration.
Planner Core Logic & Power Enforcement
components/src/dynamo/planner/utils/planner_core.py, components/src/dynamo/planner/utils/prometheus.py
Power budget enforcement during replica planning (clamping replicas to fit power limits), pod annotation patching for per-pod power limits, and Prometheus methods for cluster-wide and per-component power telemetry. 167 new lines in planner_core, 56 in prometheus utilities.
Kubernetes API Integration
components/src/dynamo/planner/kube.py, components/src/dynamo/planner/kubernetes_connector.py
CoreV1Api client initialization in KubernetesAPI; new get_component_pods method in KubernetesConnector to retrieve pods by component label for applying power limits.
Power Agent Deployment Infrastructure
deploy/power_agent/README.md, deploy/power_agent/daemonset.yaml
DaemonSet manifest (119 LOC) with privileged container, hostPID access, NVML driver mounting, and RBAC configuration; operational deployment guide (217 LOC) covering prerequisites, configuration, security, and troubleshooting.
Power-Aware Deployment Manifests & Examples
examples/deployments/powerplanner/agg.yaml, examples/deployments/powerplanner/disagg.yaml, examples/deployments/powerplanner/dynamo-worker-podmonitor.yaml, examples/deployments/powerplanner/planner-clusterrole-patch.yaml, examples/deployments/powerplanner/profile_sla_aic_dgdr.yaml, examples/deployments/powerplanner/prometheus-values.yaml
Example DynamoGraph deployments (aggregated and disaggregated topologies), Prometheus PodMonitor for worker metrics, RBAC ClusterRole patch granting pod patching permissions, profiling configuration, and Prometheus scrape settings.
Power-Aware Deployment Automation Scripts
examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash, examples/deployments/powerplanner/deploy_poweraware.bash, examples/deployments/powerplanner/monitor_poweraware.bash, examples/deployments/powerplanner/verify_poweraware.bash, examples/deployments/powerplanner/full_clean_test.bash
End-to-end deployment automation and testing: base infrastructure setup with Minikube provisioning (186 LOC), power-aware feature deployment with image build/load (375 LOC), real-time monitoring dashboard with status and log streaming (178 LOC), comprehensive verification suite with 13 validation tests (312 LOC), and orchestration script for full clean deployment workflow (119 LOC).
Documentation
examples/deployments/powerplanner/README.md
Comprehensive guide (895 LOC) covering power-aware autoscaling architecture, two-phase deployment, algorithm details, component modifications, production readiness criteria, and monitoring/troubleshooting procedures.
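The Power Agent's PID-to-pod mapping described in the table above can be sketched as follows. This is a simplified illustration: the regex, the /host/proc mount path, and the function name are assumptions, and real cgroup path layouts vary by container runtime and cgroup driver.

```python
import re
from typing import Optional

# Hypothetical sketch of resolving a GPU process PID to its pod UID by
# parsing the host's /proc/<pid>/cgroup, roughly as the Power Agent does.
# The systemd cgroup driver writes pod UIDs with underscores instead of
# dashes, so the pattern accepts both.
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)

def pod_uid_for_pid(pid: int, proc_root: str = "/host/proc") -> Optional[str]:
    """Return the pod UID owning `pid`, or None if it is not a pod process."""
    try:
        with open(f"{proc_root}/{pid}/cgroup") as f:
            for line in f:
                m = _POD_UID_RE.search(line)
                if m:
                    # normalize systemd-style underscores back to dashes
                    return m.group(1).replace("_", "-")
    except FileNotFoundError:
        pass  # process exited between the NVML query and this lookup
    return None
```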

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes introduce substantial new logic across heterogeneous domains (daemon service with NVML, planner power enforcement, Kubernetes API integration, deployment automation), requiring careful review of daemon correctness, power calculation accuracy, pod annotation patching safety, and deployment manifest configurations. Multiple integration points and the variety of Python, YAML, and Bash code add review complexity.

Poem

🐰 Hops with joy at power's might,
GPU limits set just right!
Daemon dances through the nodes,
Annotating all the loads,
Watts and wisely routed high—
Let autoscaling touch the sky!

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (1 warning, 2 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 38.46%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'feat: power planner init check-in' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changes made. Resolution: use a more specific title that clearly describes the main change, such as 'feat: add power-aware autoscaling with GPU power enforcement' or 'feat: implement power-aware replica planning with NVML enforcement'.
  • Description check (❓ Inconclusive): The PR description lacks critical details about changes, has incomplete sections, and references a placeholder issue number. Resolution: complete the 'Details' section describing the full scope of changes across power planner, agent, and planner modifications; replace '#xxx' with the actual GitHub issue number; clarify integration points and breaking changes if any.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/planner/utils/planner_argparse.py (1)

1-162: Fix Black formatting issue before merge.

The pipeline indicates Black formatting needs to be applied to this file. Please run pre-commit run --all-files or black components/src/dynamo/planner/utils/planner_argparse.py to resolve the formatting issue.

🤖 Fix all issues with AI agents
In @components/power_agent/power_agent.py:
- Line 78: Replace the plain logger.error calls with logger.exception so
tracebacks are included: update the NVML initialization log in initialize_nvml
(currently logger.error(f"Failed to initialize NVML: {e}")), the pod listing
error in the function that lists pods (logger.error(f"Failed to list pods:
{e}")), both GPU error handlers in the GPU monitoring routine
(logger.error(f"NVML error on GPU {gpu_idx}: {e}") and logger.error(f"Unexpected
error on GPU {gpu_idx}: {e}")), and the reconciliation loop error
(logger.error(f"Error in reconciliation loop: {e}")) to use logger.exception
while keeping the existing messages for context.
- Around line 197-204: The loop that sets target_limit/target_pod_uid from
pid_pod_map assumes exclusive GPU assignment (single pod per GPU) but currently
breaks after the first match; update the logic in the block that iterates
pid_pod_map (symbols: pid_pod_map, desired_state, target_limit, target_pod_uid)
to detect when multiple matching pods exist and handle that case: either log a
warning identifying the GPU/pod UIDs when more than one pod maps to the same
GPU, or compute and use the most restrictive limit across all matching pods
(e.g., choose the minimum power limit) and log the decision; also add a brief
comment documenting the single-pod-per-GPU assumption and include
validation/logging so multi-tenant sharing is visible at runtime.

In @components/power_agent/requirements.txt:
- Around line 1-2: Add an explicit urllib3 requirement to
components/power_agent/requirements.txt to ensure the dependency used by
kubernetes (kubernetes==28.1.0) is at least the patched version; update the file
by appending a line for urllib3>=2.6.3 (keep existing kubernetes==28.1.0 and
nvidia-ml-py==12.535.133 lines) so the installer will pull a safe urllib3
release that addresses the listed CVEs.

In @examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash:
- Line 78: Before invoking the heavy minikube start command, add a pre-check
that calculates available system memory (e.g., set TOTAL_MEM_GB using free -g
and awk) and compare it to a REQUIRED_MEM_GB (500); if available memory is less
than REQUIRED_MEM_GB, print a clear error and exit instead of running minikube
start --driver docker --mount --mount-string="/proc:/host/proc"
--container-runtime docker --gpus all --memory=500gb --cpus=32, so the script
validates resources and avoids attempting the allocation on under-provisioned
hosts.
- Around line 11-12: The script sets incorrect path variable values due to
typos: change PATH export from using "${DEV_REPO}/bin_bin" to "${DEV_REPO}/bin"
and correct MINIKUBE_HOME from "${DEV_REPO}/minibute_home" to
"${DEV_REPO}/minikube_home" so the variables PATH and MINIKUBE_HOME point to the
intended directories; update the lines that reference these exact symbols
(export PATH=... and export MINIKUBE_HOME=...) accordingly and ensure no other
occurrences of the misspellings remain.
- Around line 112-116: The script performs directory changes with "cd
${DEV_REPO}/deploy/helm/charts" and later "cd -" without checking results;
update these to fail fast on error by testing the cd exit status (or use
pushd/popd and check their return values) and aborting with a clear error
message if the directory change fails so that subsequent commands like
helm/uninstall/install and kubectl run only when in the expected directory.
Ensure any added checks include descriptive context (which directory change
failed) and mirror the second "cd -" reversal logic so the script always returns
to the original directory or exits on failure.

In @examples/deployments/powerplanner/full_clean_test.bash:
- Around line 43-49: The exit check is using $? after a pipeline so it reads the
status of tail, not the bash command that ran deploy_poweraware_baseinfra.bash;
change the logic that runs "bash
${DEV_REPO}/examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash 0
2>&1 | tail -10" so you capture the deploy script's exit code (e.g., read
PIPESTATUS[0] right after the pipeline or assign the bash command to a variable
and check that) and then use that captured value instead of $? when deciding in
the if/else that prints "Phase 1: Cleanup complete/failed".

In @examples/deployments/powerplanner/verify_poweraware.bash:
- Around line 15-16: Replace the systematic typos in the powerplanner deployment
scripts by changing the PATH entry value from "${DEV_REPO}/bin_bin" to
"${DEV_REPO}/bin" and the MINIKUBE_HOME export from "${DEV_REPO}/minibute_home"
to "${DEV_REPO}/minikube_home" across all affected files (references appear in
verify_poweraware.bash, deploy_poweraware.bash, full_clean_test.bash,
deploy_poweraware_baseinfra.bash and README.md); update the export lines that
set PATH and MINIKUBE_HOME to use the corrected directory names and ensure any
README examples reflect the same corrected paths.
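The "most restrictive limit wins" resolution suggested above for GPUs shared by multiple pods can be sketched as follows (function and variable names are illustrative, not the agent's actual symbols):

```python
import logging
from typing import Optional

logger = logging.getLogger("power_agent")

def resolve_gpu_limit(gpu_idx: int, matching_limits: dict) -> Optional[int]:
    """Pick the power limit for one GPU.

    `matching_limits` maps pod UID -> requested power limit in watts for
    every pod that mapped to this GPU. With exclusive assignment there is
    at most one entry; if several pods share the GPU, log it and apply the
    most restrictive (minimum) limit.
    """
    if not matching_limits:
        return None
    if len(matching_limits) > 1:
        logger.warning(
            "GPU %d is shared by pods %s; applying the most restrictive limit",
            gpu_idx, sorted(matching_limits),
        )
    return min(matching_limits.values())
```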
🟡 Minor comments (12)
examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash-1-1 (1)

1-1: Re-run pre-commit to fix formatting.

The pipeline detected trailing whitespace. Please run pre-commit hooks to clean up formatting issues.

components/power_agent/power_agent.py-1-1 (1)

1-1: Re-run pre-commit to fix formatting issues.

The pipeline detected multiple formatting issues (isort, Black, ruff). Please run pre-commit hooks to resolve these automatically.

components/power_agent/requirements.txt-1-1 (1)

1-1: Add required SPDX copyright header.

The file is missing the required SPDX header as detected by the copyright checker.

📝 Add this header
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 kubernetes==28.1.0
 nvidia-ml-py==12.535.133
examples/deployments/powerplanner/full_clean_test.bash-19-19 (1)

19-19: Typo in environment variable.

minibute_home should be minikube_home.

Proposed fix
-export MINIKUBE_HOME=${DEV_REPO}/minibute_home
+export MINIKUBE_HOME=${DEV_REPO}/minikube_home
examples/deployments/powerplanner/prometheus-values.yaml-1-6 (1)

1-6: Add missing SPDX license header.

The pipeline is failing because this file is missing the required SPDX header. Add the license header at the top of the file.

Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 prometheus:
   prometheusSpec:
     # Setting an empty object tells Prometheus to select monitors from all namespaces
     podMonitorSelectorNilUsesHelmValues: false
     podMonitorNamespaceSelector: {}
     probeNamespaceSelector: {}
examples/deployments/powerplanner/README.md-1-1 (1)

1-1: Fix trailing whitespace before merge.

The pipeline indicates trailing whitespace needs to be cleaned. Please run pre-commit run --all-files to automatically fix this issue.

examples/deployments/powerplanner/README.md-482-482 (1)

482-482: Remove or update the link to BARE_METAL_DEPLOYMENT.md.

The file examples/deployments/powerplanner/BARE_METAL_DEPLOYMENT.md does not exist in the repository. Either remove the reference from the link on line 482, or create the file if it should be included in this PR.

examples/deployments/powerplanner/deploy_poweraware.bash-1-10 (1)

1-10: Fix trailing whitespace.

The pipeline indicates trailing whitespace issues. Re-run pre-commit hooks to resolve.

components/src/dynamo/planner/utils/prometheus.py-212-226 (1)

212-226: Apply same exception handling improvements.

Similar to get_power_by_component, consider using logging.exception instead of logging.error on line 224 to include traceback information, and address the Black formatting issues flagged by the pipeline.

examples/deployments/powerplanner/verify_poweraware.bash-1-10 (1)

1-10: Fix trailing whitespace.

The pipeline indicates trailing whitespace issues in this file. Re-run pre-commit hooks to clean up.

components/src/dynamo/planner/defaults.py-73-73 (1)

73-73: Fix formatting issue.

The pipeline indicates Black formatting needs correction on this file. Re-run pre-commit hooks to resolve.

components/src/dynamo/planner/utils/prometheus.py-172-211 (1)

172-211: Fix formatting and consider improving exception handling.

The pipeline indicates Black formatting issues. Additionally, static analysis suggests improving the exception handling pattern:

  1. The broad Exception catch (line 208) could mask unexpected errors
  2. Consider using logging.exception (line 209) to include traceback
  3. The return statement on line 206 could be in an else block per TRY300

However, the query logic and per-entry error handling look solid for production use.

♻️ Improved exception handling
             logger.debug(f"Power consumption by component: {power_map}")
             return power_map
-        
         except Exception as e:
-            logger.error(f"Failed to query power by component: {e}")
+            logger.exception(f"Failed to query power by component: {e}")
             return {}

Note: While the broad Exception catch is flagged by ruff (BLE001), it's consistent with the existing pattern in this file and provides fail-safe behavior for production monitoring.

🧹 Nitpick comments (12)
components/power_agent/power_agent.py (1)

73-79: Add NVML cleanup on shutdown.

NVML is initialized but never shut down. While this may not cause issues in a long-running daemon, it's best practice to add cleanup handling.

🧹 Suggested cleanup handler
         try:
             pynvml.nvmlInit()
             self.device_count = pynvml.nvmlDeviceGetCount()
             logger.info(f"Initialized NVML. Found {self.device_count} GPUs on node {self.node_name}.")
         except pynvml.NVMLError as e:
-            logger.error(f"Failed to initialize NVML: {e}")
+            logger.exception(f"Failed to initialize NVML: {e}")
             raise
+    
+    def __del__(self):
+        """Cleanup NVML on shutdown."""
+        try:
+            pynvml.nvmlShutdown()
+        except Exception:
+            pass

Alternatively, add signal handlers for graceful shutdown:

import signal
import sys

def shutdown_handler(signum, frame):
    logger.info("Shutting down Power Agent...")
    pynvml.nvmlShutdown()
    sys.exit(0)

# In __init__ or run():
signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)
examples/deployments/powerplanner/full_clean_test.bash (1)

87-87: Add comment explaining the 5-minute wait.

A hardcoded 300-second sleep without explanation makes the script harder to maintain. Consider adding a comment explaining why this wait is necessary (e.g., waiting for pods to stabilize, metrics to populate).

Proposed fix
+# Wait for power-aware components to stabilize and metrics to populate
 sleep 300
components/power_agent/README.md (1)

83-84: Convert bare URLs to markdown links.

Per markdown best practices, bare URLs should be wrapped in markdown link syntax for better readability and accessibility.

Proposed fix
-- Kubernetes DaemonSet Best Practices: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
-- NVML Documentation: https://docs.nvidia.com/deploy/nvml-api/
+- [Kubernetes DaemonSet Best Practices](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/)
+- [NVML Documentation](https://docs.nvidia.com/deploy/nvml-api/)
deploy/power_agent/README.md (1)

18-22: Clarify registry placeholder in push command.

The docker push command uses dynamo/power-agent:v1.0.0 which is a placeholder. Consider adding a note that users should replace this with their container registry path.

Proposed fix
 ```bash
 cd components/power_agent
 docker build -t dynamo/power-agent:v1.0.0 .
-docker push dynamo/power-agent:v1.0.0  # Push to your registry
+docker push <your-registry>/power-agent:v1.0.0  # Replace with your registry
examples/deployments/powerplanner/README.md (2)

15-15: Consider formatting bare URLs as markdown links.

Static analysis flagged bare URLs on lines 15 and 77. While this is a minor style issue, consider formatting them as proper markdown links for better presentation:

[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

Also applies to: 77-77


27-27: Consider adding language identifiers to code blocks.

Multiple code blocks are missing language identifiers (lines 27, 157, 183, 328, 344, 488, 507, 522). Adding language identifiers improves syntax highlighting and readability. For example:

-```
+```text
 ╔══════════════════════════════════════════════════════════════╗
 ║              ALL 17 TESTS PASSED ✓                           ║

Also applies to: 157-157, 183-183, 328-328, 344-344, 488-488, 507-507, 522-522

components/src/dynamo/planner/defaults.py (1)

74-78: Consider validating the total GPU power limit at runtime.

While the defaults provide backwards compatibility, the comment on line 76 states "must be configured per datacenter!" but the default of 2000W might be silently used if operators forget to override it. Consider adding runtime validation or warnings when power awareness is enabled with the default total_gpu_power_limit.

💡 Suggested approach

In the planner initialization code, add validation:

if self.enable_power_awareness and self.total_gpu_power_limit == 2000:
    logger.warning(
        "Using default total_gpu_power_limit of 2000W. "
        "This should be configured per datacenter for accurate power budgeting."
    )
examples/deployments/powerplanner/verify_poweraware.bash (2)

244-289: Verify nvidia-smi availability assumption.

Test 13 assumes nvidia-smi is available on the host machine running the script. This will fail in environments without direct GPU access (e.g., CI/CD, remote clusters, developer machines without GPUs). Consider:

  1. Adding a prerequisite check for nvidia-smi availability
  2. Marking this test as optional/skippable in GPU-less environments
  3. Documenting this requirement in the script header or README
🔧 Suggested prerequisite check
# At the beginning of Test 13
if ! command -v nvidia-smi &> /dev/null; then
    warn "nvidia-smi not found - skipping hardware power limit verification"
    # Continue to next test
fi

200-204: Consider more targeted port-forward cleanup.

Line 200 uses pkill -9 which sends SIGKILL to all matching processes. This could affect other users' port-forwards on shared systems. Consider:

  1. Using pkill -15 (SIGTERM) first for graceful shutdown
  2. Narrowing the pattern to include the namespace or PID file
  3. Only killing processes owned by the current user
♻️ Gentler cleanup approach
-pkill -9 -f "port-forward.*8000" 2>/dev/null || true
+# Try graceful shutdown first
+pkill -15 -f "port-forward.*8000.*${NAMESPACE}" 2>/dev/null || true
+sleep 1
+# Force kill if still running
+pkill -9 -f "port-forward.*8000.*${NAMESPACE}" 2>/dev/null || true
examples/deployments/powerplanner/monitor_poweraware.bash (1)

64-73: Remove unused STATUS variable.

The STATUS variable is extracted on Line 67 but never used. This creates unnecessary overhead and clutters the code.

♻️ Proposed cleanup
     kubectl get pods -n ${NAMESPACE} 2>/dev/null | grep -E "(prefill|decode)" | grep Running | while read line; do
         POD_NAME=$(echo $line | awk '{print $1}')
         ANNOTATION=$(kubectl get pod $POD_NAME -n ${NAMESPACE} -o jsonpath='{.metadata.annotations.dynamo\.nvidia\.com/gpu-power-limit}' 2>/dev/null || echo "not set")
-        STATUS=$(echo $line | awk '{print $3}')
         if [ "$ANNOTATION" != "not set" ]; then
             echo -e "${GREEN}✓${NC} $POD_NAME: ${ANNOTATION}W"
         else
             echo -e "${YELLOW}○${NC} $POD_NAME: $ANNOTATION"
         fi
     done
components/src/dynamo/planner/utils/planner_core.py (2)

591-617: Consider more specific exception handling for better diagnostics.

The bare Exception catch on Line 615 is intentional for resilience, but catching more specific exceptions would provide better error messages and allow truly unexpected errors to propagate appropriately.

♻️ More specific exception handling
     try:
         patch = {
             "metadata": {
                 "annotations": {
                     "dynamo.nvidia.com/gpu-power-limit": str(power_limit)
                 }
             }
         }
         # Uses standard K8s API - no exec!
         self.connector.kube_api.core_api.patch_namespaced_pod(
             name=pod_name,
             namespace=namespace,
             body=patch
         )
         logger.debug(f"Set power limit annotation on {namespace}/{pod_name}: {power_limit}W")
-    except Exception as e:
-        logger.warning(f"Failed to patch pod {namespace}/{pod_name}: {e}")
+    except (ApiException, AttributeError, ValueError) as e:
+        logger.warning(f"Failed to patch pod {namespace}/{pod_name}: {e}")

You'll need to add the import:

from kubernetes.client.exceptions import ApiException
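The pods being patched here are typically retrieved by component label; a rough sketch of such a lookup follows. The label key is an assumption for illustration, not the repository's actual selector, and the real helper is get_component_pods in kubernetes_connector.py.

```python
# Illustrative pod-by-component lookup via the standard Kubernetes API.
# COMPONENT_LABEL is an assumed label key for this sketch.

COMPONENT_LABEL = "nvidia.com/dynamo-component"  # assumption, not the repo's key

def build_selector(component: str) -> str:
    """Build a Kubernetes label selector string for one component."""
    return f"{COMPONENT_LABEL}={component}"

def get_component_pods(namespace: str, component: str):
    from kubernetes import client, config  # official Kubernetes Python client
    config.load_incluster_config()  # planner runs inside the cluster
    core = client.CoreV1Api()
    return core.list_namespaced_pod(
        namespace, label_selector=build_selector(component)
    ).items
```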

661-662: Use logging.exception for better error diagnostics.

Line 662 uses logging.error which doesn't include the stack trace. Using logging.exception would provide better diagnostic information when power limit application fails.

🔍 Improved error logging
     except Exception as e:
-        logger.error(f"Failed to apply power limits: {e}")
+        logger.exception(f"Failed to apply power limits: {e}")
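The behavioral difference is easy to demonstrate: logger.exception logs at ERROR level and appends the active traceback, while logger.error drops it.

```python
import io
import logging

# Capture log output in a string buffer to compare the two calls.
logger = logging.getLogger("demo")
logger.setLevel(logging.ERROR)
logger.propagate = False
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

try:
    raise RuntimeError("boom")
except RuntimeError as e:
    logger.error(f"error(): {e}")          # message only, no traceback
    logger.exception(f"exception(): {e}")  # message plus full traceback

output = stream.getvalue()
# "Traceback" appears exactly once, from the logger.exception call
```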
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6458ef8 and 1637c3e.

📒 Files selected for processing (24)
  • components/power_agent/Dockerfile
  • components/power_agent/README.md
  • components/power_agent/power_agent.py
  • components/power_agent/requirements.txt
  • components/src/dynamo/planner/defaults.py
  • components/src/dynamo/planner/kube.py
  • components/src/dynamo/planner/kubernetes_connector.py
  • components/src/dynamo/planner/utils/planner_argparse.py
  • components/src/dynamo/planner/utils/planner_core.py
  • components/src/dynamo/planner/utils/prometheus.py
  • deploy/power_agent/README.md
  • deploy/power_agent/daemonset.yaml
  • examples/deployments/powerplanner/README.md
  • examples/deployments/powerplanner/agg.yaml
  • examples/deployments/powerplanner/deploy_poweraware.bash
  • examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash
  • examples/deployments/powerplanner/disagg.yaml
  • examples/deployments/powerplanner/dynamo-worker-podmonitor.yaml
  • examples/deployments/powerplanner/full_clean_test.bash
  • examples/deployments/powerplanner/monitor_poweraware.bash
  • examples/deployments/powerplanner/planner-clusterrole-patch.yaml
  • examples/deployments/powerplanner/profile_sla_aic_dgdr.yaml
  • examples/deployments/powerplanner/prometheus-values.yaml
  • examples/deployments/powerplanner/verify_poweraware.bash
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • examples/deployments/powerplanner/disagg.yaml
  • deploy/power_agent/daemonset.yaml
  • examples/deployments/powerplanner/profile_sla_aic_dgdr.yaml
  • examples/deployments/powerplanner/agg.yaml
📚 Learning: 2026-01-04T06:45:28.414Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 5153
File: examples/backends/vllm/launch/lora/setup_minio.sh:99-109
Timestamp: 2026-01-04T06:45:28.414Z
Learning: For HuggingFace CLI version reporting (hf version and huggingface-cli version) in v0.34.6 and later, use direct argument syntax instead of the --version flag. Review shell-script changes and any scripts invoking the HuggingFace CLI to ensure they call the version output with a direct argument (e.g., 'hf version' or equivalent) rather than using '--version'. Apply to shell scripts and any related CLI invocations in the repository.

Applied to files:

  • examples/deployments/powerplanner/deploy_poweraware.bash
  • examples/deployments/powerplanner/full_clean_test.bash
  • examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash
  • examples/deployments/powerplanner/verify_poweraware.bash
  • examples/deployments/powerplanner/monitor_poweraware.bash
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • examples/deployments/powerplanner/dynamo-worker-podmonitor.yaml
🧬 Code graph analysis (4)
components/src/dynamo/planner/utils/planner_argparse.py (1)
components/src/dynamo/planner/defaults.py (1)
  • SLAPlannerDefaults (59-78)
components/src/dynamo/planner/utils/prometheus.py (1)
lib/bindings/python/rust/lib.rs (1)
  • component (846-852)
components/src/dynamo/planner/kubernetes_connector.py (2)
components/src/dynamo/planner/defaults.py (2)
  • SubComponentType (133-135)
  • get_service_from_sub_component_type_or_name (185-231)
components/src/dynamo/planner/kube.py (1)
  • get_graph_deployment (62-80)
components/src/dynamo/planner/utils/planner_core.py (3)
components/src/dynamo/planner/utils/prometheus.py (1)
  • get_total_cluster_power (212-225)
components/src/dynamo/planner/kubernetes_connector.py (1)
  • get_component_pods (260-299)
components/src/dynamo/planner/defaults.py (1)
  • SubComponentType (133-135)
🪛 Checkov (3.2.334)
deploy/power_agent/daemonset.yaml

[medium] 53-118: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 53-118: Do not use the CAP_SYS_ADMIN linux capability

(CKV_K8S_39)


[high] 53-118: Container should not be privileged

(CKV_K8S_16)


[medium] 53-118: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 53-118: Containers should not share the host process ID namespace

(CKV_K8S_17)

🪛 GitHub Actions: Copyright Checks
components/power_agent/requirements.txt

[error] 1-1: Invalid/Missing Header: The file is missing required SPDX header as per copyright checker rules.

examples/deployments/powerplanner/prometheus-values.yaml

[error] 1-1: Invalid/Missing Header: The file is missing required SPDX header as per copyright checker rules.

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/5280/merge) by kaim-eng.
examples/deployments/powerplanner/README.md

[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.

components/src/dynamo/planner/utils/planner_argparse.py

[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.

examples/deployments/powerplanner/deploy_poweraware.bash

[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.


[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.

components/src/dynamo/planner/defaults.py

[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.

components/src/dynamo/planner/utils/prometheus.py

[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.

components/src/dynamo/planner/kubernetes_connector.py

[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.

examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash

[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.

components/power_agent/power_agent.py

[error] 1-1: isort: files were modified by this hook. Re-run pre-commit to apply changes.


[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.


[error] 1-1: ruff: 1 error fixed by this hook. See log for details.

examples/deployments/powerplanner/verify_poweraware.bash

[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.


[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.


[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.

examples/deployments/powerplanner/monitor_poweraware.bash

[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.


[error] 1-1: Trailing whitespace cleaned by pre-commit. Re-run pre-commit to apply.

components/src/dynamo/planner/utils/planner_core.py

[error] 1-1: Black formatting changed this file. Re-run pre-commit to format correctly.

🪛 GitHub Check: Check for broken markdown links
examples/deployments/powerplanner/README.md

[failure] 482-482:
Broken link: examples/deployments/powerplanner/BARE_METAL_DEPLOYMENT.md - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/examples/deployments/powerplanner/README.md?plain=1#L482

🪛 LanguageTool
examples/deployments/powerplanner/README.md

[style] ~251-~251: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ...ally - Configures Prometheus relabeling automatically - Removes namespace override for correc...

(ADVERB_REPETITION_PREMIUM)


[uncategorized] ~482-~482: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...iners) - ✅ To test enforcement: Use bare metal deployment (see [`examples/deployments/...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~751-~751: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...river) - ✅ Real Kubernetes clusters - ✅ Bare metal deployments - ✅ Multi-node GPU clusters...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~760-~760: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...erplanner/BARE_METAL_DEPLOYMENT.md`** - Bare metal deployment guide (full power enforcemen...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~772-~772: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...* - Complete clean test (all phases) Bare Metal Deployment (Full Enforcement): - **`e...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~774-~774: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: .../BARE_METAL_DEPLOYMENT.md** - Complete bare metal deployment guide **Monitoring:** - **...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
examples/deployments/powerplanner/README.md

15-15: Bare URL used

(MD034, no-bare-urls)


27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


77-77: Bare URL used

(MD034, no-bare-urls)


157-157: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


183-183: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


328-328: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


344-344: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


488-488: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


507-507: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


522-522: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


663-663: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


894-894: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

components/power_agent/README.md

83-83: Bare URL used

(MD034, no-bare-urls)


84-84: Bare URL used

(MD034, no-bare-urls)

🪛 OSV Scanner (2.3.1)
components/power_agent/requirements.txt

[HIGH] 1-1: urllib3 1.26.20: urllib3 streaming API improperly handles highly compressed data

(GHSA-2xpw-w6gg-jr37)


[HIGH] 1-1: urllib3 1.26.20: Decompression-bomb safeguards bypassed when following HTTP redirects (streaming API)

(GHSA-38jv-5279-wg99)


[HIGH] 1-1: urllib3 1.26.20: urllib3 allows an unbounded number of links in the decompression chain

(GHSA-gm62-xv2j-4w53)


[HIGH] 1-1: urllib3 1.26.20: urllib3 redirects are not disabled when retries are disabled on PoolManager instantiation

(GHSA-pq67-6m6q-mj2v)

🪛 Ruff (0.14.10)
components/src/dynamo/planner/utils/prometheus.py

206-206: Consider moving this statement to an else block

(TRY300)


208-208: Do not catch blind exception: Exception

(BLE001)


209-209: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


222-222: Consider moving this statement to an else block

(TRY300)


223-223: Do not catch blind exception: Exception

(BLE001)


224-224: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

components/src/dynamo/planner/kubernetes_connector.py

297-297: Do not catch blind exception: Exception

(BLE001)


298-298: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

components/power_agent/power_agent.py

62-62: Avoid specifying long messages outside the exception class

(TRY003)


78-78: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


110-110: Consider moving this statement to an else block

(TRY300)


112-112: Do not catch blind exception: Exception

(BLE001)


113-113: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


156-156: Do not catch blind exception: Exception

(BLE001)


200-200: Loop control variable pid not used within loop body

Rename unused pid to _pid

(B007)


227-227: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


228-228: Do not catch blind exception: Exception

(BLE001)


229-229: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


240-240: Do not catch blind exception: Exception

(BLE001)


241-241: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

components/src/dynamo/planner/utils/planner_core.py

507-507: Do not catch blind exception: Exception

(BLE001)


615-615: Do not catch blind exception: Exception

(BLE001)


661-661: Do not catch blind exception: Exception

(BLE001)


662-662: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🪛 Shellcheck (0.11.0)
examples/deployments/powerplanner/full_clean_test.bash

[warning] 24-24: BLUE appears unused. Verify use (or export if used externally).

(SC2034)

examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash

[warning] 10-10: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 21-21: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 112-112: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[warning] 116-116: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

examples/deployments/powerplanner/monitor_poweraware.bash

[warning] 67-67: STATUS appears unused. Verify use (or export if used externally).

(SC2034)

🔇 Additional comments (24)
components/src/dynamo/planner/kube.py (1)

49-49: LGTM!

The addition of the CoreV1Api client is clean and provides the necessary functionality for pod-level operations required by the power-aware features.

examples/deployments/powerplanner/planner-clusterrole-patch.yaml (1)

7-30: LGTM!

The ClusterRole grants appropriate permissions for the power planner to manage DynamoGraphDeployments and patch pod annotations with power limits. The principle of least privilege is followed with minimal required verbs.

deploy/power_agent/daemonset.yaml (2)

70-81: Acknowledged: Privileged mode and hostPID are required for NVML power management.

The static analysis flags (CKV_K8S_16, CKV_K8S_17, CKV_K8S_20, CKV_K8S_23, CKV_K8S_39) are valid security concerns. However, the privileged mode, hostPID, and SYS_ADMIN capability are necessary for this DaemonSet to:

  • Access /proc/{pid}/cgroup for PID-to-pod mapping
  • Use NVML to set hardware GPU power limits

This is documented in the README and comments. Consider adding a security policy exception annotation or documenting this in a security policy file for audit purposes.


111-114: Clarify host-proc volume behavior on non-Minikube clusters.

The host-proc volume references /host/proc which only exists in Minikube with --mount-string="/proc:/host/proc". On standard Kubernetes clusters, DirectoryOrCreate will create an empty directory at /host/proc, and the agent will fail to map PIDs correctly.

Consider documenting this limitation more prominently or implementing fallback logic in the agent to detect and handle this scenario.

deploy/power_agent/README.md (1)

1-217: LGTM!

The deployment documentation is comprehensive, covering prerequisites, quick start, architecture, troubleshooting, and security considerations. The note about GPU power limits persisting after uninstall (line 210) is an important operational detail.

components/power_agent/Dockerfile (1)

4-4: No compatibility issues found. CUDA 12.1.0 and nvidia-ml-py==12.535.133 are compatible: the NVIDIA 535 driver series, which nvidia-ml-py 12.535.133 targets, provides full support for CUDA 12.x.

components/src/dynamo/planner/utils/planner_argparse.py (1)

135-161: LGTM! Power awareness arguments are well-structured.

The new CLI arguments for power awareness follow the existing patterns and provide clear help text. The integration with SLAPlannerDefaults is appropriate and maintains consistency with the rest of the codebase.

examples/deployments/powerplanner/agg.yaml (2)

1-35: LGTM! Manifest follows dynamo CR conventions.

The DynamoGraphDeployment manifest is well-structured and uses the correct gpu: "1" format for GPU resources as per the team's conventions. The configuration is appropriate for the power-aware deployment scenario.


16-16: No action needed. The image version nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.0.post2 is consistently used across all powerplanner deployment manifests (agg.yaml, disagg.yaml, and profile_sla_aic_dgdr.yaml). Version consistency within this deployment set is verified.

examples/deployments/powerplanner/profile_sla_aic_dgdr.yaml (1)

1-37: LGTM! Well-structured profiling configuration.

The DynamoGraphDeploymentRequest manifest is properly configured with clear comments, appropriate AI Configurator settings, and explicit SLA targets. The autoApply: true setting provides good automation for the deployment workflow.

examples/deployments/powerplanner/dynamo-worker-podmonitor.yaml (2)

19-25: LGTM! Correct label relabeling for Prometheus integration.

The relabeling configuration properly converts the pod label nvidia.com/dynamo_namespace to the dynamo_namespace metric label. The inline comment about Prometheus converting special characters to underscores is helpful for maintainability.


26-31: LGTM! Metric relabeling ensures frontend compatibility.

The metric relabeling from model_name to model ensures compatibility with frontend metrics, which is important for consistent metric consumption across the system.
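
For reference, a hedged sketch of what such relabeling rules look like in a PodMonitor spec. Field values are illustrative, not copied from the manifest:

```yaml
# Illustrative PodMonitor relabeling matching the behavior described above.
podMetricsEndpoints:
  - port: metrics
    relabelings:
      # The pod label nvidia.com/dynamo_namespace arrives as
      # __meta_kubernetes_pod_label_nvidia_com_dynamo_namespace,
      # because Prometheus maps non-alphanumeric characters to '_'.
      - sourceLabels: [__meta_kubernetes_pod_label_nvidia_com_dynamo_namespace]
        targetLabel: dynamo_namespace
    metricRelabelings:
      # Rename model_name -> model for frontend metric compatibility.
      - sourceLabels: [model_name]
        targetLabel: model
```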

examples/deployments/powerplanner/README.md (1)

1-895: Excellent comprehensive documentation!

This README provides thorough documentation covering architecture, deployment steps, verification procedures, troubleshooting, and production readiness criteria. The structure is clear with a helpful table of contents, and the content addresses both Minikube and production deployment scenarios. Once the minor formatting issues are resolved, this will be a valuable resource for users.

examples/deployments/powerplanner/disagg.yaml (1)

1-58: LGTM! Well-structured power-aware deployment manifest.

The DynamoGraphDeployment manifest is correctly configured with proper GPU resource limits (using gpu: "1" format), consistent model references, and appropriate subComponentType declarations for the disaggregated architecture.

examples/deployments/powerplanner/deploy_poweraware.bash (6)

18-19: Verify directory name consistency.

Same potential typos as in verify_poweraware.bash:

  • Line 18: bin_bin (should be bin?)
  • Line 19: minibute_home (should be minikube_home?)

Please ensure these directory names are correct and consistent across scripts.


53-102: Well-implemented profiling wait logic.

The conditional profiling data creation with timeout, progress reporting, and helpful error messages is well-designed. The 10-minute timeout is appropriate for model profiling operations.


106-130: LGTM! Efficient image build and load logic.

The image build process with output redirection, error handling, and duplicate load prevention is well-implemented. The check for existing images in Minikube (lines 123-129) optimizes repeated script runs.


137-150: Verify Dockerfile COPY path context.

Line 137 changes to ${DEV_REPO} before building, and line 144's COPY command assumes the Docker build context is the repo root. This is correct, but ensure the Docker build command on line 152 uses the correct context (.).

Line 152 does use -f /tmp/Dockerfile.planner-custom -t dynamo/planner-power-aware:dev . with . as the build context, so this is correct.


211-313: LGTM! Comprehensive DynamoGraphDeployment manifest.

The inline YAML generation creates a complete deployment with:

  • Power-aware planner configuration (lines 249-252)
  • Proper resource limits
  • Profiling data volume mounts
  • Consistent worker configuration

The manifest structure matches the standalone disagg.yaml file with the addition of the power-aware planner service.


344-374: Excellent deployment completion guidance.

The final output provides clear next steps, monitoring commands, and testing examples. This significantly improves the user experience for operators deploying the power-aware system.

examples/deployments/powerplanner/monitor_poweraware.bash (2)

103-177: LGTM! Well-structured monitoring utility.

The remaining functions (show_live_logs, show_agent_logs, show_help) and the main command dispatcher are well-implemented with good error handling and user experience. The 5-second refresh interval for the status dashboard is reasonable for monitoring purposes.


1-1: Fix trailing whitespace to pass pre-commit checks.

The file contains trailing whitespace on multiple lines that needs to be cleaned. Run pre-commit run --all-files to fix formatting issues.

The shebang #!/usr/bin/bash is the standard location for bash on modern Linux distributions (Ubuntu, Debian, RHEL, CentOS, Fedora, and others) and does not require changes for portability.

Likely an incorrect or invalid review comment.

components/src/dynamo/planner/utils/planner_core.py (2)

748-750: LGTM! Correct integration point for power limit enforcement.

Calling apply_power_limits() after setting component replicas ensures that power limits are applied to all active pods, including newly scaled instances. The async integration is correct and maintains the planner's non-blocking behavior.


498-588: Power budget enforcement logic is sound and correctly implemented.

The code properly:

  • Calculates projected power consumption for the planned replica configuration
  • Applies proportional reduction when over budget (safe calculation since only executes when requested_total_power > limit)
  • Handles rounding errors with a secondary check that further adjusts decode replicas if needed
  • Respects min_endpoint constraints via the max() operations

The early return when power query fails is appropriate—it allows the planner to continue operating without power enforcement rather than blocking autoscaling entirely.
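
As a rough illustration of the enforcement logic described above, a minimal sketch follows. The function name, signature, and power figures are hypothetical, not the planner's actual API:

```python
import math


def enforce_power_budget(
    prefill_replicas: int,
    decode_replicas: int,
    prefill_power_w: float,
    decode_power_w: float,
    budget_w: float,
    min_endpoint: int = 1,
) -> tuple[int, int]:
    """Scale planned replica counts down proportionally to fit a power budget."""
    requested = prefill_replicas * prefill_power_w + decode_replicas * decode_power_w
    if requested <= budget_w:
        return prefill_replicas, decode_replicas  # already within budget

    # Division is safe: this branch only runs when requested > budget_w >= 0.
    scale = budget_w / requested
    prefill = max(min_endpoint, math.floor(prefill_replicas * scale))
    decode = max(min_endpoint, math.floor(decode_replicas * scale))

    # Rounding (and the min_endpoint floor) can still leave us over budget;
    # trim decode replicas further as a secondary check.
    while (
        prefill * prefill_power_w + decode * decode_power_w > budget_w
        and decode > min_endpoint
    ):
        decode -= 1
    return prefill, decode


print(enforce_power_budget(4, 8, 300.0, 300.0, 2400.0))  # → (2, 5)
```

The max() calls mirror the min_endpoint constraint noted above: even a hard over-budget condition never scales a component to zero replicas.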

self.device_count = pynvml.nvmlDeviceGetCount()
logger.info(f"Initialized NVML. Found {self.device_count} GPUs on node {self.node_name}.")
except pynvml.NVMLError as e:
logger.error(f"Failed to initialize NVML: {e}")

🛠️ Refactor suggestion | 🟠 Major

Use logging.exception for better error diagnostics.

Multiple error handlers use logging.error without tracebacks. Using logging.exception automatically includes the traceback, which is crucial for debugging issues in production.

🔍 Apply to all error handlers
         except pynvml.NVMLError as e:
-            logger.error(f"Failed to initialize NVML: {e}")
+            logger.exception(f"Failed to initialize NVML: {e}")
             raise

Apply the same pattern to:

  • Line 113: logger.error(f"Failed to list pods: {e}")
  • Line 227: logger.error(f"NVML error on GPU {gpu_idx}: {e}")
  • Line 229: logger.error(f"Unexpected error on GPU {gpu_idx}: {e}")
  • Line 241: logger.error(f"Error in reconciliation loop: {e}")

Based on static analysis hints.

Also applies to: 113-113, 227-229, 241-241

🧰 Tools
🪛 Ruff (0.14.10)

78-78: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In @components/power_agent/power_agent.py at line 78, Replace the plain
logger.error calls with logger.exception so tracebacks are included: update the
NVML initialization log in initialize_nvml (currently logger.error(f"Failed to
initialize NVML: {e}")), the pod listing error in the function that lists pods
(logger.error(f"Failed to list pods: {e}")), both GPU error handlers in the GPU
monitoring routine (logger.error(f"NVML error on GPU {gpu_idx}: {e}") and
logger.error(f"Unexpected error on GPU {gpu_idx}: {e}")), and the reconciliation
loop error (logger.error(f"Error in reconciliation loop: {e}")) to use
logger.exception while keeping the existing messages for context.
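
A minimal, self-contained illustration of the difference; the logger name and the simulated NVML failure are stand-ins, not taken from power_agent.py:

```python
import logging

records: list[logging.LogRecord] = []


class ListHandler(logging.Handler):
    """Capture log records in a list so we can inspect them."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(record)


logger = logging.getLogger("power_agent_demo")
logger.addHandler(ListHandler())
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep demo output out of the root logger


def risky() -> None:
    # Stand-in for pynvml.nvmlInit() failing; the real agent catches NVMLError.
    raise RuntimeError("NVML init failed (simulated)")


try:
    risky()
except RuntimeError as e:
    # logger.error(f"...") would record only the message string;
    # logger.exception records the message *and* exc_info (the traceback),
    # which is what this review suggests for production diagnostics.
    logger.exception(f"Failed to initialize NVML: {e}")

print(records[0].exc_info is not None)  # → True
```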

Comment on lines 11 to 12
export PATH=${DEV_REPO}/bin_bin:$PATH
export MINIKUBE_HOME=${DEV_REPO}/minibute_home

⚠️ Potential issue | 🔴 Critical

Critical typos in path variables.

Line 11 has bin_bin instead of bin, and Line 12 has minibute_home instead of minikube_home. These typos will cause the script to use incorrect paths.

🐛 Proposed fix
-export PATH=${DEV_REPO}/bin_bin:$PATH
-export MINIKUBE_HOME=${DEV_REPO}/minibute_home
+export PATH=${DEV_REPO}/bin:$PATH
+export MINIKUBE_HOME=${DEV_REPO}/minikube_home
🤖 Prompt for AI Agents
In @examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash around
lines 11 - 12, The script sets incorrect path variable values due to typos:
change PATH export from using "${DEV_REPO}/bin_bin" to "${DEV_REPO}/bin" and
correct MINIKUBE_HOME from "${DEV_REPO}/minibute_home" to
"${DEV_REPO}/minikube_home" so the variables PATH and MINIKUBE_HOME point to the
intended directories; update the lines that reference these exact symbols
(export PATH=... and export MINIKUBE_HOME=...) accordingly and ensure no other
occurrences of the misspellings remain.

Comment on lines 15 to 16
export PATH=${DEV_REPO}/bin_bin:$PATH
export MINIKUBE_HOME=${DEV_REPO}/minibute_home

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check if these directories exist
fd -t d "(bin_bin|minibute_home)" --hidden 2>/dev/null | head -20

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Read the specific file to see context around lines 15-16
cat -n examples/deployments/powerplanner/verify_poweraware.bash | head -30

Repository: ai-dynamo/dynamo

Length of output: 1153


🏁 Script executed:

#!/bin/bash
# Search for any references to "bin_bin" or "minibute_home" in the codebase
rg "(bin_bin|minibute_home)" --type bash --type sh -B 2 -A 2

Repository: ai-dynamo/dynamo

Length of output: 88


🏁 Script executed:

#!/bin/bash
# Search for references to these names (using sh file type or no type filter)
rg "(bin_bin|minibute_home)" -B 2 -A 2

Repository: ai-dynamo/dynamo

Length of output: 5466


🏁 Script executed:

#!/bin/bash
# Check what actual directories exist at the DEV_REPO level
fd -t d "bin" --max-depth 2 2>/dev/null | grep -E "(bin|minikube)" | head -20

Repository: ai-dynamo/dynamo

Length of output: 73


🏁 Script executed:

#!/bin/bash
# Search for any "minikube" related patterns (checking for similar names)
rg "minikube" --type sh -i | head -20

Repository: ai-dynamo/dynamo

Length of output: 2273


Fix systematic typos in directory path names across powerplanner deployment scripts.

The directory names bin_bin and minibute_home are typos that appear consistently across multiple deployment scripts. These should be bin and minikube_home respectively. This issue affects:

  • examples/deployments/powerplanner/verify_poweraware.bash (lines 15-16)
  • examples/deployments/powerplanner/deploy_poweraware.bash
  • examples/deployments/powerplanner/full_clean_test.bash
  • examples/deployments/powerplanner/deploy_poweraware_baseinfra.bash
  • examples/deployments/powerplanner/README.md

Update all files to use the correct names consistently.

🤖 Prompt for AI Agents
In @examples/deployments/powerplanner/verify_poweraware.bash around lines 15 -
16, Replace the systematic typos in the powerplanner deployment scripts by
changing the PATH entry value from "${DEV_REPO}/bin_bin" to "${DEV_REPO}/bin"
and the MINIKUBE_HOME export from "${DEV_REPO}/minibute_home" to
"${DEV_REPO}/minikube_home" across all affected files (references appear in
verify_poweraware.bash, deploy_poweraware.bash, full_clean_test.bash,
deploy_poweraware_baseinfra.bash and README.md); update the export lines that
set PATH and MINIKUBE_HOME to use the corrected directory names and ensure any
README examples reflect the same corrected paths.

@kaim-eng kaim-eng force-pushed the power-planner-dev branch 10 times, most recently from ce08849 to 33480b6 Compare January 10, 2026 19:21
@github-actions github-actions bot added documentation Improvements or additions to documentation ci Issues/PRs that reference CI build/test planner labels Jan 10, 2026
- Reorganize kaim_dev_env to examples/deployments/powerplanner/
- Rename scripts: deploy_power_aware.bash -> deploy_poweraware.bash,
  verify_deployment.bash -> verify_poweraware.bash,
  monitor_power_aware.bash -> monitor_poweraware.bash
- Copy agg.yaml and disagg.yaml to powerplanner directory
- Combine POWER_ENFORCEMENT_SUCCESS.md and POWER_PLANNER_README.md into README.md
- Add prerequisite checks for HF_TOKEN and required binaries (kubectl, minikube, helm, docker)
- Add GPU power limit verification test to verify_poweraware.bash
- Update all script references and paths to reflect new directory structure
- Fix Power Agent image caching issue to ensure proper /host/proc workaround deployment
- Clean up unused files (set_limit.py, gpulog_analyzer.py, parse_workload.bash, for_nuno.bash)
- Update documentation with setup instructions and warnings about prerequisites

Signed-off-by: Kai Ma <[email protected]>
Restructure README.md into user-centric documentation and add CHANGELOG.md
with technical implementation details for developers.

Changes:
- Move project status and test results to CHANGELOG.md
- Reorganize README into linear flow for users
- Add detailed technical flows (metrics, enforcement) to CHANGELOG
- Document scaling algorithm and file/function call chains

Signed-off-by: Kai Ma <[email protected]>
Fix pre-commit validation failure by removing trailing whitespace.

Signed-off-by: Kai Ma <[email protected]>
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test and removed ci Issues/PRs that reference CI build/test labels Jan 12, 2026