Add Kubernetes multi-node SFT backend validation assets and docs #1271

Open
esnvidia wants to merge 13 commits into NVIDIA-NeMo:main from esnvidia:feature/kubernetes-backend

Conversation


@esnvidia esnvidia commented Feb 23, 2026

Summary

This PR adds Kubernetes multi-node SFT support and validation assets, including runnable scripts, manifests, and
execution evidence from completed runs.

Code changes

  • nemo_skills/pipeline/backends/base.py: extends the job spec with the fields needed for multi-node execution.
  • nemo_skills/pipeline/backends/kubernetes.py: multi-node Indexed Job + Headless Service flow, distributed env
    injection, RDMA options, DNS check init container, pod anti-affinity, image pull policy handling, RBAC preflight.
  • nemo_skills/pipeline/nemo_rl/sft.py: Kubernetes execution path routing for SFT workflows.
  • nemo_skills/pipeline/utils/declarative.py: declarative pipeline conversion updates for backend-aware job
    generation.
  • nemo_skills/pipeline/utils/cluster.py: backend/timeout config handling improvements.
  • pyproject.toml: adds integration pytest marker used by backend integration tests.
  • tests/test_backends.py, tests/test_pipeline_utils.py: expanded coverage for Kubernetes multi-node behavior and
    pipeline utilities.
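To make the multi-node piece concrete, here is a hedged sketch of the kind of distributed env injection an Indexed Job + Headless Service setup uses. The helper name build_distributed_env, the exact variable set, and the DNS naming are assumptions based on standard Kubernetes Indexed Job behavior, not the PR's actual code:

```python
# Hypothetical sketch: map an Indexed Job's completion index to the
# torch.distributed environment variables each pod needs. Names here are
# illustrative assumptions, not the PR's real implementation.

def build_distributed_env(job_name: str, service_name: str,
                          num_nodes: int, gpus_per_node: int) -> dict:
    """Env vars for each pod; the node rank comes from the Indexed Job index."""
    return {
        # Kubernetes sets JOB_COMPLETION_INDEX on Indexed Job pods; the
        # $(VAR) form is resolved by Kubernetes env-var expansion.
        "NODE_RANK": "$(JOB_COMPLETION_INDEX)",
        # Rank 0 gets a stable DNS name through the headless Service.
        "MASTER_ADDR": f"{job_name}-0.{service_name}",
        "MASTER_PORT": "29500",
        "NNODES": str(num_nodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "WORLD_SIZE": str(num_nodes * gpus_per_node),
    }

env = build_distributed_env("sft-job", "sft-svc", num_nodes=2, gpus_per_node=8)
print(env["MASTER_ADDR"])  # → sft-job-0.sft-svc
```

The rank-0 DNS name only resolves once the headless Service and pod exist, which is why the backend also adds a DNS check init container.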

Scripts and purpose

  • scripts/k8s-tests/validate_sft_e2e.sh: single-node end-to-end SFT validation.
  • scripts/k8s-tests/validate_sft_e2e_multinode.sh: multi-node E2E validator (Headless Service + Indexed Job + NCCL
    all-reduce checks).
  • scripts/k8s-tests/smoke_test_single_node.sh: minimal single-node DDP/NCCL smoke test.
  • scripts/k8s-tests/smoke_test_multi_node.sh: minimal multi-node DDP/NCCL smoke test.
  • scripts/k8s-tests/pipeline_smoke_test.py: exercises Pipeline -> KubernetesBackend path.
  • scripts/k8s-tests/sft_smoke_test.py: GPT-2 SFT smoke test via Pipeline+KubernetesBackend.
  • scripts/k8s-tests/run_sft_k8s_real.py: submits real SFT pipeline job through NeMo-Skills K8s code path.
  • scripts/k8s-tests/check_nccl_logs.py: parses NCCL logs for transport/topology sanity.
  • scripts/k8s-tests/generate_sft_data.py: synthetic SFT data generator.
  • scripts/k8s-tests/image-build/*: Kaniko build + node import automation and manifests.
  • scripts/k8s-tests/manifests/*: reusable K8s manifests for single-node, multi-node, and real NeMo-RL runs.

Docs

  • docs/kubernetes-guide.md: RBAC/service permission updates and guidance.
  • docs/kubernetes-migration.md: migration flow updates for multi-node readiness checks.
  • docs/kubernetes-status.md: status updated to include validated multi-node SFT.
  • docs/kubernetes-sft-runbook.md: infra/backend validation runbook and commands.
  • docs/kubernetes-nemo-rl-sft-runbook.md: real NeMo-RL SFT runbook (single-node + MNMG).

Cluster configs

  • cluster_configs/example-kubernetes.yaml: documents new K8s config knobs (rdma, dns_check, scheduling spread, rbac
    preflight, image pull policy).
  • cluster_configs/kubernetes/rbac.yaml: adds services create/delete/get/list/watch required for multi-node Headless
    Services.
  • cluster_configs/kubernetes/README.md: RBAC verification commands for services.

Cluster Config Integration

The Kubernetes backend changes are wired into the standard NeMo-Skills cluster-config flow.

  • CLI entrypoint (ns nemo_rl sft) loads cluster config via get_cluster_config(...) (nemo_skills/pipeline/nemo_rl/sft.py).
  • Config discovery includes ./cluster_configs/<name>.yaml (nemo_skills/pipeline/utils/cluster.py).
  • Backend selection uses executor: kubernetes from that config (nemo_skills/pipeline/backends/factory.py).
  • Kubernetes backend consumes config values including:
    • namespace, containers, env_vars, image_pull_policy
    • multi-node options: rdma, dns_check, scheduling, rbac_preflight
      (nemo_skills/pipeline/backends/kubernetes.py).
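As a rough sketch of that routing (the function below and its default are assumptions; the actual selection lives in nemo_skills/pipeline/backends/factory.py):

```python
# Illustrative executor dispatch keyed on the cluster config. The key names
# mirror the bullets above; the function itself is a hypothetical sketch.

def select_backend(cluster_config: dict) -> str:
    executor = cluster_config.get("executor", "slurm")  # assumed default
    if executor not in ("kubernetes", "slurm", "local"):
        raise ValueError(f"Unknown executor: {executor!r}")
    return executor

config = {
    "executor": "kubernetes",
    "namespace": "nemo-skills",
    "image_pull_policy": "IfNotPresent",
    "rdma": {"enabled": False},
    "dns_check": True,
}
print(select_backend(config))  # → kubernetes
```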

Important scope note

The scripts/k8s-tests/* utilities used for validation are mostly standalone and often build inline cluster_config
dicts or call kubectl directly, so they do not always load cluster_configs/*.yaml.
The cluster config updates in this PR are for the canonical CLI/backend path and user-facing documentation.

Validation artifacts (completed runs): intentionally not committed, but attached.


Summary by CodeRabbit

  • New Features

    • Unified backend API plus Kubernetes, Local, and Slurm backends; declarative job-spec helpers and Kubernetes SFT execution path.
  • Documentation

    • Extensive Kubernetes docs and runbooks, migration guide, status, troubleshooting, and examples.
  • Chores

    • Kubernetes cluster/example configs, RBAC/storage/image-pull guidance, image build/load tooling, manifest validation script, and pre-commit hook updates.
  • Tests

    • New smoke/integration/k8s test scripts and pytest integration marker; added unit tests for Kubernetes-specific behaviors.


coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a ComputeBackend abstraction and factory plus Kubernetes, Slurm, and Local backend implementations; integrates Kubernetes routing into the declarative pipeline; introduces extensive Kubernetes docs, cluster manifests, image-build and test tooling, pre-commit K8s validation hooks, and related unit tests and utilities.

Changes

Cohort / File(s) Summary
Backend core & factory
nemo_skills/pipeline/backends/base.py, nemo_skills/pipeline/backends/factory.py, nemo_skills/pipeline/backends/__init__.py
Introduce ComputeBackend API, JobSpec/JobHandle/ContainerSpec/ResourceSpec/JobStatus datatypes, registry and BackendFactory with health checks, fallback logic, and public re-exports.
Backend implementations
nemo_skills/pipeline/backends/kubernetes.py, nemo_skills/pipeline/backends/slurm.py, nemo_skills/pipeline/backends/local.py
Add KubernetesBackend (Indexed Jobs, headless Service, PVC/volume handling, multi-node env injection, lifecycle/log streaming), SlurmBackend adapter (wrapper around existing Slurm flow), and LocalBackend (host/Docker execution, process/log management).
Integration & declarative pipeline
nemo_skills/pipeline/backends/integration.py, nemo_skills/pipeline/nemo_rl/sft.py, nemo_skills/pipeline/utils/declarative.py, nemo_skills/pipeline/utils/scripts.py, nemo_skills/pipeline/utils/cluster.py
Add job-spec factories (inference/training/data-processing), run_job_and_wait, backend detection/validation, Kubernetes routing in Pipeline.run, name sanitization, backend-aware hostname_ref, timeout parsing helpers, and conversion of CommandGroups to JobSpec.
Kubernetes cluster configs & pre-commit
.pre-commit-config.yaml, cluster_configs/example-kubernetes.yaml, cluster_configs/kubernetes/...
Add example cluster config, RBAC, PVC templates, image-pull-secret manifest, Kubernetes README; extend pre-commit hooks for multi-doc K8s YAML and add local validate-k8s-manifests hook.
Docs & runbooks
docs/kubernetes-guide.md, docs/kubernetes-migration.md, docs/*sft-runbook.md, docs/kubernetes-status.md, cluster_configs/kubernetes/README.md
Add comprehensive Kubernetes guides, migration guide from Slurm, SFT/RL runbooks, troubleshooting, architecture notes, and status matrix.
K8s test tooling & manifests
scripts/k8s-tests/*, scripts/k8s-tests/manifests/*, scripts/k8s-tests/image-build/*
Add Kaniko-compatible Dockerfile and build tooling, build-and-load scripts, import DaemonSet, Kaniko Jobs, many K8s test manifests (single/multi-node SFT), smoke tests (bash/python), and image-build utilities.
Manifest validators & helpers
scripts/k8s-tests/validate_k8s_manifests.sh, scripts/k8s-tests/check_nccl_logs.py
Add kubeconform/kubectl wrapper for manifest validation and an NCCL log parser/validator for distributed training logs.
Smoke/validation scripts
scripts/k8s-tests/*.sh, scripts/k8s-tests/*.py
Add multiple smoke tests and end-to-end validation runners (single-node and multi-node SFT, pipeline smoke tests, SFT real-run orchestrators).
Tests & pytest config
tests/test_declarative_pipeline.py, tests/test_pipeline_utils.py, pyproject.toml
Add unit tests for Kubernetes hostname resolution and timeout formatting; add pytest marker for integration tests.
Pre-commit & misc
.pre-commit-config.yaml, requirements/audio.txt, copyright bump
Update pre-commit hooks for K8s multi-doc support and add local validator hook; small dependency reorder and copyright year bump.

Sequence Diagram(s)

sequenceDiagram
    participant User as Pipeline / User
    participant Factory as BackendFactory
    participant K8s as KubernetesBackend
    participant API as Kubernetes API
    participant DNS as DNS / Headless Service
    participant Pod as Job Pod

    User->>Factory: get_backend(cluster_config)
    Factory-->>K8s: initialize(cluster_config)
    K8s->>API: load kubeconfig / health_check
    API-->>K8s: OK
    Factory-->>User: backend instance

    User->>K8s: submit_job(JobSpec)
    alt multi-node
        K8s->>API: create Headless Service
        API-->>K8s: Service created
    end
    K8s->>API: create Job / Indexed Job
    API-->>K8s: Job created
    K8s-->>User: JobHandle

    loop monitor
        User->>K8s: get_status / wait_for_completion
        K8s->>API: query Job / Pod status
        API-->>K8s: status
        K8s-->>User: JobStatus
    end

    Pod->>DNS: resolve MASTER_ADDR (headless service)
    DNS-->>Pod: rank-0 IP
    Pod->>Pod: init distributed group (NCCL)

    User->>K8s: get_logs(handle, follow=True)
    K8s->>API: stream pod logs
    API-->>K8s: log lines
    K8s-->>User: log iterator

    User->>K8s: cleanup(handle)
    K8s->>API: delete Job
    alt multi-node
        K8s->>API: delete Headless Service
    end
    API-->>K8s: resources deleted
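The submit/monitor/cleanup loop in the diagram can be sketched in Python against a stub backend. The method names follow the walkthrough above, but the signatures and the stub are assumptions, not the PR's actual API:

```python
import time

class StubBackend:
    """Stand-in for KubernetesBackend: reports RUNNING twice, then SUCCEEDED."""
    def __init__(self):
        self._polls = 0

    def submit_job(self, job_spec: dict) -> str:
        return f"{job_spec['name']}-handle"

    def get_status(self, handle: str) -> str:
        self._polls += 1
        return "SUCCEEDED" if self._polls >= 3 else "RUNNING"

    def cleanup(self, handle: str) -> None:
        pass  # a real backend would delete the Job (and Service if multi-node)

def run_job_and_wait(backend, job_spec: dict, poll_s: float = 0.0) -> str:
    handle = backend.submit_job(job_spec)
    try:
        while True:
            status = backend.get_status(handle)
            if status in ("SUCCEEDED", "FAILED"):
                return status
            time.sleep(poll_s)
    finally:
        backend.cleanup(handle)  # runs even if polling raises

print(run_job_and_wait(StubBackend(), {"name": "sft-job"}))  # → SUCCEEDED
```

Putting cleanup in a finally block mirrors the diagram's guarantee that the Job (and headless Service) are deleted even when monitoring is interrupted.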

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

run GPU tests

Suggested reviewers

  • activatedgeek
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 75.38%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title "Add Kubernetes multi-node SFT backend validation assets and docs" is clear, specific, and directly summarizes the main change of adding Kubernetes multi-node support with validation and documentation.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity findings were prioritized as inline comments.

🟡 Minor comments (18)
scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s (1)

11-12: ⚠️ Potential issue | 🟡 Minor

NEMO_SKILLS_PATH build arg documented but never declared or used.

The header comment at line 12 mentions NEMO_SKILLS_PATH as a build arg, but no ARG NEMO_SKILLS_PATH is declared. Line 100 hardcodes COPY skills/ /opt/NeMo-Skills/. Either remove the misleading comment or wire up the ARG.

Suggested fix (remove misleading comment)
 # Build args:
 #   NEMO_RL_COMMIT: NeMo-RL commit to build from (default: pinned)
-#   NEMO_SKILLS_PATH: path to local NeMo-Skills code in build context (default: /skills)

Also applies to: 99-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s` around lines 11 - 12,
The header documents a build arg NEMO_SKILLS_PATH that isn't declared or used;
either remove the misleading comment or wire it up by adding ARG
NEMO_SKILLS_PATH with a sensible default and use it in the Dockerfile COPY
(replace COPY skills/ /opt/NeMo-Skills/ with COPY ${NEMO_SKILLS_PATH}
/opt/NeMo-Skills/), ensuring the variable name NEMO_SKILLS_PATH is declared
before use (ARG NEMO_SKILLS_PATH=/skills) so the build arg is recognized.
scripts/k8s-tests/validate_sft_e2e.sh (1)

177-178: ⚠️ Potential issue | 🟡 Minor

Checkpoint numbering mismatch between comments and print statements.

The inline comments and the print() labels are out of sync:

  • Line 177: comment says "Validation checkpoint 3" while the print outputs "Checkpoint 4: SFT training".
  • Line 208: comment says "Validation checkpoint 4" while the print outputs "Checkpoint 5: NCCL all-reduce".

This will confuse anyone correlating log output with the source.

Suggested fix
-# === Validation checkpoint 3: SFT-style training ===
-print(f"\n=== Checkpoint 4: SFT training (Rank {rank}) ===")
+# === Validation checkpoint 4: SFT-style training ===
+print(f"\n=== Checkpoint 4: SFT training (Rank {rank}) ===")
-# === Validation checkpoint 4: NCCL all-reduce ===
-print(f"\n=== Checkpoint 5: NCCL all-reduce (Rank {rank}) ===")
+# === Validation checkpoint 5: NCCL all-reduce ===
+print(f"\n=== Checkpoint 5: NCCL all-reduce (Rank {rank}) ===")

Also applies to: 208-209

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 177 - 178, The inline
checkpoint comments and the print() labels are out of sync; update either the
comment text ("Validation checkpoint 3"/"Validation checkpoint 4") or the
corresponding print messages (print(f"...Checkpoint 4: SFT training (Rank
{rank})...") and the print for "Checkpoint 5: NCCL all-reduce") so they match
consistently (e.g., change the print strings to "Checkpoint 3..." and
"Checkpoint 4..." or rename the comments to match); ensure both the comment
preceding the block and the print(...) string for SFT training and the
subsequent NCCL all-reduce block use the same checkpoint numbers.
scripts/k8s-tests/sft_smoke_test.py (1)

200-203: ⚠️ Potential issue | 🟡 Minor

Missing None check on pipeline.run() result before iterating.

If pipeline.run() returns None (e.g., backend error), result.items() on line 202 will raise AttributeError. The sibling script pipeline_smoke_test.py (line 221) guards against this.

Suggested fix
     result = pipeline.run(dry_run=False)
+    if result is None:
+        print("ERROR: Pipeline returned None (unexpected)")
+        sys.exit(1)
     print(f"\nSubmitted {len(result)} job(s)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/sft_smoke_test.py` around lines 200 - 203, The code calls
pipeline.run(dry_run=False) and immediately iterates result.items(), but
pipeline.run may return None causing an AttributeError; add a None-check after
result = pipeline.run(...) (e.g., if result is None: log an error or raise/exit)
before the print and for loop that reference result.items(), ensuring you
reference the pipeline.run call and the subsequent iteration over result.items()
in your fix.
scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)

249-270: ⚠️ Potential issue | 🟡 Minor

Cleanup is skipped when validation fails.

The exit 1 at line 262 prevents the cleanup block (lines 265–269) from executing, leaving the Job, Service, and ConfigMap behind. Consider using a trap for cleanup to ensure resources are always released.

Suggested fix

Add a cleanup trap near the top of the script (after variable setup):

+cleanup() {
+    echo "Cleaning up..."
+    kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+    kubectl delete svc "$SVC" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+    kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+}
+trap cleanup EXIT

Then remove the explicit cleanup block at lines 265–269 and the echo "Done." (the trap handles it). The exit 1 at line 262 will now trigger cleanup automatically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh` around lines 249 - 270, The
script currently exits early on failure and skips resource cleanup; define a
cleanup function (e.g., cleanup() that kubectl deletes JOB_NAME, SVC, and the
configmap in NAMESPACE and prints completion) and register it with trap 'EXIT'
(and optionally ERR) near the top after variables are set so any exit runs
cleanup; then remove the explicit cleanup block at the end (the echo "Done." and
the kubectl delete lines) since the trap will handle resource deletion when exit
1 is triggered.
scripts/k8s-tests/validate_sft_e2e.sh (1)

350-351: ⚠️ Potential issue | 🟡 Minor

Cleanup is incomplete — ConfigMap is leaked, and cleanup is skipped on failure.

  1. The ConfigMap sft-e2e-script (created at line 236) is never deleted.
  2. Multiple exit 1 paths (lines 303, 347) bypass the cleanup block entirely.

Use a trap and include the ConfigMap deletion.

Suggested fix — add near top of script after variable setup
+cleanup() {
+    kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+    kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+}
+trap cleanup EXIT

Then replace lines 350–351:

-# Cleanup
-kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 350 - 351, Add a robust
cleanup via a trap so the job and the ConfigMap "sft-e2e-script" are always
removed even if the script exits early; implement a cleanup function (e.g.,
cleanup()) that deletes the Job ($JOB_NAME) and the ConfigMap "sft-e2e-script"
in $NAMESPACE (use --ignore-not-found and redirect stderr as needed), register
it with trap 'cleanup' EXIT INT TERM near the top after variables are set, and
remove or replace any standalone kubectl delete lines that were bypassed by
early exit so all deletions happen inside cleanup.
scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)

69-69: ⚠️ Potential issue | 🟡 Minor

Hardcoded ConfigMap name can cause collisions between concurrent runs.

The job name uses a unique timestamp suffix (mn-sft-test-$(date +%s | tail -c 6)), but the ConfigMap name is always sft-e2e-script. If two instances of this script run concurrently in the same namespace, one will overwrite the other's ConfigMap.

Suggested fix — derive ConfigMap name from job name
-kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
-kubectl create configmap sft-e2e-script -n "$NAMESPACE" --from-literal=train.py='
+CM_NAME="${JOB_NAME}-script"
+kubectl delete configmap "$CM_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+kubectl create configmap "$CM_NAME" -n "$NAMESPACE" --from-literal=train.py='

And update the volume reference at line 181:

-      - name: script
-        configMap: {name: sft-e2e-script}
+      - name: script
+        configMap: {name: ${CM_NAME}}

And cleanup at line 269:

-kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+kubectl delete configmap "$CM_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh` at line 69, Derive a unique
ConfigMap name from the job name instead of using the hardcoded
"sft-e2e-script": create a variable like CONFIGMAP_NAME="${JOB_NAME}-script"
(where JOB_NAME is set as mn-sft-test-$(date +%s | tail -c 6)), use kubectl
create configmap "$CONFIGMAP_NAME" ... when creating it, replace any hardcoded
references (the volume definition/mount that currently points to sft-e2e-script)
to reference "$CONFIGMAP_NAME" (the volume name and items), and update the
cleanup command to kubectl delete configmap "$CONFIGMAP_NAME" --ignore-not-found
so concurrent runs don’t collide.
nemo_skills/pipeline/backends/base.py (1)

61-69: ⚠️ Potential issue | 🟡 Minor

Missing cross-validation: memory_request_gb > memory_limit_gb.

When both memory_request_gb and memory_limit_gb are specified, the request should not exceed the limit. Kubernetes will reject a pod spec where the request exceeds the limit, so validating here gives a clearer error message.

Proposed fix
         if self.memory_limit_gb is not None and self.memory_limit_gb <= 0:
             raise ValueError(f"memory_limit_gb must be positive, got {self.memory_limit_gb}")
+        if (
+            self.memory_request_gb is not None
+            and self.memory_limit_gb is not None
+            and self.memory_request_gb > self.memory_limit_gb
+        ):
+            raise ValueError(
+                f"memory_request_gb ({self.memory_request_gb}) cannot exceed "
+                f"memory_limit_gb ({self.memory_limit_gb})"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/base.py` around lines 61 - 69, Add a
cross-validation in __post_init__ to ensure memory_request_gb does not exceed
memory_limit_gb: after the existing non-negative/positive checks for
memory_request_gb and memory_limit_gb, if both self.memory_request_gb and
self.memory_limit_gb are not None and self.memory_request_gb >
self.memory_limit_gb, raise a ValueError with a clear message like
"memory_request_gb (X) cannot be greater than memory_limit_gb (Y)". This uses
the existing __post_init__ method and the fields memory_request_gb and
memory_limit_gb.
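Run standalone, the suggested check looks like this (a trimmed-down ResourceSpec with only the two memory fields; the real dataclass has more):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceSpec:
    # Illustrative subset of the ResourceSpec fields discussed above.
    memory_request_gb: Optional[float] = None
    memory_limit_gb: Optional[float] = None

    def __post_init__(self):
        # Positivity checks, as in the existing validation.
        for name in ("memory_request_gb", "memory_limit_gb"):
            value = getattr(self, name)
            if value is not None and value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")
        # Cross-validation: Kubernetes rejects request > limit, so fail early.
        if (
            self.memory_request_gb is not None
            and self.memory_limit_gb is not None
            and self.memory_request_gb > self.memory_limit_gb
        ):
            raise ValueError(
                f"memory_request_gb ({self.memory_request_gb}) cannot exceed "
                f"memory_limit_gb ({self.memory_limit_gb})"
            )

ResourceSpec(memory_request_gb=4, memory_limit_gb=8)   # fine
# ResourceSpec(memory_request_gb=16, memory_limit_gb=8)  # would raise ValueError
```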
docs/kubernetes-guide.md (1)

562-570: ⚠️ Potential issue | 🟡 Minor

Remove cluster_config=cluster_config from the GenerationClientScript example.

GenerationClientScript does not have a cluster_config parameter. Only ServerScript and SandboxScript accept it. Passing it here will cause a TypeError at runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-guide.md` around lines 562 - 570, The GenerationClientScript
instantiation incorrectly passes a non-existent parameter cluster_config which
will raise a TypeError; remove the cluster_config=cluster_config argument from
the GenerationClientScript call (leave cluster_config only on ServerScript or
SandboxScript where supported) so the call to GenerationClientScript(...) uses
only its valid parameters like output_dir, servers, model_names, server_types,
and extra_arguments.
docs/kubernetes-status.md (1)

39-47: ⚠️ Potential issue | 🟡 Minor

Contradictory placement: multi-node training listed as "Not Implemented" but status shows ✅ validated.

Line 43 places multi-node training under "Not Implemented" yet shows "✅ Backend + cluster validated". The rest of this PR (runbook, manifests, _run_sft_kubernetes()) demonstrates a working Indexed Job + Headless Service approach. Consider moving this to "Degraded Functionality" with a note that CLI integration (ns nemo_rl sft --num_nodes) is pending, or updating the section heading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-status.md` around lines 39 - 47, The "Not Implemented" table
incorrectly lists "Multi-node training" as not implemented despite the diff and
code (Indexed Job + Headless Service, runbook, and _run_sft_kubernetes())
showing it validated; update the docs by moving the "Multi-node training" row
out of the "Not Implemented" section into "Degraded Functionality" (or change
the section heading) and add a short note that CLI integration via "ns nemo_rl
sft --num_nodes" is still pending while the backend/cluster (Indexed Job +
Headless Service) is validated.
docs/kubernetes-status.md (1)

54-61: ⚠️ Potential issue | 🟡 Minor

Stale reference: "Multi-node training needs MPI Operator" contradicts the validated Indexed Job approach.

Line 56 states MPI Operator is needed, but the PR validates multi-node via Indexed Job + Headless Service without MPI Operator. Similarly, line 85 repeats this. These references appear to be from an earlier design phase and should be updated to reflect the current architecture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-status.md` around lines 54 - 61, Stale docs claim "Multi-node
training needs MPI Operator" but the implementation uses Indexed Job + Headless
Service; update the table row containing "Multi-node training needs MPI
Operator" (the entry tied to `ns nemo_rl sft --num_nodes=8`) and any repeated
text (the other occurrence that repeats the MPI Operator claim) to state the
validated approach ("Supported via Indexed Job + Headless Service, no MPI
Operator required") and remove or rephrase the earlier-design references; ensure
any surrounding notes and the lines referencing the same behavior (e.g., the
repeated sentence around the later paragraph) are adjusted to reflect the
current architecture.
scripts/k8s-tests/run_sft_k8s_real.py (1)

37-42: ⚠️ Potential issue | 🟡 Minor

--skip-conversion is a dead argument: always True, never referenced.

action="store_true" with default=True makes the flag permanently True with no way to disable it. More importantly, args.skip_conversion is never used in main(). Either remove this argument or integrate it into the pipeline construction to conditionally skip the conversion job.

Suggested fix (remove if unused)
-    parser.add_argument(
-        "--skip-conversion",
-        action="store_true",
-        default=True,
-        help="Skip checkpoint conversion job (default: True for testing)",
-    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/run_sft_k8s_real.py` around lines 37 - 42, The
parser.add_argument call for "--skip-conversion" is dead (action="store_true"
with default=True makes it always true and args.skip_conversion is never used);
either remove this argument declaration or wire args.skip_conversion into main()
so the conversion job is conditionally skipped. Locate the parser.add_argument
for "--skip-conversion" and either delete it, or update main() (where the
pipeline/conversion job is constructed) to check args.skip_conversion and skip
creating or submitting the conversion job accordingly (use the
args.skip_conversion boolean to gate the conversion step).
nemo_skills/pipeline/backends/slurm.py (1)

120-120: ⚠️ Potential issue | 🟡 Minor

" ".join(command) can break commands with arguments containing spaces.

If main_container.command contains arguments with spaces (e.g., ["python", "-c", "print('hello world')"]), joining with a space and passing as a single string to add_task will lose the argument boundaries. Consider using shlex.join() for proper shell quoting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/slurm.py` at line 120, The code builds a
command string with cmd = " ".join(main_container.command), which breaks
arguments that contain spaces; update the construction of cmd to use
shlex.join(main_container.command) (import shlex at top) so arguments are
shell-quoted correctly before passing cmd to add_task (or wherever cmd is used),
ensuring main_container.command remains an array and argument boundaries are
preserved.
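The difference is easy to demonstrate with the standard library:

```python
import shlex

command = ["python", "-c", "print('hello world')"]

naive = " ".join(command)     # argument boundaries are lost
quoted = shlex.join(command)  # each argument is shell-quoted

assert shlex.split(naive) != command   # the shell would see mangled arguments
assert shlex.split(quoted) == command  # round-trips exactly
```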
scripts/k8s-tests/check_nccl_logs.py (1)

95-98: ⚠️ Potential issue | 🟡 Minor

"IB" in line is prone to false positives.

The substring "IB" matches many unrelated words in log output (e.g., "LIB", "INHIBIT", "POSSIBLE", "VISIBLE"). This could cause ib_detected to be set incorrectly. Consider using a word-boundary regex or a more specific pattern.

Proposed fix
-        if "IB" in line and ("NCCL INFO" in line or "transport" in line.lower()):
+        if re.search(r'\bIB\b|InfiniBand|ibverbs', line) and ("NCCL INFO" in line or "transport" in line.lower()):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/check_nccl_logs.py` around lines 95 - 98, The check is
using a plain substring check ("IB" in line) which yields false positives;
update the detection in the loop that sets result.ib_detected and appends to
result.transport_used to use a word-boundary or more specific regex (e.g.,
r'\bIB\b' or explicit patterns like "InfiniBand" / "IB:"), and apply
case-insensitive matching while preserving the existing NCCL INFO/transport
logic so only true "IB" tokens trigger result.ib_detected and appending to
result.transport_used.
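A quick check of the boundary-anchored pattern against typical log shapes (the sample lines below are illustrative, not captured from a real run):

```python
import re

IB_PATTERN = re.compile(r"\bIB\b|InfiniBand|ibverbs")

# Plain substring matching ("IB" in line) would flag all of these;
# the word-boundary pattern does not.
false_positives = [
    "NCCL INFO Using LIB path",
    "POSSIBLE transport issue",
    "CUDA_VISIBLE_DEVICES set for transport",
]
true_positives = [
    "NCCL INFO NET/IB : Using interface mlx5_0",
    "NCCL INFO transport: InfiniBand",
]

assert not any(IB_PATTERN.search(line) for line in false_positives)
assert all(IB_PATTERN.search(line) for line in true_positives)
```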
scripts/k8s-tests/smoke_test_multi_node.sh (1)

131-139: ⚠️ Potential issue | 🟡 Minor

DNS wait loop silently continues on resolution failure.

If the master address never resolves, the loop completes all 60 iterations and execution falls through to torchrun, which will fail with a confusing connection error. Compare with the multinode manifest (real-nemo-rl-sft-multinode-test.yaml lines 66-69) which correctly adds a post-loop check and exit 1.

Proposed fix
           # Wait for DNS resolution of master node
           echo "Waiting for master DNS resolution..."
           for i in \$(seq 1 60); do
             if getent hosts ${MASTER_ADDR} > /dev/null 2>&1; then
               echo "Master resolved: \$(getent hosts ${MASTER_ADDR})"
               break
             fi
             sleep 2
           done
+          if ! getent hosts ${MASTER_ADDR} > /dev/null 2>&1; then
+            echo "FAIL: Master DNS not resolvable after 120s"
+            exit 1
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 131 - 139, The DNS
wait loop using getent hosts ${MASTER_ADDR} may silently fail after 60
iterations and proceed to run torchrun; update the loop logic to check whether
resolution succeeded and if not print a clear error and exit non-zero.
Specifically, after the for loop that checks getent hosts ${MASTER_ADDR} and
breaks on success, add a post-loop check that verifies resolution (e.g., re-run
getent or track a success flag) and if unresolved emit a descriptive error
mentioning MASTER_ADDR and exit 1 so torchrun is not invoked on a bad master
address.
scripts/k8s-tests/check_nccl_logs.py (1)

100-103: ⚠️ Potential issue | 🟡 Minor

Regex alternation precedence issue.

r"NET.*Socket|TCP" matches either NET.*Socket or just TCP anywhere, because | has the lowest precedence. A bare TCP match (e.g., in a log line mentioning "TCP_NODELAY" or any TCP reference) would set tcp_fallback = True incorrectly.

Proposed fix
-        if re.search(r"NET.*Socket|TCP", line) and "NCCL" in line:
+        if re.search(r"NET.*(?:Socket|TCP)", line) and "NCCL" in line:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/check_nccl_logs.py` around lines 100 - 103, The regex
r"NET.*Socket|TCP" is too permissive because the alternation applies only to the
left term; update the pattern used in the conditional that sets
result.tcp_fallback (the if checking re.search(..., line) before setting
result.tcp_fallback and appending to result.transport_used) to group the
alternation and/or require a more specific TCP token (e.g. use a grouped
alternation like "(?:NET.*Socket|TCP/Socket)" or include word boundaries for TCP
as "\bTCP\b") so that incidental "TCP" substrings (like TCP_NODELAY) do not
trigger tcp_fallback and only the intended "NET...Socket" or "TCP/Socket"
matches do.
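The precedence pitfall is easy to reproduce in isolation. A small sketch with a hypothetical log line (not taken from real NCCL output):

```python
import re

# In r"NET.*Socket|TCP" the "|" splits the WHOLE pattern, so a bare "TCP"
# anywhere in the line matches. Grouping scopes the alternation correctly.
loose = r"NET.*Socket|TCP"
strict = r"NET.*(?:Socket|\bTCP\b)"  # grouped; \b also guards TCP_NODELAY

line = "NCCL INFO setting TCP_NODELAY on socket"  # hypothetical log line
print(bool(re.search(loose, line)))   # True -- matched via the bare "TCP" branch
print(bool(re.search(strict, line)))  # False -- no "NET..." prefix present
```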
scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml-101-105 (1)

101-105: ⚠️ Potential issue | 🟡 Minor

|| true on Ray worker masks genuine connection failures.

If the worker fails to connect to the Ray head (e.g., wrong address, network issue), the pod still exits 0 and the Job reports success. Consider adding a pre-check (e.g., verify Ray head is reachable) or logging the exit code before the || true so failures are at least observable in logs.

Suggested improvement
     else
       echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
-      ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block || true
+      ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
+      RAY_EXIT=$?
+      if [ $RAY_EXIT -ne 0 ]; then
+        echo "WARN: Ray worker exited with code $RAY_EXIT (expected when head shuts down)"
+      fi
       echo "Worker done (Ray head shut down)."
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around
lines 101 - 105, The shell wrapper for starting the Ray worker currently masks
failures by appending "|| true" to the "ray start
--address=\"${MASTER_ADDR}:6379\" --num-gpus=${NUM_GPUS} --block" command;
remove the unconditional "|| true" and instead perform a pre-check that the head
at ${MASTER_ADDR}:6379 is reachable (e.g., TCP check/wait loop) before invoking
ray start, and if ray start fails capture and log its exit code (and exit
non‑zero) so connection failures are visible in logs and cause the pod/Job to
fail; update the surrounding echo messages to include the exit code when ray
start returns non‑zero.
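The exit-code capture from the suggested improvement can be written in a `set -e`-safe way. In this sketch `false` stands in for the real `ray start --address=... --block` invocation, since Ray is not assumed to be installed:

```shell
# `false` is a stand-in for: ray start --address="${MASTER_ADDR}:6379" --block
start_worker() { false; }

# Capture the exit code without tripping `set -e` on failure.
RAY_EXIT=0
start_worker || RAY_EXIT=$?
if [ "${RAY_EXIT}" -ne 0 ]; then
  echo "WARN: Ray worker exited with code ${RAY_EXIT} (expected when head shuts down)"
fi
echo "Worker done (Ray head shut down)."
```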
nemo_skills/pipeline/backends/local.py-198-213 (1)

198-213: ⚠️ Potential issue | 🟡 Minor

Timeout check uses if timeout which treats timeout=0 as "no timeout".

if timeout and (time.time() - start_time) > timeout evaluates to False when timeout=0. Per ComputeBackend.wait_for_completion, timeout=None means "wait indefinitely". Using if timeout is not None would correctly distinguish 0 from None, though 0-second timeouts are admittedly an edge case.

Proposed fix
-            if timeout and (time.time() - start_time) > timeout:
+            if timeout is not None and (time.time() - start_time) > timeout:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/local.py` around lines 198 - 213, The timeout
check in LocalComputeBackend.wait_for_completion uses "if timeout" which treats
timeout=0 as no timeout; change the condition in wait_for_completion (function
name: wait_for_completion, local variables: timeout, start_time, status) to
explicitly check "if timeout is not None and (time.time() - start_time) >
timeout" so a 0-second timeout is honored (i.e., return the current status
immediately after the first loop) while None still means wait indefinitely.
nemo_skills/pipeline/backends/kubernetes.py-759-779 (1)

759-779: ⚠️ Potential issue | 🟡 Minor

_parse_timeout raises unhandled ValueError on invalid input.

If timeout_str is something like "abc", the final int(timeout_str) at line 779 will raise a bare ValueError with no context about what went wrong. Similarly, "1.5h" would fail because int("1.5") raises.

Proposed fix
-        return int(timeout_str)
+        try:
+            return int(timeout_str)
+        except ValueError:
+            raise ValueError(f"Invalid timeout format: '{timeout_str}'. Use '6h', '30m', '300s', or integer seconds.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/kubernetes.py` around lines 759 - 779,
_parse_timeout currently can raise unhelpful ValueError (e.g., "abc" or "1.5h");
update _parse_timeout to validate and normalize inputs: accept numeric floats
for units (e.g., "1.5h") by parsing with float before converting to seconds,
wrap all int/float conversions for unit suffixes and HH:MM:SS parsing in
try/except to catch ValueError, and raise a new ValueError with a clear message
that includes the original timeout_str; ensure HH:MM:SS and MM:SS branches
validate the number of parts and numeric content and similarly raise informative
errors on failure.
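A hypothetical parser covering the formats named in the review ('6h', '30m', '300s', 'HH:MM:SS', plain integer seconds) might look like the following; it is a sketch under those assumptions, not the repository's `_parse_timeout` implementation:

```python
def parse_timeout(timeout_str: str) -> int:
    """Parse '6h', '30m', '300s', 'HH:MM:SS', 'MM:SS', or integer seconds.

    Hypothetical sketch: floats are accepted for unit suffixes ('1.5h'),
    and all failures raise a ValueError naming the offending input.
    """
    s = timeout_str.strip().lower()
    try:
        if ":" in s:
            parts = [int(p) for p in s.split(":")]
            if len(parts) == 3:
                h, m, sec = parts
            elif len(parts) == 2:
                h, (m, sec) = 0, parts
            else:
                raise ValueError
            return h * 3600 + m * 60 + sec
        if s.endswith("h"):
            return int(float(s[:-1]) * 3600)
        if s.endswith("m"):
            return int(float(s[:-1]) * 60)
        if s.endswith("s"):
            return int(float(s[:-1]))
        return int(s)
    except ValueError as e:
        raise ValueError(
            f"Invalid timeout format: '{timeout_str}'. "
            "Use '6h', '30m', '300s', 'HH:MM:SS', or integer seconds."
        ) from e

print(parse_timeout("1.5h"))      # 5400
print(parse_timeout("06:00:00"))  # 21600
```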

@esnvidia
Author

Addressed the review feedback and pushed fixes across backend, pipeline, scripts, and docs.

What was fixed:

  • memory_gb -> memory_request_gb in all ResourceSpec(...) calls in backend integration helpers.
  • Required config keys now use fail-fast access where expected (cluster_config["executor"], ssh_tunnel["user"], ssh_tunnel["host"]).
  • Kubernetes multi-node status handling now checks terminal conditions/completions correctly (no premature success on first completed pod).
  • Distributed rank export injection now supports bash/sh command variants.
  • ContainerSpec.mounts are now parsed and incorporated into K8s V1VolumeMounts; each container gets an independent mount list.
  • Local backend Docker GPU arg no longer includes literal quotes; timeout semantics fixed for timeout=0 edge case.
  • Slurm backend command construction uses shlex.join(...); stale context-managed experiment handle is no longer persisted.
  • Declarative timeout parsing now raises on invalid formats instead of silently defaulting.
  • K8s scripts hardened:
    • added missing dist.init_process_group(...) in multi-node smoke inline training
    • added --nnodes / --node_rank wiring in real SFT script torchrun command
    • added robust DNS post-loop checks
    • switched E2E validators to trap-based cleanup and unique per-run ConfigMap names
    • tightened NCCL transport regexes
  • Docs corrected:
    • removed invalid cluster_config arg from GenerationClientScript example
    • removed stale MPI-operator wording; clarified current validated Indexed Job + Headless Service path and pending CLI integration

Validation run:

  • ./.venv/bin/pre-commit run --all-files -> Passed
  • Scoped backend tests in .venv -> Passed
  • Live cluster validation scripts:
    • ./scripts/k8s-tests/validate_sft_e2e.sh --gpus 1 --timeout 900 -> Passed
    • ./scripts/k8s-tests/validate_sft_e2e_multinode.sh --nodes 2 --gpus 2 --timeout 1200 -> Passed

Logs were generated under scripts/k8s-tests/logs/ for the live runs and were not committed.

@esnvidia esnvidia force-pushed the feature/kubernetes-backend branch 2 times, most recently from 8d9211e to b766b9a on February 23, 2026 at 05:14
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 18

🧹 Nitpick comments (6)
docs/kubernetes-status.md (1)

57-59: Hardcoded source-line references will become stale.

References like declarative.py:641-647 and declarative.py:692-694 will drift as code changes. Consider replacing with symbol/function names or GitHub permalink anchors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-status.md` around lines 57 - 59, Replace the hardcoded
file:line references in the table with stable identifiers: point to the actual
function or class names that emit those runtime warnings in declarative.py
(e.g., the function that enforces job-dependency sequencing and the function
that logs skipped external dependencies) or use a GitHub permalink to a specific
commit/anchor rather than `declarative.py:641-647`/`692-694`; update the three
table cells ("Job dependencies force sequential", "SSH tunneling not available",
"External dependencies skipped") to reference those symbol names or permalink
anchors so the docs remain correct as code lines shift.
nemo_skills/pipeline/backends/kubernetes.py (1)

264-274: Add strict=True to zip() for container list alignment safety.

spec.containers and pod_spec.containers should always be the same length since the latter is built from the former (lines 205-207). Adding strict=True acts as a safety net against future bugs that could break this invariant.

Proposed fix
-        for container_spec, container in zip(spec.containers, pod_spec.containers):
+        for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/kubernetes.py` around lines 264 - 274, The loop
that pairs spec.containers with pod_spec.containers should use zip(...,
strict=True) to enforce equal-length alignment; update the loop in the
Kubernetes backend where you currently have for container_spec, container in
zip(spec.containers, pod_spec.containers): to use zip(spec.containers,
pod_spec.containers, strict=True) so any mismatch is raised immediately—this
affects the block that calls self._build_container_mounts(container_spec.mounts,
existing_volumes), extends pod_spec.volumes with extra_volumes, appends
extra_mounts to container_mounts, and assigns container.volume_mounts.
scripts/k8s-tests/sft_smoke_test.py (1)

156-169: Both single-node and multi-node torchrun commands use heredoc for script injection — consider a shared builder.

The command construction for single-node (lines 164-168) and multi-node (lines 157-162) share most of their structure (NCCL env exports, heredoc, torchrun invocation). The only difference is the --nnodes, --node_rank, and --master_addr flags. A small helper could reduce duplication, though this is acceptable for a smoke test script.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/sft_smoke_test.py` around lines 156 - 169, Refactor the
duplicated torchrun command construction by building a shared base command that
includes the NCCL exports, heredoc injection of SFT_TRAIN_SCRIPT and the common
torchrun args (e.g., --nproc_per_node from args.gpus and --master_port), then
conditionally append the multi-node flags (--nnodes={args.nodes},
--node_rank=${{NODE_RANK:-0}}, --master_addr=${{MASTER_ADDR:-localhost}}) when
args.nodes > 1; update the variable cmd to use this builder so the only
differences between single-node and multi-node are added conditionally rather
than duplicating the full heredoc + torchrun string.
nemo_skills/pipeline/backends/local.py (1)

240-258: Minor thread-safety concern in get_logs when reading from shared list.

Lines 242-244 iterate over job.logs[container] while _collect_logs (running on a daemon thread) concurrently appends to the same list. CPython's GIL makes this safe in practice, but the follow path (lines 247-258) properly uses index-based access to avoid issues. The initial yield loop (lines 242-244) could similarly use index-based access for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/local.py` around lines 240 - 258, The initial
log-yield loop in get_logs reads job.logs[container] with a for-loop while
_collect_logs may append concurrently; replace that for-loop with the same
index-based pattern used in the follow path: grab current_logs =
job.logs.get(container, []) and use while yielded < len(current_logs): yield
current_logs[yielded]; yielded += 1 so iteration is consistent and avoids
concurrent-iteration issues with _collect_logs.
nemo_skills/pipeline/backends/factory.py (1)

152-174: Registry-checked backends can shadow lazy-loaded built-in backends.

If someone registers a backend with the name "slurm", "kubernetes", "local", or "none" via register_backend, it will silently override the built-in lazy-import path (line 153 takes precedence). This may be intentional for extensibility but is worth documenting or validating.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/factory.py` around lines 152 - 174, Registry
entries currently override built-in lazy-loaded backends; to prevent silent
shadowing, treat the built-in names {"slurm","kubernetes","local","none"} as
reserved: update the register_backend function to validate the provided name
against this reserved set and raise a ValueError (or reject) if someone attempts
to register one of them, and also add a short comment in the factory function
where _BACKEND_REGISTRY is checked to clarify the reserved-name behavior;
reference register_backend, _BACKEND_REGISTRY, and the executor variable to
locate where to add the validation and comment.
nemo_skills/pipeline/utils/declarative.py (1)

260-260: was_modified condition can be simplified.

name != original.lower() or original != original.lower() is equivalent to name != original since name is entirely derived from original through the transformations above.

♻️ Proposed simplification
-    was_modified = name != original.lower() or original != original.lower()
+    was_modified = name != original
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` at line 260, The condition
computing was_modified is overly complex; replace the expression "was_modified =
name != original.lower() or original != original.lower()" with the simpler
equality check "was_modified = name != original" because name is derived from
original via the prior transformations; update this in the function/method where
variables name, original, and was_modified are defined (search for the
was_modified assignment in declarative.py) to remove redundant checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Add language specifiers to the three fenced code blocks
that currently lack them by changing their opening fences to ```text;
specifically update the blocks containing the log lines "Job name
'Qwen2.5_Math_7B' is not Kubernetes-compliant. Sanitized to 'qwen2-5-math-7b'."
and "Job 'Qwen2.5_Math_7B' will be submitted as 'qwen2-5-math-7b-12345'", and
the block that begins with the ASCII diagram line
"┌─────────────────────────..." (and the other similar code blocks referenced in
the comment) so they are fenced as ```text (or remove the fence for the ASCII
diagram if you prefer plain text).

In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 91-92: Replace the permissive lookup
cluster_config.get("executor") with direct access cluster_config["executor"] in
the KubernetesBackend validation so a missing executor raises KeyError
immediately; update the conditional in the function/method where
KubernetesBackend validates cluster_config (the block using
cluster_config.get("executor") and raising ValueError) to access
cluster_config["executor"] and keep the ValueError check for non-"kubernetes"
values.
- Around line 496-502: The try/except around int(gpu_count) currently swallows
TypeError/ValueError and leaves no trace, causing RDMA resources to be injected
for containers with malformed gpu_count; update the exception handler in the
gpu_count check (the block referencing gpu_count in the Kubernetes backend that
injects RDMA resources) to log a warning or error including the offending
gpu_count value and context (container/pod identifier if available) and then
continue (or skip injection) instead of silently passing, so malformed values
are visible in logs and RDMA injection is skipped for invalid counts.

In `@nemo_skills/pipeline/backends/slurm.py`:
- Line 136: The expression "main_container.resources.gpus or None" treats 0 as
falsy and converts CPU-only (0 GPUs) to None, which lets add_task (exp.py:497)
default to 1 GPU; change the assignment so 0 remains 0 and only actual None
stays None—e.g., replace num_gpus=main_container.resources.gpus or None with an
explicit None-check like num_gpus = main_container.resources.gpus if
main_container.resources.gpus is not None else None so CPU-only jobs are not
promoted to GPU requests.

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 809-819: The callable branch checking callable(script.inline) is
dead because _prepare_command (via prepare_for_execution and script.set_inline)
always leaves script.inline as a string; remove the unreachable branch and
simplify to use script.inline directly in the block that follows
_prepare_command; update the logic in the function containing this diff
(referencing _prepare_command, script.inline and _rewrite_local_paths) so it
only reads cmd_str = script.inline and optionally add a defensive
type-check/assert if you want to guard against future regressions, ensuring no
behavioral change for the Kubernetes path.
- Line 929: The log currently slices the list (container.command[:50]) which
never truncates the joined command; instead join the command list into a single
string (e.g., full_command = ' '.join(container.command)), then truncate
characters from that string and add an ellipsis only when truncated; update the
LOG.info call that references container.command so it logs something like the
first N characters of full_command (and appends "..." only if len(full_command)
> N) to ensure actual truncation of the command string.
- Around line 743-748: The current sequential branch only aborts when status ==
JobStatus.FAILED; change it to treat any non-SUCCEEDED terminal state as
failure: after status = backend.wait_for_completion(handle) check if status !=
JobStatus.SUCCEEDED and then raise RuntimeError with a descriptive message
including job_name and status.value. Update the LOG.info/LOG.error calls around
backend.wait_for_completion(handle) accordingly so the outcome is logged and
dependent jobs are not allowed to proceed when status is not SUCCEEDED.
- Around line 844-846: The code currently sets ports = [script.port] whenever
script has a "port" attribute, which can add invalid values like None or 0 into
the ContainerSpec; update the guard around the assignment (where ports and
script are referenced) to only append script.port when it is truthy/valid (e.g.,
non-None and >0) before passing ports into ContainerSpec so ContainerSpec
receives only valid port numbers.
- Around line 696-700: Replace the safe lookup dep.get("name") with direct
access dep["name"] where the code expects the key to exist: update the
dict-handling branch inside the dependency parsing (the elif block that sets
dep_name from dep) and the corresponding occurrence inside the _run_nemo_run
function so they use dep["name"] (keeping the subsequent if not dep_name check
as-is); this ensures a missing "name" raises a KeyError immediately instead of
returning None.

In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 59-60: The NCCL WARN messages are being appended to result.errors
but should populate the declared warnings list and not trigger a FAIL; update
the parsing logic where "NCCL WARN" (currently appended to result.errors) to
append to result.warnings instead, and adjust validate_result to only treat
result.errors as a hard FAIL (ignore or report result.warnings as non-fatal
info). Locate the dataclass fields errors and warnings and the function
validate_result and change the two/three occurrences that add NCCL WARN to use
result.warnings, and ensure validate_result checks result.errors (not warnings)
to decide a FAIL.

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Line 31: The Dockerfile currently leaves the final image running as root (due
to the Stage 2 "USER root" persisting into Stage 4) and uses deprecated "apt-key
adv"; update the Dockerfile to explicitly drop privileges in the final stage by
creating a least-privilege user (e.g., add user nemo, chown needed dirs, and set
USER nemo) or, if root is genuinely required for GPU verification, add an
explicit comment justifying it and restrict scope to only the verification
steps; also replace any "apt-key adv" usage with the signed-by pattern: fetch
the GPG into /usr/share/keyrings/<name>-archive-keyring.gpg and reference it in
the apt source entry (apply the same changes for the other occurrences flagged
in lines 84-104).
- Around line 38-43: The Dockerfile adds the NVIDIA devtools repo over HTTP,
uses apt-key to fetch the wrong/obsolete CUDA key (7fa2af80.pub) from the
compute/cuda path, and should instead install the modern cuda-keyring and add
the devtools repo via HTTPS with signed-by; replace the apt-key fetch and
plain-HTTP source creation with installing the cuda-keyring (e.g., download and
dpkg -i cuda-keyring.deb) and then write the repo line to
/etc/apt/sources.list.d/nvidia-devtools.list as "deb
[signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg]
https://developer.download.nvidia.com/devtools/repos/ubuntu2404/$(dpkg
--print-architecture) /", followed by apt-get update and the nsight-systems-cli
install, and remove the apt-key adv call and the old key filename.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 145-170: Add a minimal pod/container securityContext to the
trainer container to disable privilege escalation and run non-root where
possible: in the pod spec that sets hostNetwork: true and defines the containers
list (container name "trainer", image nemo-skills/nemo-rl-full, command
["/scripts/run_sft_mn.sh"]), add a securityContext block on the container with
allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally specify
runAsUser if the image supports a non-root UID), and if needed add an equivalent
pod-level securityContext to enforce fsGroup/runAsUser constraints so GPU/CUDA
and Ray networking still work; ensure these changes do not conflict with GPU
device access and test the "trainer" container startup.

In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 92-129: The MULTI_NODE_TRAIN_CMD string uses torchrun with a -c
inline code flag which torchrun doesn't support; change the implementation to
write the same Python snippet to a temporary script file (like the single-node
template does, e.g. /tmp/smoke_train.py via a heredoc or echo >> file) and pass
that script path to torchrun instead of using -c, ensuring the env vars and
torchrun args (--nproc_per_node, --nnodes, --node_rank, --master_addr,
--master_port) remain unchanged and the script contains the existing distributed
setup (dist.init_process_group, DDP, all_reduce, etc.).

In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 174-178: Replace the broad "except Exception" around the log
streaming loop that calls backend.get_logs(handle) with a specific exception
handler (e.g., kubernetes.client.ApiException) or remove the try/except so
unexpected errors propagate; import ApiException from kubernetes.client and
change the block to "except ApiException as e:" and print/log that error,
allowing any other exceptions (AttributeError, TypeError, etc.) to bubble up
instead of being silently swallowed.
- Line 52: Update the pip install invocation in the string that currently
contains "pip install -q accelerate 2>/dev/null &&" so that install stderr is
not discarded; specifically remove the "2>/dev/null" redirection (or replace it
with a safe stderr-preserving redirect to container logs) so any pip error
output from the install step is visible in the container logs.

In `@scripts/k8s-tests/smoke_test_multi_node.sh`:
- Around line 272-277: The interactive read prompt block (read -p ...; if [[
$REPLY =~ ^[Yy]$ ]]; then kubectl delete job "$JOB_NAME" -n "$NAMESPACE"
--ignore-not-found; kubectl delete service "$SERVICE_NAME" -n "$NAMESPACE"
--ignore-not-found; echo "Resources deleted."; fi) will hang in CI; remove this
entire prompt and conditional deletion and rely on the existing cleanup trap to
perform kubectl deletions for JOB_NAME and SERVICE_NAME in NAMESPACE so the
script is non-interactive and safe for automated pipelines.
- Around line 20-30: The script lacks a cleanup trap so the created Kubernetes
Job and Service (JOB_NAME, SERVICE_NAME) can leak on early exit; add a cleanup
function that deletes the Job and Service (using JOB_NAME and SERVICE_NAME) and
register it with trap cleanup EXIT INT TERM immediately after the variable setup
block (after SCRIPT_DIR assignment) so any premature exit will run cleanup and
remove the resources.

---

Duplicate comments:
In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 103-108: The current falsy check after direct indexing is
misleading because cluster_config["executor"] already exists; replace the vague
guard with an explicit validation that the value is a non-empty string: after
assigning executor = cluster_config["executor"], validate with something like if
not isinstance(executor, str) or not executor.strip(): raise
ValueError("cluster_config['executor'] must be a non-empty string"), then call
executor = executor.lower(); this makes the intent clear and prevents calling
.lower() on invalid types or empty strings.

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 470-477: The code now uses direct dict access for executor in two
places (self.cluster_config["executor"] and cluster_config["executor"]); ensure
both locations consistently use direct access and add an explicit, descriptive
KeyError if the key is missing rather than silently defaulting—update the logic
around the conditional that calls _run_kubernetes and _run_nemo_run to check for
"executor" presence and raise a clear error (e.g., "cluster config missing
'executor'") so callers of self._run_kubernetes and self._run_nemo_run get
actionable feedback.
- Around line 879-917: The _parse_timeout function should raise explicit
ValueError for unrecognized formats (addressing the prior silent fall-through);
ensure the implementation in _parse_timeout preserves exception chaining (use
"raise ... from e" for ValueError from int parsing), returns the 6-hour default
when timeout_str is empty, and covers all supported formats ('3600', '6h',
'30m', '06:00:00'); add unit tests for edge cases (empty string, numeric
seconds, 'h'/'m' suffixes, 'HH:MM:SS' and 'MM:SS', invalid strings) to validate
the new behavior and error messages.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (17)
scripts/k8s-tests/smoke_test_single_node.sh-119-124 (1)

119-124: ⚠️ Potential issue | 🟡 Minor

Exact float comparison in all-reduce assertion — use approximate comparison.

Line 124 uses assert tensor.item() == expected for comparing the all-reduce result. While this works for small world sizes (sum of small integers is exact in float32), the equivalent check in smoke_test_multi_node.sh (Line 196) correctly uses abs(...) < 1e-5. Be consistent and prefer the safer approximate comparison.

🔧 Proposed fix
-assert tensor.item() == expected, f"All-reduce failed: got {tensor.item()}, expected {expected}"
+assert abs(tensor.item() - expected) < 1e-5, f"All-reduce failed: got {tensor.item()}, expected {expected}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 119 - 124, The
assertion uses an exact float equality check on tensor.item() which can be
unstable; change the all-reduce verification after dist.all_reduce (variables:
tensor, rank, world_size) to use an approximate comparison like
abs(tensor.item() - expected) < 1e-5 (matching smoke_test_multi_node.sh) and
keep the same descriptive failure message including the actual and expected
values; update the assertion line accordingly so it validates within a small
tolerance instead of using ==.
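For illustration, the failure mode behind this finding is easy to reproduce outside the cluster. The values below are hypothetical (not the actual all-reduce tensors); the point is that exact float equality breaks once per-rank contributions are fractional, while a tolerance check holds:

```python
# Hypothetical per-rank contributions; integer sums would be exact in
# float32, but fractional values accumulate rounding error like a real
# all-reduce over non-integer gradients.
contributions = [0.1] * 10

total = 0.0
for value in contributions:
    total += value  # each addition can introduce a tiny rounding error

expected = 1.0
print(total == expected)             # False: exact comparison fails
print(abs(total - expected) < 1e-5)  # True: tolerance absorbs the error
```

This is why the `abs(...) < 1e-5` form used in smoke_test_multi_node.sh is the safer default for both scripts.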
scripts/k8s-tests/manifests/multi-node-sft-test.yaml-125-132 (1)

125-132: ⚠️ Potential issue | 🟡 Minor

DNS wait loop has no post-loop failure check — torchrun will proceed even if DNS never resolved.

The loop at Lines 125–129 retries DNS resolution up to 60 times but doesn't exit on failure. Compare with the equivalent in smoke_test_multi_node.sh (lines 140–143) which explicitly checks after the loop and exits with an error message. If DNS never resolves, torchrun will fail with a less obvious connection error.

🔧 Proposed fix
          for i in $(seq 1 60); do
            getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local && break
            echo "  retry $i..."
            sleep 2
          done
+          if ! getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local > /dev/null 2>&1; then
+            echo "FAIL: Master DNS not resolvable after 120s"
+            exit 1
+          fi
          torchrun --nproc_per_node=2 --nnodes=2 \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 125 - 132,
The DNS retry loop before invoking torchrun currently lacks a post-loop failure
check, so torchrun may run even if getent never succeeded; after the for loop
that runs getent hosts sft-multi-node-0... add an explicit check that the last
getent succeeded (or test resolvability of
sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local) and if not,
print an error message and exit non-zero (mirror the behavior in
smoke_test_multi_node.sh), ensuring torchrun --nproc_per_node=2 --nnodes=2
--node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=29500
/scripts/train.py only runs when DNS resolution succeeded.
scripts/k8s-tests/smoke_test_multi_node.sh-108-114 (1)

108-114: ⚠️ Potential issue | 🟡 Minor

WORLD_SIZE env var is set to NUM_NODES, not the actual world size.

Line 114 sets WORLD_SIZE to "${NUM_NODES}", which is the number of nodes, not the total number of ranks (NUM_NODES * NUM_GPUS). While torchrun overrides WORLD_SIZE for the spawned Python process, and WORLD_SIZE_TOTAL is separately set at Line 213, this env var is misleading and could cause confusion if anything reads it before torchrun starts.

🔧 Proposed fix
         - name: WORLD_SIZE
-          value: "${NUM_NODES}"
+          value: "${WORLD_SIZE}"

Here WORLD_SIZE (already computed at Line 43 as NUM_NODES * NUM_GPUS) is the correct total. Alternatively, rename this env var to NUM_NODES to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 108 - 114, The env
var WORLD_SIZE in the container manifest is incorrectly assigned NUM_NODES;
update the assignment so WORLD_SIZE uses the precomputed total-rank value (the
script variable that multiplies NUM_NODES * NUM_GPUS, referenced as the computed
WORLD_SIZE/WORLD_SIZE_TOTAL) instead of NUM_NODES, or alternatively rename the
manifest key to NUM_NODES to reflect it actually holds node count; change the
env entry for name: WORLD_SIZE to reference the total-rank variable (or rename
the env var to NUM_NODES) so the intent is unambiguous.
scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml-103-108 (1)

103-108: ⚠️ Potential issue | 🟡 Minor

Dead error-handling code: set -e causes exit before RAY_EXIT=$? is reached on failure.

Line 40 sets set -e, so if ray start --block (Line 103) exits with a non-zero status, the script terminates immediately. The RAY_EXIT=$? assignment on Line 104 and the subsequent check (Lines 105–108) are only reached when the exit code is 0, making the error branch unreachable.

Either disable set -e around this command or rely on set -e to propagate the failure directly:

🔧 Option A: rely on set -e (simplify)
       echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
       ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
-      RAY_EXIT=$?
-      if [ $RAY_EXIT -ne 0 ]; then
-        echo "FAIL: Ray worker exited with code $RAY_EXIT"
-        exit $RAY_EXIT
-      fi
       echo "Worker done (Ray head shut down)."
🔧 Option B: capture exit code (disable set -e locally)
       echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
+      set +e
       ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
       RAY_EXIT=$?
+      set -e
       if [ $RAY_EXIT -ne 0 ]; then
         echo "FAIL: Ray worker exited with code $RAY_EXIT"
         exit $RAY_EXIT
       fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around
lines 103 - 108, The current error-handling after running "ray start
--address=\"${MASTER_ADDR}:6379\" --num-gpus=${NUM_GPUS} --block" is dead
because "set -e" causes the script to exit before RAY_EXIT is captured; fix by
temporarily disabling errexit around that command: before invoking the ray start
line, do a local "set +e", run the command, capture the exit code into RAY_EXIT,
then restore "set -e" and perform the existing echo+exit check, ensuring the
RAY_EXIT variable and check (RAY_EXIT=$? and the if [ $RAY_EXIT -ne 0 ] block)
are reachable; alternatively you can remove the manual RAY_EXIT check and rely
on global "set -e" to propagate failures (choose one approach and update the
block around the "ray start" call and the RAY_EXIT usage accordingly).
scripts/k8s-tests/image-build/import-image.yaml-46-89 (1)

46-89: ⚠️ Potential issue | 🟡 Minor

DaemonSet pod will restart continuously after the sleep 10 completes.

Because DaemonSet pods are always restarted by Kubernetes, the sleep 10; exit 0 pattern at the end of the script causes the pod to enter a restart loop. Each restart re-runs the full import sequence. While ctr images import is idempotent, the pods will show as CrashLoopBackOff (or rapid-restart) in kubectl get pods, which is misleading given the stated "pass criteria: no restarts."

Consider replacing sleep 10 with sleep infinity so pods remain in Running state after a successful import and are only terminated when the DaemonSet is explicitly deleted:

-          # Keep running briefly so logs are capturable
-          sleep 10
+          echo "=== Import succeeded. Sleeping until DaemonSet is deleted. ==="
+          sleep infinity
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 46 - 89, The
DaemonSet pod restarts after "sleep 10" causing restart loops; change the
command script in the container's command block to stop exiting and instead
block indefinitely (e.g., replace the final "sleep 10" with "sleep infinity" or
an equivalent long-running wait) so the container stays in Running after
successful import; update the script that uses CTR (the shell block that runs
"ctr --address /run/containerd/containerd.sock -n k8s.io images import
/raid/nemo-rl.tar" and then "sleep 10") to use an indefinite sleep/tail to
prevent Kubernetes from restarting the pod.
cluster_configs/kubernetes/rbac.yaml-77-95 (1)

77-95: ⚠️ Potential issue | 🟡 Minor

Services create/delete is required for multi-node jobs but increases blast radius; document for operators.

The implementation calls create_namespaced_service() to create headless services for multi-node job discovery (in submit_job()) and delete_namespaced_service() to clean them up (in cleanup()), so the RBAC verbs are correct and necessary. However, Trivy KSV-0056 is valid: a compromised pod using this service account could create or remove arbitrary services in the namespace. This is intentional for the headless service lifecycle but operators should understand the threat model when applying this RBAC.

Also note: configmaps is not used in the codebase and can be removed from the role entirely. No patch verb is needed since ConfigMaps are never created, updated, or modified by the Kubernetes backend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/kubernetes/rbac.yaml` around lines 77 - 95, The Services
permission (verbs "create","delete") is indeed required by submit_job() (it
calls create_namespaced_service()) and cleanup() (it calls
delete_namespaced_service()), but it increases blast radius—update
documentation/operator notes to explicitly call out this threat model and why
the service-account needs create/delete for headless service lifecycle; also
remove the unused ConfigMaps permission (the "configmaps" rule) from the Role
since the codebase does not use or patch ConfigMaps, and keep the existing
Secrets and Events read-only verbs as-is.
docs/kubernetes-nemo-rl-sft-runbook.md-75-75 (1)

75-75: ⚠️ Potential issue | 🟡 Minor

Add language specifiers to the three unlabelled fenced code blocks (MD040).

Lines 75, 116, and 137 open fenced blocks without a language tag, triggering markdownlint MD040. Use text for log output and the ASCII diagram.

🔧 Proposed fix
-```
+```text
 ▶ Setting up data...

-```
+```text
 Node 0 (Ray head, dgx-30):

-```
+```text
 ✓ Ray cluster initialized with 2 nodes

Also applies to: 116-116, 137-137

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-nemo-rl-sft-runbook.md` at line 75, Three fenced code blocks
that begin with "▶ Setting up data...", "Node 0 (Ray head, dgx-30):", and "✓ Ray
cluster initialized with 2 nodes" are missing language specifiers, triggering
MD040; update each opening fence from ``` to ```text so they become ```text ...
``` to mark them as plain text/console output (search for those exact block
starts to locate the three fences and replace the fence only).
scripts/k8s-tests/image-build/build-nemo-rl.yaml-53-53 (1)

53-53: ⚠️ Potential issue | 🟡 Minor

Pin Kaniko and init-container images to specific versions.

Both alpine/git:latest and gcr.io/kaniko-project/executor:latest are unpinned. For a build-artifact job whose explicit purpose is reproducibility, :latest is contradictory — a silent upstream change will produce a different build product without any code change.

🔧 Proposed fix
-        image: alpine/git:latest
+        image: alpine/git:2.45.2
-        image: gcr.io/kaniko-project/executor:latest
+        image: gcr.io/kaniko-project/executor:v1.23.0

Also applies to: 101-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 53, Replace the
unpinned images with immutable versioned tags or digests: change "image:
alpine/git:latest" to a specific alpine/git tag or digest (e.g.,
"alpine/git:2.34.1" or "alpine/git@sha256:...") and replace
"gcr.io/kaniko-project/executor:latest" with a fixed kaniko release tag or
digest (e.g., "gcr.io/kaniko-project/executor:v1.10.0" or the image@sha256:...).
Update any other occurrences (noted around lines 101) so the build uses pinned
versions for reproducibility.
scripts/k8s-tests/image-build/kaniko-build.sh-37-37 (1)

37-37: ⚠️ Potential issue | 🟡 Minor

Remove unused REPO_ROOT variable (SC2034).

REPO_ROOT is computed but never referenced anywhere in the script.

🔧 Proposed fix
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 37, The variable
REPO_ROOT is computed but never used (SC2034); remove the unused assignment
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" from kaniko-build.sh
(referencing SCRIPT_DIR if still needed elsewhere) so the script no longer
defines an unused variable—delete that line or replace it with a used expression
if you intended to reference the repository root elsewhere.
cluster_configs/example-kubernetes.yaml-120-126 (1)

120-126: ⚠️ Potential issue | 🟡 Minor

Misplaced "Fallback backend" comment creates confusing documentation.

Line 120 reads # Fallback backend if Kubernetes is unavailable but the immediately following active configuration belongs to image_pull_policy (line 123). The fallback_executor configuration actually appears at line 126. Reorder the comments to match their respective settings.

🔧 Proposed fix
-# Fallback backend if Kubernetes is unavailable
 # Image pull policy (default: IfNotPresent)
 # Set to "Never" when using locally-loaded images via ctr import
 # image_pull_policy: Never
 
+# Fallback backend if Kubernetes is unavailable
 # Useful for development/testing when K8s cluster is not accessible
 # fallback_executor: local
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/example-kubernetes.yaml` around lines 120 - 126, The comment
"Fallback backend if Kubernetes is unavailable" is misplaced above the
image_pull_policy setting; move or swap the comment so it sits immediately above
the fallback_executor setting and ensure the image_pull_policy comment remains
directly above image_pull_policy. Specifically, locate the comments and keys
image_pull_policy and fallback_executor in the snippet and reorder them so each
descriptive comment precedes its matching config key (place the fallback backend
comment directly above fallback_executor and keep the image pull policy comment
directly above image_pull_policy).
cluster_configs/example-kubernetes.yaml-148-155 (1)

148-155: ⚠️ Potential issue | 🟡 Minor

Missing watch verb for services in the inline RBAC example.

The inline RBAC comment at line 154 lists ["create", "delete", "get", "list"] for the services resource, but docs/kubernetes-guide.md (line 118) includes "watch" as well. Headless Service lifecycle management during multi-node jobs benefits from watch access to detect creation/deletion races. The inconsistency will confuse users who copy from this file.

🔧 Proposed fix
-#    - apiGroups: [""]
-#      resources: ["services"]
-#      verbs: ["create", "delete", "get", "list"]  # Required for multi-node headless services
+#    - apiGroups: [""]
+#      resources: ["services"]
+#      verbs: ["create", "delete", "get", "list", "watch"]  # Required for multi-node headless services
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/example-kubernetes.yaml` around lines 148 - 155, The inline
RBAC example is missing the "watch" verb on the services resource; update the
commented block that defines the resources for "services" (the line showing
resources: ["services"] and verbs: ["create", "delete", "get", "list"]) to
include "watch" in the verbs array so it matches docs/kubernetes-guide.md and
allows headless Service lifecycle watches during multi-node jobs.
scripts/k8s-tests/image-build/build-and-load.sh-19-29 (1)

19-29: ⚠️ Potential issue | 🟡 Minor

No trap-based cleanup — helper pods may linger on failure.

The script creates multiple ephemeral pods (build-prep, build-nemo-rl-local, cache-copy, load-$NODE) but has no trap ... EXIT handler. If the script exits early due to set -e, these pods remain in the cluster. Compare with validate_sft_e2e.sh (line 61) and validate_sft_e2e_multinode.sh (lines 54-58), which both use trap-based cleanup.

Consider adding a trap that deletes known helper pods on exit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 19 - 29, The
script lacks a trap-based cleanup so helper pods (build-prep,
build-nemo-rl-local, cache-copy, load-$NODE) can linger on failure; add a
cleanup function (e.g., cleanup()) that runs on EXIT via trap cleanup EXIT and
uses kubectl -n "$NAMESPACE" delete pod for those known pod names (use patterns
or the variables that construct load-$NODE) and any temp resources, then call
that trap near the top (after NAMESPACE/CACHE_PATH/IMAGE_TAG are set) so any
early exit from the script (set -e) will remove the ephemeral pods.
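A Python analog of the `trap cleanup EXIT` idiom is `atexit`: register the cleanup once, and it runs no matter how the script terminates. The pod names and the `deleted` list below are hypothetical stand-ins for the real `kubectl delete` calls:

```python
import atexit

# Hypothetical helper-pod names; the real script creates build-prep,
# build-nemo-rl-local, cache-copy, and load-$NODE.
created = []
deleted = []

def cleanup():
    # Analog of `trap cleanup EXIT`: remove every helper pod we created.
    for pod in created:
        deleted.append(pod)  # stand-in for: kubectl delete pod "$pod" --ignore-not-found

atexit.register(cleanup)  # guaranteed to run even if an exception exits early

created.extend(["build-prep", "load-node-0"])
cleanup()  # invoked directly here so the effect is visible before exit
print(deleted)
```

The key property, in bash or Python, is that cleanup is registered *before* the first resource is created, so an early `set -e` exit cannot skip it.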
nemo_skills/pipeline/backends/kubernetes.py-264-274 (1)

264-274: ⚠️ Potential issue | 🟡 Minor

zip() without strict=True silently ignores mismatched container lists.

If spec.containers and pod_spec.containers (which is the containers list built above) have different lengths — for instance, if _build_container filters or skips a spec — the loop would silently process only the shorter list, leaving some containers without their mounts.

Proposed fix
-        for container_spec, container in zip(spec.containers, pod_spec.containers):
+        for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/kubernetes.py` around lines 264 - 274, The loop
pairing spec.containers and pod_spec.containers uses zip without strict=True so
mismatched lengths are silently ignored; update the code that iterates over
container_spec and container to either validate that len(spec.containers) ==
len(pod_spec.containers) and raise a clear error (including related symbols
container_spec, container, spec.containers, pod_spec.containers) before calling
_build_container_mounts, or replace zip(...) with zip(..., strict=True) to force
an exception on mismatch; ensure the change lives alongside the existing
container_mounts handling and volume_mounts assignment so any mismatch is
detected instead of silently skipped.
scripts/k8s-tests/pipeline_smoke_test.py-246-250 (1)

246-250: ⚠️ Potential issue | 🟡 Minor

Broad except Exception swallows unexpected log-retrieval errors.

Per coding guidelines, don't catch exceptions not normally expected to be raised. get_logs() already handles ApiException internally. Let other exceptions propagate.

Proposed fix
-            try:
-                for line in backend.get_logs(handle):
-                    print(line, end="" if line.endswith("\n") else "\n")
-            except Exception as e:
-                print(f"Failed to get logs: {e}")
+            for line in backend.get_logs(handle):
+                print(line, end="" if line.endswith("\n") else "\n")

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 246 - 250, The broad
"except Exception as e" around the log retrieval loop should be removed so
unexpected errors from backend.get_logs(handle) can propagate; edit the block
containing backend.get_logs(handle) and the print loop (the try/except around
that loop) to directly iterate and print lines (for line in
backend.get_logs(handle): print(...)) without catching Exception, since get_logs
already handles ApiException internally.
scripts/k8s-tests/sft_smoke_test.py-213-222 (1)

213-222: ⚠️ Potential issue | 🟡 Minor

Missing backend.cleanup(handle) after job completion.

Unlike run_sft_k8s_real.py (line 181), this script doesn't call backend.cleanup(handle) after waiting for completion. For multi-node jobs, this means the headless service will not be cleaned up.

Proposed fix
         if status in (JobStatus.SUCCEEDED, JobStatus.FAILED):
             print("\n--- Logs ---")
             for line in backend.get_logs(handle):
                 print(line, end="" if line.endswith("\n") else "\n")
+        backend.cleanup(handle)
         if status == JobStatus.FAILED:
             sys.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/sft_smoke_test.py` around lines 213 - 222, The loop
handling job completion never calls backend.cleanup(handle), so headless
services for multi-node jobs are left running; after calling
backend.wait_for_completion(handle, timeout=1800) and after printing logs (the
block that iterates backend.get_logs(handle)), invoke backend.cleanup(handle)
for every finished job (both SUCCEEDED and FAILED) and ensure cleanup is
executed before exiting on failure (i.e., call backend.cleanup(handle) prior to
sys.exit(1) in the JobStatus.FAILED branch).
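One way to guarantee the cleanup-before-exit ordering the prompt asks for is a `try`/`finally` around the post-completion work. The `FakeBackend` below is a hypothetical stand-in with the same method names the review quotes, used only to show that `cleanup` runs on both the success and failure paths:

```python
class FakeBackend:
    """Hypothetical stand-in recording whether cleanup ran."""
    def __init__(self, fail=False):
        self.fail = fail
        self.cleaned = False
    def wait_for_completion(self, handle, timeout=None):
        return "FAILED" if self.fail else "SUCCEEDED"
    def cleanup(self, handle):
        self.cleaned = True

def run_job(backend, handle="job-0"):
    status = backend.wait_for_completion(handle, timeout=1800)
    try:
        pass  # log printing (backend.get_logs) would go here
    finally:
        backend.cleanup(handle)  # runs for SUCCEEDED and FAILED alike
    if status == "FAILED":
        raise SystemExit(1)

backend = FakeBackend(fail=True)
try:
    run_job(backend)
except SystemExit:
    pass
print("cleanup ran:", backend.cleaned)  # True even on the failure path
```

Placing the `sys.exit(1)` after the `finally` block is what ensures the headless service is always torn down.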
scripts/k8s-tests/pipeline_smoke_test.py-239-254 (1)

239-254: ⚠️ Potential issue | 🟡 Minor

Missing backend.cleanup(handle) after job completion.

Same issue as in sft_smoke_test.py — headless services for multi-node jobs won't be cleaned up. run_sft_k8s_real.py does call cleanup().

Proposed fix
         if status == JobStatus.FAILED:
             print(f"\nERROR: Job '{name}' failed")
             sys.exit(1)
+
+        backend.cleanup(handle)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 239 - 254, Add a
cleanup call for each job handle after completion: after using
backend.wait_for_completion(handle) and after printing logs from
backend.get_logs(handle) (or on exception), ensure backend.cleanup(handle) is
invoked so headless services are removed; update the loop over result items
(variables name and handle) to call backend.cleanup(handle) in a finally block
(or otherwise guaranteed path) and keep existing failure logic that sys.exit(1)
when status == JobStatus.FAILED.
nemo_skills/pipeline/utils/declarative.py-446-456 (1)

446-456: ⚠️ Potential issue | 🟡 Minor

Same-second job submissions will collide due to timestamp-only suffix.

int(time.time()) % 100000 produces the same value for all calls within the same second. If two pipelines (or a retry) submit jobs in the same second, they get identical names, causing a Kubernetes conflict.

Consider mixing in a random component or an atomic counter.

Proposed fix
         import time
+        import random

-        suffix = f"-{int(time.time()) % 100000:05d}"
+        suffix = f"-{int(time.time()) % 100000:05d}-{random.randint(0, 999):03d}"
         suffix_len = len(suffix)  # 6 characters

(Adjust suffix_len comment and max_length math accordingly.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 446 - 456, The suffix
generation in declarative.py (variables suffix, suffix_len, base_name,
max_base_len, max_length) uses int(time.time()) % 100000 which collides for
same-second submissions; modify the suffix to include a high-resolution or
random component (e.g., use time.time_ns() % N or append a short random hex from
uuid4()/secrets.randbits()) and update suffix_len accordingly, then recalculate
max_base_len = max_length - suffix_len and adjust truncation logic for
base_name; ensure the log/comment about suffix length is corrected to reflect
the new suffix size.
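A deterministic alternative to the random component is an atomic counter, which never collides even for back-to-back submissions in the same second. This is a sketch, not the declarative.py code; `itertools.count` is effectively atomic under CPython's GIL:

```python
import itertools
import time

_counter = itertools.count()

def job_suffix():
    """Timestamp for readability plus a counter to break same-second ties."""
    return f"-{int(time.time()) % 100000:05d}-{next(_counter) % 1000:03d}"

# Timestamp alone collides for two submissions in the same second:
t = int(time.time()) % 100000
print(f"-{t:05d}" == f"-{t:05d}")  # True: identical names, K8s conflict

# The counter component breaks the tie:
s1, s2 = job_suffix(), job_suffix()
print(s1 == s2)   # False: distinct even within the same second
print(len(s1))    # 10: suffix_len and max_base_len math must use this
```

Whichever variant is chosen, `suffix_len` and the `max_base_len` truncation must be updated to match the new suffix width.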
🧹 Nitpick comments (11)
scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)

46-46: Nit: NCCL_DEBUG=INFO is set in both the shell script (Line 46) and the container env (Line 98).

The script's export NCCL_DEBUG=INFO overrides the env var from the Job spec anyway, so the container-level env entry is redundant. Consider keeping it in only one place.

Also applies to: 97-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` at line 46, The
NCCL_DEBUG setting is duplicated: the shell script contains "export
NCCL_DEBUG=INFO" and the Kubernetes Job manifest also sets an env var NCCL_DEBUG
in the container spec; remove the redundancy by keeping NCCL_DEBUG=INFO in only
one place—either delete the shell script export or remove the container env
entry for NCCL_DEBUG in the Job spec—ensuring the remaining location (shell
export or container env) is the single source of truth for NCCL_DEBUG.
scripts/k8s-tests/image-build/import-image.yaml (1)

42-108: Privileged security context is intentional — document the trust boundary.

Trivy and Checkov flag privileged: true, hostPID: true, and the lack of readOnlyRootFilesystem. These are required by design:

  • privileged: true — needed to access /run/containerd/containerd.sock from the container.
  • hostPID: true — required for the nsenter -t 1 fallback path.
  • Writable root FS — needed for apk add at runtime.

One hardening improvement that does not require relaxed privileges: mount only the socket file rather than the entire containerd directory:

-      - name: containerd-sock
-        hostPath:
-          path: /run/containerd
+      - name: containerd-sock
+        hostPath:
+          path: /run/containerd/containerd.sock
+          type: Socket

This limits the host surface exposed to the container from the whole /run/containerd/ directory to just the socket. The --address flag in the ctr invocation already points to the full path so no other changes are needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 42 - 108,
Change the containerd volume to mount only the socket file instead of the whole
directory: update the volume named containerd-sock (and its corresponding
volumeMount) so hostPath.path points to /run/containerd/containerd.sock and the
mountPath is /run/containerd/containerd.sock, and set hostPath.type to Socket if
supported; leave the ctr invocation (--address /run/containerd/containerd.sock)
and the volumeMount name containerd-sock unchanged so the existing CTR usage
continues to work.
nemo_skills/pipeline/backends/local.py (1)

130-137: Rename unused loop variable i to _.

i from enumerate() is never referenced inside the loop body. Ruff B007 flags this correctly.

✏️ Proposed fix
-        for i, (proc, container_spec) in enumerate(zip(processes, spec.containers)):
+        for _, (proc, container_spec) in enumerate(zip(processes, spec.containers)):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/local.py` around lines 130 - 137, The loop in
the local backend uses enumerate(...) but never uses the index variable `i`;
rename `i` to `_` in the for statement (for _, (proc, container_spec) in
enumerate(zip(processes, spec.containers)): ) to silence the unused-variable
warning (Ruff B007) in the method that starts log-collector threads and appends
to job._log_threads in nemo_skills.pipeline.backends.local (the block creating
threading.Thread and calling self._collect_logs).
cluster_configs/example-kubernetes.yaml (1)

88-92: NCCL_DEBUG=INFO in the default env_vars will flood logs on every run.

NCCL_DEBUG=INFO logs a status line for every collective call; at training batch frequency, this produces megabytes of noise per job. Either comment it out (example-only) or downgrade to WARN.

🔧 Proposed fix
 env_vars:
   - HF_HOME=/models/hf-cache
-  - NCCL_DEBUG=INFO
-  # - CUDA_VISIBLE_DEVICES=all
+  # - NCCL_DEBUG=INFO  # Uncomment to debug NCCL connectivity issues
+  # - CUDA_VISIBLE_DEVICES=all
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/example-kubernetes.yaml` around lines 88 - 92, The default
env_vars block currently sets NCCL_DEBUG=INFO which will generate excessive log
noise; change this in the example by either commenting out NCCL_DEBUG or
lowering it to NCCL_DEBUG=WARN in the env_vars list (refer to the env_vars entry
and the NCCL_DEBUG key) so the example config doesn't flood logs on every run.
scripts/k8s-tests/image-build/kaniko-build.sh (1)

186-202: Copy-helper pod leaks on failure due to missing cleanup trap.

With set -euo pipefail, any failure after the kubectl run "copy-${name}" call (e.g., kubectl wait timeout or kubectl cp error) will exit the function immediately, leaving the helper pod running for its full 600-second sleep. Adding a trap ensures cleanup:

🔧 Proposed fix
+    trap "kubectl delete pod 'copy-${name}' -n '$NAMESPACE' --ignore-not-found" EXIT
     kubectl run "copy-${name}" --image=busybox:1.36 --restart=Never \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 186 - 202, The
helper pod "copy-${name}" can be leaked if any command fails after kubectl run;
add a cleanup trap immediately after creating the pod to always delete it on
EXIT/ERR (use kubectl delete pod "copy-${name}" -n "$NAMESPACE"
--ignore-not-found) so failures in kubectl wait or kubectl cp don’t leave the
sleep pod running; ensure the trap is set before calling kubectl wait and that
the existing explicit delete at the end remains or is idempotent with the trap
to avoid double-delete errors.
scripts/k8s-tests/image-build/build-and-load.sh (2)

129-129: Kaniko image pinned to :latest — non-reproducible builds.

gcr.io/kaniko-project/executor:latest can change between runs, potentially causing silent build behavior changes. Pin to a specific digest or version tag for reproducibility.

-    image: gcr.io/kaniko-project/executor:latest
+    image: gcr.io/kaniko-project/executor:v1.23.2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` at line 129, The Kaniko
executor image is pinned to the floating tag
"gcr.io/kaniko-project/executor:latest" which makes builds non-reproducible;
update the image reference used in the image declaration (currently "image:
gcr.io/kaniko-project/executor:latest") to a fixed version or digest (e.g., a
specific tag like vX.Y.Z or an immutable digest `@sha256`:...) so the build uses
an exact, reproducible Kaniko binary across runs.

234-255: Privileged container with host root mount — expected for containerd import but worth documenting.

The loader pod runs as privileged: true with the entire host filesystem mounted at /host. This is required for ctr import but represents a significant security surface. The script header documents the acceptance criteria well, but consider adding a brief inline comment noting why privileged access is needed here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 234 - 255, The
loader pod is started with privileged: true and mounts the entire host at the
volume "host-root" (mounted to /host) to allow containerd operations; add a
concise inline comment next to the kubectl run/--overrides block (near the
"loader" container, "securityContext": {"privileged": true} and the "host-root"
volume/volumeMount) explaining that privileged access and the host-root mount
are required for running ctr import against the node's containerd socket, and
that this is an intentional tradeoff for the image import step.
scripts/k8s-tests/validate_sft_e2e.sh (1)

326-341: Pass-criteria checks are solid but the nemo_skills grep is very broad.

grep -q "nemo_skills" "$LOG_FILE" will match any occurrence of the string in the log, including error messages like "FAIL: nemo_skills core import error". Since the "E2E SFT VALIDATION PASSED" check on line 326 already confirms success, this secondary check is redundant but harmless. Consider tightening it to match the specific success line if you want higher-fidelity validation.

-if ! grep -q "nemo_skills" "$LOG_FILE"; then
+if ! grep -q "nemo_skills version:" "$LOG_FILE"; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 326 - 341, The grep for
"nemo_skills" is too broad and can match error lines; tighten the check in
validate_sft_e2e.sh by replacing the broad grep -q "nemo_skills" "$LOG_FILE"
with a pattern that matches the specific success message emitted on successful
import (or remove the check since "E2E SFT VALIDATION PASSED" already guarantees
success); update the conditional that sets PASS=false to use the new
success-specific pattern and reference LOG_FILE and the existing "E2E SFT
VALIDATION PASSED" check to keep behavior consistent.
nemo_skills/pipeline/backends/factory.py (1)

152-174: Registry check before built-in backends could shadow built-in names.

If someone registers a custom backend under the name "kubernetes", it would take precedence over the built-in lazy import path (line 153-154 runs before line 162-165). This is likely intentional for extensibility, but it means a malicious or buggy plugin could silently replace the real KubernetesBackend. Consider documenting this precedence or logging a warning when a registered backend shadows a built-in name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/factory.py` around lines 152 - 174, The current
factory returns _BACKEND_REGISTRY[executor] before considering built-ins, which
lets a registered name shadow built-ins like "kubernetes", "slurm",
"local"/"none" (symbols: _BACKEND_REGISTRY, executor, KubernetesBackend,
SlurmBackend, LocalBackend). Modify the branch that checks _BACKEND_REGISTRY to
detect when executor equals one of the built-in names and emit a warning (use
logging.getLogger(__name__).warning or warnings.warn) indicating a registry
entry is shadowing a built-in, then proceed to return the registry backend;
alternatively, if you prefer built-ins to win, flip the order to check built-in
names first and fall back to _BACKEND_REGISTRY, making the change around the
existing registry check in factory.py.
nemo_skills/pipeline/backends/slurm.py (2)

97-99: Placeholder methods are fine for a transitional wrapper, but consider raising NotImplementedError instead of silently returning misleading values.

get_status returns RUNNING unconditionally, wait_for_completion delegates to get_status (so also returns RUNNING), and cancel_job returns False. A caller relying on these methods will get incorrect data silently. Raising NotImplementedError would make it explicit that these are unimplemented, matching the documented "TRANSITIONAL WRAPPER" status.

Also applies to: 156-169, 171-178, 180-188, 190-203, 205-207, 209-216

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/slurm.py` around lines 97 - 99, Several methods
in the Slurm backend are placeholders that return misleading default values;
update each placeholder method (e.g., property name, get_status,
wait_for_completion, cancel_job and the other unimplemented methods referenced
in the review) to raise NotImplementedError instead of returning fixed values
like "RUNNING" or False so callers know they are unimplemented; locate and
replace the bodies of methods named get_status, wait_for_completion, cancel_job
and any other transitional-wrapper methods in the SlurmBackend class to raise
NotImplementedError with a brief message (e.g., "transitional wrapper: not
implemented") to make the behavior explicit.

128-153: Context-managed experiment stores task that may reference a closed experiment.

get_exp is used as a context manager (line 129). After the with block exits, the experiment may be finalized. However, task (stored in self._jobs at line 143) was created within that context and may hold a reference to the now-closed experiment. Since get_status, wait_for_completion, and cancel_job are all placeholders that don't actually use task, this isn't immediately exploitable — but it will become a problem once those methods are implemented.

Consider documenting this constraint or restructuring so the experiment is reopened when needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 153, The code
stores a Task object created inside the get_exp context (via add_task) into
self._jobs (job_info["task"]) which may reference a finalized/closed experiment;
instead, stop storing the live Task from inside the with-block and either (a)
persist only serializable identifiers/metadata (e.g., experiment name, task id,
container image, cmd) in job_info so later methods can re-open the experiment
via get_exp(spec.name, self.config) and reconstruct or look up the Task, or (b)
ensure any future uses of the stored task (in get_status, wait_for_completion,
cancel_job) re-enter get_exp to reopen the experiment before touching the Task;
update job_info creation around add_task/get_exp and corresponding consumer
methods (get_status, wait_for_completion, cancel_job) to follow one of these
approaches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/backends/integration.py`:
- Around line 269-302: run_job_and_wait currently submits a job (via get_backend
-> submit_job) and waits for completion (backend.wait_for_completion) but never
calls backend.cleanup(handle) and only returns JobStatus, leaking resources and
preventing caller cleanup; fix by either invoking backend.cleanup(handle) after
wait_for_completion (ensure cleanup runs even on exceptions/timeouts) or change
the function signature to return both (status, handle) so the caller can call
backend.cleanup(handle); update references to run_job_and_wait callers to handle
the new return value if you choose the second option.

In `@nemo_skills/pipeline/backends/local.py`:
- Around line 95-98: The constructor (__init__) silently swallows a missing
required key by using cluster_config.get("executor") when determining
use_docker; change this to use direct access (cluster_config["executor"] ==
"local") so a KeyError is raised if "executor" is absent, ensuring a loud
fail-fast behavior; update the assignment to self.use_docker in the
LocalBackend/__init__ (referencing cluster_config and use_docker) and, if
desired, wrap the access in a try/except that raises a clearer, custom error
message before populating self._jobs.
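The fail-fast variant with a clearer error message might look like this (the helper name is illustrative; in the real constructor the result would be assigned to `self.use_docker`):

```python
def is_local_executor(cluster_config):
    """Direct key access so a missing 'executor' key fails loudly."""
    try:
        executor = cluster_config["executor"]
    except KeyError:
        raise KeyError(
            "cluster_config is missing the required 'executor' key"
        ) from None
    return executor == "local"
```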

In `@scripts/k8s-tests/smoke_test_single_node.sh`:
- Around line 236-243: The interactive read prompt blocks CI; remove the prompt
and conditional and perform non-interactive cleanup instead: delete the read -p
... echo and the if [[ $REPLY =~ ... ]]; then ... fi block and either call
kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found
unconditionally (recommended when a cleanup trap exists) or gate deletion behind
an explicit env var (e.g., CLEANUP=true) so CI runs won’t hang; update
references to JOB_NAME and NAMESPACE and keep the existing kubectl delete
command.
- Around line 20-28: The script lacks a cleanup trap so the Kubernetes Job
(JOB_NAME) can be left running on early exits; add a cleanup function (e.g.,
cleanup_job) that deletes the Job using kubectl delete job "$JOB_NAME" -n
"$NAMESPACE" (and optionally waits) and register it with trap 'cleanup_job' EXIT
or trap 'cleanup_job ERR EXIT' after JOB_NAME and NAMESPACE are set; ensure the
trap is installed early (right after JOB_NAME/TIMEOUT_SECONDS/SCRIPT_DIR are
defined) so any premature exit will call cleanup_job.

In `@scripts/k8s-tests/validate_k8s_manifests.sh`:
- Around line 30-40: The current fallback unconditionally switches to kubectl
--validate=false which can hide real schema errors; change the loop that runs
kubectl apply --dry-run=client --validate=true -f "$file" to capture its
stderr/output and exit code into variables, and only retry with --validate=false
when the captured output matches known schema-unavailability or connectivity
patterns (e.g., "unable to retrieve the complete list", "failed to download",
"no matches for kind", "unable to connect", "connection refused", "x509",
"context deadline exceeded"); if strict validation failed for any other reason
(indicative of a real schema error like misspelled fields) print the captured
error and exit non-zero instead of falling back. Ensure you reference the
existing kubectl apply invocation and the files[] loop when implementing this
conditional retry.

---

Duplicate comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Several fenced code blocks (e.g., the log snippet
beginning "Job name 'Qwen2.5_Math_7B' is not Kubernetes-compliant. Sanitized to
'qwen2-5-math-7b'." and the ASCII diagram block) are missing language
specifiers; update each open fenced block at the shown locations (lines around
the quoted log, the block spanning 701-707, and the ASCII diagram 715-758) to
use a language tag of text (```text) so they comply with MD040 and render
correctly. Ensure you add ```text at the start of each fenced block and keep the
closing ``` unchanged.

In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 91-92: Replace the optional lookup with direct dictionary access:
in the Kubernetes backend check using cluster_config.get("executor") (in the
KubernetesBackend/related initialization code) change it to
cluster_config["executor"] != "kubernetes" so a missing "executor" key raises a
KeyError and fails fast; keep the existing ValueError for the wrong value case,
ensuring the check uses the direct access expression.
- Around line 497-502: The except block currently swallows TypeError/ValueError
when parsing gpu_count, which lets malformed values (e.g., "8x") fall through
and cause RDMA injection; change the handler to treat parse failures as invalid
counts by logging a warning that includes the offending gpu_count and the
container identifier (e.g., container name or id) and then continue to skip RDMA
resource injection. Update the except clause that catches TypeError/ValueError
around the int(gpu_count) check (referencing gpu_count and the current
container/RDMA injection logic) to emit a clear log message and perform continue
instead of pass.
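The warn-and-skip behavior for malformed GPU counts could be sketched like this. The container representation and threshold are assumptions for illustration, not the backend's real data model:

```python
import logging

LOG = logging.getLogger(__name__)


def rdma_eligible_containers(containers, rdma_threshold=8):
    """Return container names eligible for RDMA injection, skipping bad gpu_count."""
    eligible = []
    for container in containers:
        gpu_count = container.get("gpu_count")
        try:
            gpus = int(gpu_count)
        except (TypeError, ValueError):
            # Treat parse failures as invalid counts, not as "inject anyway".
            LOG.warning(
                "Invalid gpu_count %r on container %r; skipping RDMA injection.",
                gpu_count,
                container.get("name"),
            )
            continue
        if gpus >= rdma_threshold:
            eligible.append(container["name"])
    return eligible
```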

In `@nemo_skills/pipeline/backends/slurm.py`:
- Around line 136-139: Replace the falsy-or and hardcoded node count: set
num_gpus to the actual GPU value without treating 0 as None by using an explicit
None check on main_container.resources.gpus (e.g., num_gpus =
main_container.resources.gpus if main_container.resources.gpus is not None else
None) and set num_nodes from the job spec instead of hardcoding 1 by using
spec.num_nodes with an explicit None fallback (e.g., num_nodes = spec.num_nodes
if spec.num_nodes is not None else 1) in the call that builds the Slurm task
(the call that currently passes num_gpus=main_container.resources.gpus or None
and num_nodes=1).
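The difference between the falsy-`or` pattern and explicit `None` checks is easy to see in isolation (helper name is illustrative):

```python
def resolve_task_resources(spec_gpus, spec_num_nodes):
    """Preserve gpus=0 and take the node count from the spec."""
    # Buggy pattern: `spec_gpus or None` silently turns a legitimate 0 into None.
    num_gpus = spec_gpus if spec_gpus is not None else None
    num_nodes = spec_num_nodes if spec_num_nodes is not None else 1
    return num_gpus, num_nodes
```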

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 743-748: The sequential branch currently only raises on
JobStatus.FAILED, allowing other non-SUCCEEDED terminal states to continue;
update the check after backend.wait_for_completion(handle) (references:
sequential, backend.wait_for_completion, status, JobStatus, job_name) to treat
any terminal state other than JobStatus.SUCCEEDED as a fatal error—i.e., if
status != JobStatus.SUCCEEDED then raise RuntimeError with the job_name and
status included in the message so CANCELLED/UNKNOWN/timeouts also abort the
pipeline.
- Around line 696-700: The code uses dep.get("name") when the name is required,
which masks missing-key errors; replace dep.get("name") with direct access
dep["name"] in the dict-handling branches (the occurrences handling dict
dependency in declarative.py, including the branch that raises ValueError and
the identical pattern earlier around the other referenced spot) so a KeyError is
raised immediately if "name" is absent and fail-fast behavior is preserved; keep
the existing ValueError check/raise logic unchanged if you still want a custom
message after accessing the key.
- Around line 843-846: The code sets ports = [script.port] without validation,
which can pass None or 0 into ContainerSpec.ports; change the assignment to
validate script.port (e.g., ensure it's an int and > 0) and only append it if
valid (otherwise leave ports empty or filter out invalid values). Update the
block that currently checks hasattr(script, "port") and sets ports to use a
guard like isinstance(script.port, int) and script.port > 0 (or coerce/raise if
desired) so ContainerSpec receives only valid port numbers.
- Around line 809-819: The branch checking callable(script.inline) is dead
because _prepare_command already resolves callables via prepare_for_execution;
remove that conditional and simplify the logic in the block after
_prepare_command to directly set cmd_str = script.inline (and drop the
tuple-handling path), ensuring any downstream code uses the resolved string
returned by _prepare_command.
- Line 929: The log currently slices the list container.command[:50] which
doesn't truncate the joined command string; change logging to join the command
into a string first and then truncate characters (e.g., cmd_str = "
".join(container.command); msg = cmd_str[:50] + "..." if len(cmd_str) > 50 else
cmd_str) and use LOG.info to print that truncated msg (refer to LOG and
container.command in declarative.py).
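The terminal-state check from the first declarative.py item above can be sketched as follows. The `JobStatus` members mirror the names quoted in the review; the enum values are illustrative:

```python
from enum import Enum


class JobStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    UNKNOWN = "unknown"


def check_terminal_status(job_name, status):
    """Abort the pipeline on any terminal state other than SUCCEEDED."""
    if status != JobStatus.SUCCEEDED:
        raise RuntimeError(f"Job {job_name!r} finished with status {status.name}")
```

This way `CANCELLED` and `UNKNOWN` abort the sequential run just like `FAILED` does.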

In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 124-127: The script currently appends both "NCCL WARN" and "NCCL
ERROR" to result.errors causing benign WARNs to be treated as failures; change
the parsing in the loop so lines containing "NCCL WARN" are appended to
result.warnings (the existing warnings list declared on the result object) and
only "NCCL ERROR" go to result.errors, and then update validate_result (the
validate_result function that reports PASS/FAIL) to include and print warnings
without marking the run as FAIL (i.e., surface result.warnings as warnings in
the output but only treat entries in result.errors as failures).
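The warn/error split could look roughly like this; the `ParseResult` shape and function names approximate the script described in the review:

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def parse_nccl_log(lines):
    result = ParseResult()
    for line in lines:
        if "NCCL ERROR" in line:
            result.errors.append(line)    # only ERRORs fail the run
        elif "NCCL WARN" in line:
            result.warnings.append(line)  # surfaced, but non-fatal
    return result


def validate_result(result):
    """PASS iff there are no errors; warnings are printed but don't fail."""
    for warning in result.warnings:
        print(f"WARN: {warning}")
    return len(result.errors) == 0
```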

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Line 31: The Dockerfile currently leaves USER root set in the base stage and
never overrides it, so the final release stage still runs as root; update the
Dockerfile to create a non-root user (e.g., add group/user via groupadd/useradd
or use existing app user) in an appropriate stage (base or release) and switch
to that user before the final image is produced (replace or override the
existing USER root in the release stage), ensuring processes in the release
stage run as the non-root account instead of root.
- Around line 38-43: Replace the plain-HTTP repo and deprecated apt-key usage:
instead of echoing an http:// URL and using apt-key adv to fetch 7fa2af80.pub,
download the NVIDIA repository public key over HTTPS, dearmor it into a keyring
(e.g. /usr/share/keyrings/nvidia-devtools-archive-keyring.gpg) using curl | gpg
--dearmor, and write the repo entry to
/etc/apt/sources.list.d/nvidia-devtools.list with the
"signed-by=/usr/share/keyrings/nvidia-devtools-archive-keyring.gpg" option; then
run apt-get update and install nsight-systems-cli as before (remove the apt-key
adv call and the old 7fa2af80.pub reference).

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 145-165: The pod spec uses hostNetwork: true for Ray; harden the
trainer container by adding a container-level securityContext to prevent
privilege escalation: set allowPrivilegeEscalation: false (and optionally drop
capabilities/runAsNonRoot if compatible) on the container named "trainer" (the
container block with image nemo-skills/nemo-rl-full and command
"/scripts/run_sft_mn.sh"); ensure this change does not conflict with GPU access
or Ray by testing, and keep the change minimal to avoid altering networking
behavior.

In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 92-129: The MULTI_NODE_TRAIN_CMD currently embeds Python via an
unsupported -c flag for torchrun; replace the inline -c approach with the same
pattern used by the single-node template: write the Python block to a temp
script (e.g., /tmp/smoke_train.py) before invoking torchrun and then call
torchrun with that script path (keep environment exports, torchrun args
--nproc_per_node/--nnodes/--node_rank/--master_addr/--master_port intact);
update MULTI_NODE_TRAIN_CMD to reference the temp file and ensure whatever code
path assembles commands writes the Python payload to the same temp file name as
_SINGLE_NODE_TRAIN_TEMPLATE does.

In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 174-178: The broad except in the log streaming around
backend.get_logs(handle) hides unexpected errors; replace the blanket "except
Exception as e" with either no catch (let errors like AttributeError/TypeError
propagate) or catch only the expected kubernetes ApiException (e.g., catch
kubernetes.client.rest.ApiException) so only known K8s API errors are handled
and other bugs surface; update the try/except around the loop using
backend.get_logs(handle) (and the print(...) logic) accordingly.
- Line 52: The pip install command in scripts/k8s-tests/run_sft_k8s_real.py
hides stderr by using "pip install -q accelerate 2>/dev/null &&"; remove the
"2>/dev/null" redirection so installation errors are visible in container logs
(e.g., use "pip install -q accelerate &&" or explicitly log stderr), ensuring
failures still propagate via the existing && chain.

In `@scripts/k8s-tests/smoke_test_multi_node.sh`:
- Around line 270-278: The interactive cleanup block using read -p that prompts
for deletion (referencing JOB_NAME, SERVICE_NAME and NAMESPACE) must be removed
because it blocks in non-interactive CI; delete the entire prompt and
conditional (the read, echo, and if [[ $REPLY =~ ... ]] ... fi) and rely on the
existing cleanup/trap logic instead so the script performs non-interactive
cleanup automatically.
- Around line 20-30: Add a cleanup function and register a trap immediately
after the variable setup so the K8s Job and Service are deleted on early exit;
implement a function (e.g., cleanup) that checks and calls kubectl delete job
"$JOB_NAME" --namespace "$NAMESPACE" and kubectl delete svc "$SERVICE_NAME"
--namespace "$NAMESPACE" (ignore not-found errors), then register it with trap
'cleanup' EXIT ERR INT TERM so any failure or termination removes the created
resources; ensure JOB_NAME and SERVICE_NAME are referenced exactly as defined in
the script.

---

Nitpick comments:
In `@cluster_configs/example-kubernetes.yaml`:
- Around line 88-92: The default env_vars block currently sets NCCL_DEBUG=INFO
which will generate excessive log noise; change this in the example by either
commenting out NCCL_DEBUG or lowering it to NCCL_DEBUG=WARN in the env_vars list
(refer to the env_vars entry and the NCCL_DEBUG key) so the example config
doesn't flood logs on every run.

In `@nemo_skills/pipeline/backends/local.py`:
- Around line 130-137: The loop in the local backend uses enumerate(...) but
never uses the index variable `i`; rename `i` to `_` in the for statement (for
_, (proc, container_spec) in enumerate(zip(processes, spec.containers)): ) to
silence the unused-variable warning (Ruff B007) in the method that starts
log-collector threads and appends to job._log_threads in
nemo_skills.pipeline.backends.local (the block creating threading.Thread and
calling self._collect_logs).

In `@scripts/k8s-tests/image-build/import-image.yaml`:
- Around line 42-108: Change the containerd volume to mount only the socket file
instead of the whole directory: update the volume named containerd-sock (and its
corresponding volumeMount) so hostPath.path points to
/run/containerd/containerd.sock and the mountPath is
/run/containerd/containerd.sock, and set hostPath.type to Socket if supported;
leave the ctr invocation (--address /run/containerd/containerd.sock) and the
volumeMount name containerd-sock unchanged so the existing CTR usage continues
to work.

In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 186-202: The helper pod "copy-${name}" can be leaked if any
command fails after kubectl run; add a cleanup trap immediately after creating
the pod to always delete it on EXIT/ERR (use kubectl delete pod "copy-${name}"
-n "$NAMESPACE" --ignore-not-found) so failures in kubectl wait or kubectl cp
don’t leave the sleep pod running; ensure the trap is set before calling kubectl
wait and that the existing explicit delete at the end remains or is idempotent
with the trap to avoid double-delete errors.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml`:
- Line 46: The NCCL_DEBUG setting is duplicated: the shell script contains
"export NCCL_DEBUG=INFO" and the Kubernetes Job manifest also sets an env var
NCCL_DEBUG in the container spec; remove the redundancy by keeping
NCCL_DEBUG=INFO in only one place—either delete the shell script export or
remove the container env entry for NCCL_DEBUG in the Job spec—ensuring the
remaining location (shell export or container env) is the single source of truth
for NCCL_DEBUG.

@esnvidia esnvidia force-pushed the feature/kubernetes-backend branch from 093ff4c to d331162 Compare February 23, 2026 05:48
Implement a new ComputeBackend abstraction layer to support Kubernetes
as an alternative to Slurm for job execution.

New files:
- nemo_skills/pipeline/backends/base.py: Abstract interface and data classes
  (ComputeBackend, JobSpec, ContainerSpec, ResourceSpec, JobHandle, JobStatus)
- nemo_skills/pipeline/backends/kubernetes.py: Native K8s Job API backend
  with multi-container Pod support for server+client inference pattern
- nemo_skills/pipeline/backends/local.py: Local/Docker execution backend
- nemo_skills/pipeline/backends/slurm.py: Transitional wrapper around
  existing nemo-run integration (limited functionality, see docstring)
- nemo_skills/pipeline/backends/factory.py: Backend factory with fallback
- nemo_skills/pipeline/backends/integration.py: Helper utilities and
  architecture documentation for bridging with nemo-run
- cluster_configs/example-kubernetes.yaml: Example K8s cluster config
  with setup instructions for namespace, RBAC, PVCs, GPU scheduling
- tests/test_backends.py: Comprehensive unit tests (52 passing)

The Kubernetes backend maps Slurm heterogeneous jobs to multi-container
Pods where containers communicate via localhost, enabling the existing
server+client inference pattern to work on K8s clusters.

Note: The nemo-run based get_executor() remains the recommended interface
for Slurm workloads. See integration.py for migration guidance.

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Update BaseJobScript.hostname_ref() to return 'localhost' for Kubernetes
backend, since containers in the same Pod share the network namespace.

Changes:
- scripts.py: Add 'backend' field to BaseJobScript, update hostname_ref()
  to check backend type and return 'localhost' for Kubernetes
- declarative.py: Set backend field on scripts when running pipelines
- test_declarative_pipeline.py: Add tests for Kubernetes hostname behavior

This enables the server+client inference pattern to work on Kubernetes
where multi-container Pods communicate via localhost instead of Slurm's
SLURM_MASTER_NODE environment variable.

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Phase 3 enterprise features:

- cluster_configs/kubernetes/rbac.yaml: ServiceAccount, Role, RoleBinding
  templates for NeMo-Skills job management permissions
- cluster_configs/kubernetes/storage.yaml: PVC templates for models, data,
  results, and HF cache with cloud-specific examples
- cluster_configs/kubernetes/image-pull-secret.yaml: Instructions for creating
  image pull secrets for NGC and other private registries
- cluster_configs/kubernetes/README.md: Setup guide and troubleshooting

Observability improvements:
- kubernetes.py: Add structured logging with extra context fields
  (job_name, namespace, job_id, uid) for better log aggregation

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Documentation:
- docs/kubernetes-migration.md: Step-by-step guide for migrating from
  Slurm to Kubernetes including config conversion examples, storage
  setup, verification steps, and troubleshooting

Validation utilities:
- validate_kubernetes_config(): Validates K8s cluster config
- validate_slurm_config(): Validates Slurm cluster config
- validate_cluster_config(): Auto-detects executor and validates

These functions return a list of error messages, enabling pre-flight
checks before job submission.

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Add 10 tests for cluster configuration validation utilities:
- Kubernetes config validation (valid, full, missing namespace,
  missing containers, invalid storage)
- Slurm config validation (valid, missing account, missing partition)
- Generic validate_cluster_config routing tests

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Implement backend routing in Pipeline.run():
- Route to _run_kubernetes() for executor='kubernetes'
- Route to _run_nemo_run() for Slurm/local (existing behavior)

The _run_kubernetes() method:
- Converts CommandGroups to JobSpecs for KubernetesBackend
- Maps Commands to ContainerSpecs (multi-container Pods)
- Handles job dependencies between pipeline stages
- Supports sequential execution mode
- Includes dry_run for validation without submission

Add 6 tests for Pipeline Kubernetes integration:
- test_pipeline_routes_to_kubernetes
- test_pipeline_converts_command_group_to_job_spec
- test_pipeline_multi_container_conversion
- test_parse_timeout_formats
- test_slurm_still_uses_nemo_run
- test_kubernetes_does_not_use_nemo_run

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Refine Kubernetes backend with production-ready memory scheduling (auto-calculated
requests without hard limits), K8s-compliant job name sanitization with unique
suffixes, and auto-sequential mode for pipelines with job dependencies.

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- add Kubernetes backend multi-node support wiring and SFT routing updates

- add k8s test scripts, manifests, image-build tooling, and runbooks

- update cluster configs/RBAC/docs for multi-node service requirements

- expand backend and pipeline utility tests

- add pre-commit policy for multi-document k8s YAML and manifest validation hook

Signed-off-by: Emanuel Scoullos <escoullos@pleiades-mgmt-01.cm.cluster>
Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- replace invalid ResourceSpec memory_gb usages with memory_request_gb in backend integration helpers
- enforce required executor key access (fail fast) in backend factory, slurm backend, and declarative pipeline paths
- fix Kubernetes multi-node status handling so jobs only succeed when complete conditions/completions are met
- support shell variants (/bin/bash, sh, /bin/sh) for NODE_RANK/RANK export injection in distributed commands
- incorporate ContainerSpec.mounts into Kubernetes manifests and avoid shared mutable volume_mount lists across containers
- harden timeout parsing and error behavior in Kubernetes/declarative helpers
- fix Local backend Docker GPU arg quoting and timeout semantics
- improve Slurm backend command quoting (shlex.join), required ssh_tunnel key access, and avoid storing stale context-managed experiment handles
- update k8s validation/smoke scripts (dist init, torchrun nnodes/node_rank wiring, DNS failure checks, trap-based cleanup, unique ConfigMap names)
- tighten NCCL log parser transport regexes and clean up stale/incorrect docs/examples

Validation:
- pre-commit: passed (`./.venv/bin/pre-commit run --all-files`)
- scoped backend tests: passed (`env PYTHONPATH=. ./.venv/bin/pytest -q tests/test_backends.py ...`)
- live K8s scripts:
  - `scripts/k8s-tests/validate_sft_e2e.sh --gpus 1 --timeout 900` passed
  - `scripts/k8s-tests/validate_sft_e2e_multinode.sh --nodes 2 --gpus 2 --timeout 1200` passed

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- fix backend executor/fallback validation and run_job_and_wait cleanup

- harden declarative/slurm/kubernetes behavior and NCCL warning handling

- make single/multi-node smoke scripts CI-safe with automatic cleanup

- tighten kubectl manifest-validation fallback rules

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
@esnvidia esnvidia force-pushed the feature/kubernetes-backend branch from d331162 to 866fe61 on February 23, 2026 at 05:57
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (34)
nemo_skills/pipeline/backends/base.py (1)

92-121: Consider validating that command is non-empty.

name and image are validated, but an empty command list would pass silently and likely cause a backend error later. If the intent is to allow container-default entrypoints, a comment would clarify.
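A minimal sketch of what that validation could look like, assuming ContainerSpec is a dataclass with name, image, and command fields (the real class in base.py may have more fields and differ in detail):

```python
from dataclasses import dataclass, field


@dataclass
class ContainerSpec:
    # Standalone sketch: field names mirror the ones discussed above,
    # not necessarily the full definition in base.py.
    name: str
    image: str
    command: list = field(default_factory=list)

    def __post_init__(self):
        if not self.name:
            raise ValueError("ContainerSpec.name must be non-empty")
        if not self.image:
            raise ValueError("ContainerSpec.image must be non-empty")
        # Proposed check: fail fast on an empty command instead of
        # surfacing a less obvious backend error at submission time.
        if not self.command:
            raise ValueError("ContainerSpec.command must be non-empty")
```

If container-default entrypoints should remain allowed, drop the last check and document that an empty command intentionally defers to the image.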

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/base.py` around lines 92 - 121, In
ContainerSpec.__post_init__, validate the command field: if command is empty
(e.g., falsy or len(command) == 0) raise a ValueError describing that the
container command cannot be empty, or alternatively add a clear
docstring/comment on ContainerSpec explaining that an empty command is allowed
only to use the image's default entrypoint; update tests/usage accordingly;
reference the ContainerSpec dataclass and its __post_init__ method and the
command attribute when making this change.
docs/kubernetes-migration.md (1)

279-281: Minor: image tag inconsistency with the rest of the PR.

The test-pull example uses pytorch:24.01-py3 while the runbook (kubernetes-sft-runbook.md) and other manifests reference pytorch:25.03-py3. Consider aligning to avoid confusion for readers following both docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-migration.md` around lines 279 - 281, The kubectl test-pull
example uses image tag nvcr.io/nvidia/pytorch:24.01-py3 which is inconsistent
with the rest of the PR; update the image in the kubectl run command (the line
containing --image=nvcr.io/nvidia/pytorch:24.01-py3) to use the same tag used
elsewhere (pytorch:25.03-py3) and verify the change aligns with references in
kubernetes-sft-runbook.md and other manifests.
docs/kubernetes-sft-runbook.md (1)

170-176: Points 4 and 5 are redundant.

Both describe using podAntiAffinity to force pods onto separate nodes. Consider merging them into a single bullet that also mentions the optional nodeAffinity for heterogeneous GPU counts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-sft-runbook.md` around lines 170 - 176, Merge bullets 4 and 5
into a single item: replace the duplicate podAntiAffinity entries with one
bullet that explains applying podAntiAffinity to force pods onto separate nodes
and add a note that nodeAffinity can be used optionally for clusters with
heterogeneous GPU counts; reference the existing implementation name
KubernetesBackend._build_pod_anti_affinity() to ensure the wording matches the
code and remove the redundant line so the list becomes: Headless Service,
Indexed Job, distributed env vars, podAntiAffinity (+ optional nodeAffinity).
scripts/k8s-tests/check_nccl_logs.py (2)

133-137: Return type hint is underspecified.

-> tuple doesn't convey the structure. Per coding guidelines, type hints should be used for simple types.

♻️ Suggested fix
 def validate_result(
     result: NCCLCheckResult,
     expected_nodes: Optional[int] = None,
     expected_gpus_per_node: Optional[int] = None,
-) -> tuple:
+) -> tuple[bool, list[str]]:

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/check_nccl_logs.py` around lines 133 - 137, The return type
hint for validate_result is underspecified as "-> tuple"; change it to an
explicit typing annotation that matches the actual return structure (e.g.
Tuple[bool, List[str]] if the function returns a
success flag and a list of error/messages). Update the function signature in
validate_result to use typing.Tuple and typing.List (add the necessary imports
from typing) and adjust any related docstring/type comments to reflect the
precise return type (NCCLCheckResult remains the input type).

58-58: Unparameterized set type hint.

♻️ Suggested fix
-    ranks_seen: set = field(default_factory=set)
+    ranks_seen: set[int] = field(default_factory=set)

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/check_nccl_logs.py` at line 58, The field ranks_seen uses
an unparameterized set type; update its annotation to a parameterized typing.Set
(e.g., Set[int] or Set[str] depending on stored rank type) and keep the
default_factory as an empty set, e.g. change ranks_seen: set =
field(default_factory=set) to ranks_seen: Set[int] = field(default_factory=set)
and add the corresponding import from typing (from typing import Set) if missing
so the type hint is explicit and parameterized.
scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)

91-116: Consider adding minimal security hardening to the pod spec.

Static analysis (Trivy KSV-0014/KSV-0118, Checkov CKV_K8S_20/CKV_K8S_23) flags the missing security context. While this is a test manifest and GPU training jobs genuinely need filesystem writes, adding allowPrivilegeEscalation: false is a low-cost improvement that doesn't break functionality.

♻️ Suggested addition
       - name: trainer
         image: nemo-skills/nemo-rl-full:latest
         imagePullPolicy: Never
+        securityContext:
+          allowPrivilegeEscalation: false
         env:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` around lines 91 -
116, Add a minimal securityContext to the pod/containers to harden the manifest:
set allowPrivilegeEscalation: false at the container level (container name
"trainer") and optionally add runAsNonRoot/allowPrivilegeEscalation combination
in the pod spec to keep GPU writable behavior; update the spec under the
"trainer" container (and/or top-level pod spec) to include securityContext with
allowPrivilegeEscalation: false so the job still runs the existing command
("bash", "/scripts/run_sft.sh") and mounts but prevents privilege escalation.
nemo_skills/pipeline/utils/cluster.py (1)

62-77: Duplicate Slurm-format description in docstring.

Lines 66–68 list both supported formats, and lines 70–72 repeat the Slurm-format description already covered there. The two blocks overlap, making the docstring redundant.

♻️ Suggested cleanup
     """
-    Parse a timeout string into a timedelta object.
-
-    Supports both:
-    - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
-      "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
-    - Simple duration format used in K8s configs: "6h", "30m", "45s", "2d"
-
-    Time format for SLURM: "minutes", "minutes:seconds", "hours:minutes:seconds",
-    "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"
+    Parse a timeout string into a timedelta object.
+
+    Supports:
+    - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
+      "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
+    - Simple duration format used in K8s configs: "6h", "30m", "45s", "2d"
+
     https://slurm.schedmd.com/sbatch.html#OPT_time
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/cluster.py` around lines 62 - 77, Docstring
contains a duplicated description of the Slurm-format timeout; update the
docstring in nemo_skills/pipeline/utils/cluster.py to remove the redundant block
so it only lists the two supported formats once (keep one "Slurm format: ..."
bullet and one "simple duration format" bullet), ensuring references to
_parse_simple_duration_timeout and the subsequent code that normalizes value
remain accurate and unchanged.
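For illustration, a hedged sketch of a parser that accepts both formats the cleaned-up docstring lists; the function name and exact precedence are assumptions, and the actual implementation in cluster.py may differ:

```python
import re
from datetime import timedelta


def parse_timeout(value: str) -> timedelta:
    # Simple duration format used in K8s configs: "6h", "30m", "45s", "2d".
    m = re.fullmatch(r"(\d+)([smhd])", value)
    if m:
        unit = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m.group(2)]
        return timedelta(**{unit: int(m.group(1))})

    # Slurm format: with "-" the leading field is days and the remainder is
    # hours[:minutes[:seconds]]; without it, one field means minutes, two
    # means minutes:seconds, three means hours:minutes:seconds.
    if "-" in value:
        days_str, rest = value.split("-", 1)
        hours, minutes, seconds = ([int(p) for p in rest.split(":")] + [0, 0])[:3]
        return timedelta(days=int(days_str), hours=hours, minutes=minutes, seconds=seconds)
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        return timedelta(minutes=parts[0])
    if len(parts) == 2:
        return timedelta(minutes=parts[0], seconds=parts[1])
    return timedelta(hours=parts[0], minutes=parts[1], seconds=parts[2])
```

Note the one-field ambiguity: bare "90" is Slurm minutes, while "90s" is the simple-duration form, so the unit suffix must be checked first.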
scripts/k8s-tests/sft_smoke_test.py (1)

209-211: Second backend instance created for log retrieval.

Pipeline.run() already creates a backend internally. Creating another one here duplicates initialization (K8s API client, auth, etc.). Consider having Pipeline.run() return or expose its backend, or store it as an attribute.
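One way to avoid the duplicate initialization, sketched with stand-in names (FakeBackend and this get_backend are illustrative stubs, not the project's actual API):

```python
class FakeBackend:
    """Stand-in for a real backend whose construction is expensive
    (K8s API client, auth, etc.)."""

    def __init__(self, cluster_config: dict):
        self.cluster_config = cluster_config


def get_backend(cluster_config: dict) -> FakeBackend:
    # Illustrative factory; the real one lives in the backends package.
    return FakeBackend(cluster_config)


class Pipeline:
    def __init__(self, cluster_config: dict):
        self.cluster_config = cluster_config
        self.backend = None

    def run(self) -> FakeBackend:
        # Create the backend lazily and keep a reference so callers
        # (e.g. a smoke test fetching logs afterwards) can reuse it
        # instead of constructing a second instance.
        if self.backend is None:
            self.backend = get_backend(self.cluster_config)
        return self.backend
```

After pipeline.run(), log retrieval can then use pipeline.backend rather than calling get_backend(cluster_config) a second time.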

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/sft_smoke_test.py` around lines 209 - 211, The test is
creating a second backend via get_backend(cluster_config) which duplicates
initialization; instead remove the extra backend creation and use the backend
produced by the Pipeline instance (either by accessing a Pipeline attribute like
pipeline.backend or by changing Pipeline.run() to return its backend and
capturing that value). Locate usages of Pipeline.run() and replace the
standalone get_backend call and the backend variable with the pipeline-provided
backend (referencing Pipeline.run() and get_backend to find the code to change).
scripts/k8s-tests/image-build/import-image.yaml (1)

42-47: Privileged + hostPID are required for this use case.

The static analysis tools flag hostPID: true and privileged: true. These are necessary for the nsenter fallback (line 77) and direct containerd socket access. Since this is a development/test utility (not a production workload), the elevated privileges are acceptable. Consider adding an inline comment noting why these are needed, to prevent future reviewers from removing them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 42 - 47, Add an
inline YAML comment next to hostPID: true and under the import container's
securityContext (privileged: true) explaining these elevated settings are
intentional for this development/test utility because the fixture needs host PID
namespace for the nsenter fallback and privileged access for direct containerd
socket operations; place the comment near the container entry (container name
"import") so future reviewers understand they are required and not accidental.
cluster_configs/kubernetes/storage.yaml (1)

1-1: Copyright year inconsistency.

This file uses 2024 while other new files in this PR use 2026. Minor nit — align if the file was authored as part of this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/kubernetes/storage.yaml` at line 1, Update the copyright year
in the header comment that reads "# Copyright (c) 2024, NVIDIA CORPORATION.  All
rights reserved." to match the rest of the PR (change 2024 to 2026) so the file
header is consistent with other new files.
cluster_configs/kubernetes/rbac.yaml (1)

72-75: pods/log likely only needs get; list and watch are unnecessary.

The kubectl logs equivalent only requires the get verb on pods/log. The list and watch verbs are not used for log streaming and grant slightly broader access than needed.

Suggested tightening
   # Pod logs - required for log streaming
   - apiGroups: [""]
     resources: ["pods/log"]
-    verbs: ["get", "list", "watch"]
+    verbs: ["get"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster_configs/kubernetes/rbac.yaml` around lines 72 - 75, The RBAC rule
granting access to the "pods/log" resource is overly permissive; update the rule
that currently lists resources: ["pods/log"] and verbs: ["get", "list", "watch"]
so it only includes verbs: ["get"] for "pods/log", removing "list" and "watch"
to tighten permissions.
scripts/k8s-tests/manifests/multi-node-sft-test.yaml (2)

69-81: Headless Service selector uses only job-name label.

The Service selector (line 80) matches job-name: sft-multi-node, which is correct for this isolated test. Just note that the smoke test scripts use a dual-label selector (app: nemo-skills + job-name) which is a slightly more specific pattern. Consistency across manifests would help if these are used as reference examples.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 69 - 81,
The Service "sft-multi-node-workers" currently selects only on label "job-name:
sft-multi-node"; update its spec.selector to include the additional label "app:
nemo-skills" to match the dual-label pattern used by the smoke test scripts, and
ensure the corresponding Pods/Jobs that back this Service also include "app:
nemo-skills" in their metadata.labels (e.g., update the PodTemplate/Job spec
that creates the sft-multi-node pods) so the headless Service can resolve the
same set of endpoints consistently.

118-132: DNS wait loop lacks a post-loop failure check.

Unlike smoke_test_multi_node.sh (lines 155–158), the DNS loop here falls through silently if resolution never succeeds after 60 retries. torchrun will then fail with a less obvious connection error. Adding an explicit check improves debuggability.

Suggested addition after the DNS loop
           for i in $(seq 1 60); do
             getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local && break
             echo "  retry $i..."
             sleep 2
           done
+          if ! getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local > /dev/null 2>&1; then
+            echo "FAIL: Master DNS not resolvable after 120s"
+            exit 1
+          fi
           torchrun --nproc_per_node=2 --nnodes=2 \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 118 - 132,
The DNS wait loop that uses getent hosts to wait for sft-multi-node-0 may fall
through silently; after the for loop finishes, add an explicit check that the
host was resolved (e.g., re-run getent hosts
sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local or test a
flag) and if it failed, print a clear error including NODE_RANK and MASTER_ADDR
and exit non-zero so torchrun is not started; modify the script block around the
NODE_RANK export / getent loop (the section that sets NODE_RANK from
JOB_COMPLETION_INDEX and then calls torchrun) to perform this post-loop failure
check and exit with a descriptive message when resolution did not succeed.
scripts/k8s-tests/smoke_test_multi_node.sh (2)

252-254: kubectl wait failure conflates timeout with job failure.

If the job times out (exceeds activeDeadlineSeconds) or the kubectl wait itself times out, both paths set JOB_STATUS="failed". Unlike the single-node script (lines 207–213 there), there's no follow-up check here to distinguish timeout from actual pod failure. Consider adding the same kubectl get job ... -o jsonpath='{.status.failed}' check for consistent diagnostic output.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 252 - 254, The
current kubectl wait invocation sets JOB_STATUS="failed" for any non-success
path, conflating a wait timeout with actual job pod failures; update the
post-wait logic around the "kubectl wait --for=condition=complete ...
job/$JOB_NAME" invocation to, on failure, run a follow-up check using "kubectl
get job $JOB_NAME -n $NAMESPACE -o jsonpath='{.status.failed}'" (and/or inspect
".status.conditions" / ".status.active") to decide whether to set
JOB_STATUS="timeout" (or preserve a distinct timeout indicator) versus "failed"
based on the job's failed count or condition reasons; locate the logic
controlling JOB_STATUS and add this conditional check referencing JOB_NAME,
JOB_STATUS, NAMESPACE and the kubectl get job JSONPath to distinguish timeout
from actual pod failure.

123-133: Misleading WORLD_SIZE env var — contains node count, not total ranks.

The env var WORLD_SIZE is set to ${NUM_NODES} (line 129), but WORLD_SIZE conventionally means total ranks (nodes × GPUs). While torchrun will override WORLD_SIZE to the correct total before the Python script runs, the value is printed in the pre-torchrun echo at line 143 (WORLD_SIZE: \${WORLD_SIZE} (nodes)), which could confuse debugging. Consider renaming to NUM_NODES or NNODES to avoid shadowing the conventional meaning.

Suggested rename
         - name: WORLD_SIZE
-          value: "${NUM_NODES}"
+          value: "${WORLD_SIZE}"

Or rename to avoid ambiguity:

-        - name: WORLD_SIZE
-          value: "${NUM_NODES}"
+        - name: NUM_NODES
+          value: "${NUM_NODES}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 123 - 133, The env
var WORLD_SIZE is misleading because it contains the node count (NUM_NODES) not
total ranks; rename it to NUM_NODES (or NNODES) in the Kubernetes env block and
update all script references (the env entry name WORLD_SIZE, the pre-torchrun
echo that prints WORLD_SIZE: ${WORLD_SIZE} (nodes), and any other uses) so the
container env conveys node-count clearly while letting torchrun still override
WORLD_SIZE at runtime; ensure you change the env key in the env: list and
corresponding variable expansions throughout the script to NUM_NODES (or NNODES)
and verify no logic relies on the old WORLD_SIZE name.
scripts/k8s-tests/smoke_test_single_node.sh (1)

133-139: Exact floating-point comparison is fragile — use tolerance like the multi-node script.

Line 138 uses assert tensor.item() == expected, while the multi-node script (smoke_test_multi_node.sh line 211) uses abs(tensor.item() - expected) < 1e-5. Although the sum of small integers should be exact in float32, using a tolerance is safer and more consistent across the test suite.

Suggested fix
-assert tensor.item() == expected, f"All-reduce failed: got {tensor.item()}, expected {expected}"
+assert abs(tensor.item() - expected) < 1e-5, f"All-reduce failed: got {tensor.item()}, expected {expected}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 133 - 139, Replace
the fragile exact float equality in the all-reduce check: instead of asserting
tensor.item() == expected, use a tolerance-based comparison (e.g.,
abs(tensor.item() - expected) < 1e-5) to validate the result of dist.all_reduce
on tensor and keep the assertion message (referencing tensor, expected, and
dist.all_reduce) for clarity and consistency with the multi-node test.
scripts/k8s-tests/image-build/build-nemo-rl.yaml (1)

84-84: Editable install (-e .) in a container image is unnecessary — prefer regular install.

pip install -e . creates a symlink-based install pointing to /opt/NeMo-Skills/. In a container image this provides no benefit over a regular install and creates a fragile dependency on the source directory remaining intact. Use a standard install for image builds.

Suggested fix
-          RUN pip install --no-cache-dir -e .
+          RUN pip install --no-cache-dir .
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 84, Replace the
editable install invocation in the image build step: the RUN pip install
--no-cache-dir -e . line should be changed to a standard non-editable install
(e.g., RUN pip install --no-cache-dir .) so the container does not rely on
symlinked source files; update the single RUN step that contains "pip install -e
." to use the regular install flag and verify any build-time dependencies still
install correctly.
docs/kubernetes-status.md (2)

56-59: Hardcoded source line numbers will go stale quickly.

References like declarative.py:641-647 and integration.py:57-92 will become inaccurate after any refactor. Consider referencing by function/class name or using stable anchors (e.g., TODO tags with searchable identifiers) instead.

Also applies to: 71-81

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-status.md` around lines 56 - 59, Replace fragile hardcoded
source line references (e.g., "declarative.py:641-647" and
"integration.py:57-92") with stable identifiers: refer to the actual function or
class names that implement the behavior (e.g., the function that enforces job
dependencies in declarative.py and the integration test/class in integration.py)
or add searchable anchors/TODO tags (unique IDs) in the source that the docs can
point to; update the doc entries for the CLI examples (ns nemo_rl sft, ns
start_server --create_tunnel) and the table rows mentioned so they reference
those function/class names or anchor IDs instead of line numbers, and ensure the
same change is applied to the other affected block (lines referenced as 71-81).

169-178: Remove the placeholder ISSUE-001.

This placeholder with all-TBD fields adds no actionable information. The template above it (lines 139–165) is sufficient. Keeping a blank placeholder risks becoming permanently stale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-status.md` around lines 169 - 178, Remove the placeholder
issue block titled "### [ISSUE-001] Placeholder for First Issue" (the entire
block with Severity/Category/Discovered/Status all set to TBD); delete that
header and its subsequent description lines so only the real template (the
section above) remains and no empty/stale ISSUE-001 entry is left in the
document.
nemo_skills/pipeline/nemo_rl/sft.py (1)

199-219: Add type hints to _run_sft_kubernetes signature.

The function has many parameters without type hints. For consistency with the coding guidelines and readability, consider adding basic type hints.

Suggested signature
 def _run_sft_kubernetes(
-    cluster_config,
-    train_cmd,
-    expname,
-    output_dir,
-    log_dir,
-    num_gpus,
-    num_nodes,
-    dependent_jobs,
-    partition,
-    final_hf_path,
-    conversion_step,
-    average_steps,
-    remove_checkpoints_after_average,
-    backend,
-    max_position_embeddings,
-    installation_command,
-    dry_run,
-    run_after,
-    skip_hf_home_check=None,
+    cluster_config: dict,
+    train_cmd: str,
+    expname: str,
+    output_dir: str,
+    log_dir: str,
+    num_gpus: int,
+    num_nodes: int,
+    dependent_jobs: int,
+    partition: str,
+    final_hf_path: str,
+    conversion_step: str,
+    average_steps: str,
+    remove_checkpoints_after_average: bool,
+    backend: str,
+    max_position_embeddings: int,
+    installation_command: str,
+    dry_run: bool,
+    run_after: list,
+    skip_hf_home_check: bool = None,
 ):

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/sft.py` around lines 199 - 219, The
_run_sft_kubernetes function signature lacks type hints; update it to annotate
simple parameter types and the return type for clarity and consistency (e.g.,
annotate cluster_config as dict, train_cmd as str or list[str],
expname/output_dir/log_dir/run_after/installation_command/final_hf_path as
Optional[str], num_gpus/num_nodes/average_steps/conversion_step as int,
dependent_jobs as Optional[list[str]] or list[str], partition/backend as
Optional[str], max_position_embeddings as Optional[int],
remove_checkpoints_after_average/dry_run/skip_hf_home_check as bool, and return
type as None or an appropriate type). Apply these annotations directly on the
_run_sft_kubernetes signature and update any imports (typing.Optional,
typing.List) if needed.
scripts/k8s-tests/image-build/kaniko-build.sh (3)

37-37: REPO_ROOT is defined but never used.

ShellCheck SC2034. Remove or export if intended for external use.

Fix
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 37, REPO_ROOT is
declared but unused (ShellCheck SC2034); either delete the REPO_ROOT assignment
(remove the line using REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)") or
export it if callers expect it externally (change to export REPO_ROOT=...), and
then run a quick search for REPO_ROOT and SCRIPT_DIR references to confirm no
other code relies on it before committing.

106-109: node_selector variable value is unused — only its non-empty status matters.

Line 108 builds a JSON string for node_selector, but the heredoc at line 135 uses ${node_selector:+nodeName: ${NODE}}, which only checks if node_selector is non-empty and then directly uses $NODE. The JSON value is never consumed. Simplify by using a boolean flag or just checking $NODE directly:

Suggested fix
-    # Node selector
-    local node_selector=""
-    if [ -n "$NODE" ]; then
-        node_selector="\"nodeName\": \"$NODE\","
-    fi
...
-      ${node_selector:+nodeName: ${NODE}}
+      ${NODE:+nodeName: ${NODE}}

Also applies to: 133-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 106 - 109, The
node_selector variable is built as a JSON fragment but never used; the heredoc
uses the parameter expansion ${node_selector:+nodeName: ${NODE}} so only the
non-empty status matters. Simplify by removing node_selector and change
conditionals to check NODE directly (or introduce a simple boolean flag like
has_node), and update the heredoc insertion to use ${NODE:+nodeName: ${NODE}}
(or ${has_node:+nodeName: ${NODE}}) and remove any code that constructs
node_selector to avoid dead/unused variables; adjust all similar occurrences
(including the other spot referenced) to the same pattern.

144-144: Kaniko executor image uses unpinned :latest tag.

gcr.io/kaniko-project/executor:latest can break builds when Kaniko releases a new version with breaking changes. Pin to a specific digest or version tag for reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 144, Replace the
unpinned Kaniko executor image reference "gcr.io/kaniko-project/executor:latest"
with a fixed, reproducible identifier by specifying either a stable version tag
(e.g., "gcr.io/kaniko-project/executor:vX.Y.Z") or an immutable image digest
("gcr.io/kaniko-project/executor@sha256:...") in the same place where the image
is declared so builds don't break when Kaniko publishes new releases.
docs/kubernetes-nemo-rl-sft-runbook.md (1)

75-84: Add language specifiers to fenced code blocks.

Three code blocks (lines 75, 116, 137) are missing language tags. Use ```text for log output and the ASCII diagram to satisfy MD040 and improve rendering.

Also applies to: 116-127, 137-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-nemo-rl-sft-runbook.md` around lines 75 - 84, The three
fenced code blocks containing plain log/ASCII output (the block starting with "▶
Setting up data... ▶ Setting up compute cluster... ▶ Setting up model...", the
subsequent multi-line run/log block showing steps like "Setting up model...
Qwen2.5-0.5B loads via DTensorPolicyWorkerV2" and the ASCII diagram block) must
include a language specifier for Markdown linting; change their fences from ```
to ```text so they render as plain text and satisfy MD040 (e.g., replace the
three code fences that currently contain run logs/ASCII art with ```text).
scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s (1)

93-96: Silent error suppression for prefetch_venvs.py and generate_fingerprint.py.

Both lines use || echo "... not critical, continuing" to swallow failures. If prefetch_venvs.py fails, the resulting image will have missing Ray venvs, causing silent runtime failures later. Consider at minimum logging a warning to stderr, or letting the build fail so the issue is caught early.

If these are truly optional, document why they can be skipped (e.g., "only needed for offline mode") rather than blanket-suppressing errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s` around lines 93 - 96,
The RUN lines silently swallow failures for prefetch_venvs.py and
generate_fingerprint.py which can hide serious build/runtime issues; either
remove the "|| echo ..." so the build fails on error, or replace the suppression
with an explicit stderr warning and a clear comment explaining optionality
(e.g., why prefetch_venvs.py can be skipped such as "only needed for offline
mode" and what missing Ray venvs imply). Update the RUN invocations referencing
UV_LINK_MODE and nemo_rl/utils/prefetch_venvs.py and
tools/generate_fingerprint.py to either fail-fast or emit a visible stderr
warning and add a short inline comment documenting the rationale if you
intentionally allow continuation.
docs/kubernetes-guide.md (1)

46-74: Quick Start uses unpinned :latest image tags.

The quick start example uses vllm/vllm-openai:latest and nvcr.io/nvidia/nemo:24.07. While nemo:24.07 is pinned, the vLLM image is not. Consider using a pinned version in the quick start to avoid breaking changes for users following the guide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-guide.md` around lines 46 - 74, Replace the unpinned vLLM
image tag vllm/vllm-openai:latest in the cluster config with a specific, tested
version (e.g., vllm/vllm-openai:<stable-version>) to avoid breaking changes;
update the example cluster config (the YAML block creating the cluster config
and the server-type/server image references) and add a short note indicating the
chosen pinned version and why it was pinned, keeping nvcr.io/nvidia/nemo:24.07
unchanged if you want to keep it pinned.
scripts/k8s-tests/pipeline_smoke_test.py (1)

50-55: Remove unused noqa directive.

Ruff reports noqa: E402 on line 50 is unnecessary since the rule isn't enabled.

-from nemo_skills.pipeline.utils.declarative import (  # noqa: E402
+from nemo_skills.pipeline.utils.declarative import (
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 50 - 55, Remove the
unnecessary noqa directive from the import block: eliminate "  # noqa: E402"
from the line importing Command, CommandGroup, HardwareConfig, and Pipeline (the
import statement that references nemo_skills.pipeline.utils.declarative) so the
import reads normally without the trailing noqa comment.
nemo_skills/pipeline/backends/slurm.py (1)

128-141: Redundant conditional for requested_gpus.

ResourceSpec.gpus is typed as int with default 0 — it can never be None. The conditional if main_container.resources.gpus is not None else None always evaluates to the first branch.

Simplification
-            requested_gpus = main_container.resources.gpus if main_container.resources.gpus is not None else None
+            requested_gpus = main_container.resources.gpus
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 141,
requested_gpus is computed with a redundant None-check because ResourceSpec.gpus
is always an int (default 0); replace the conditional in the submission block so
requested_gpus is assigned directly from main_container.resources.gpus and pass
that to add_task (referencing main_container, resources.gpus, requested_gpus,
and add_task) — remove the unnecessary "if ... is not None else None" logic.
nemo_skills/pipeline/backends/factory.py (1)

112-141: Broad except Exception catches ValueError for unknown executors, masking clear errors with fallback noise.

When _create_backend raises ValueError("Unknown executor 'kuberntes'") due to a typo, this catch block logs a misleading "Failed to initialize" warning and potentially attempts a fallback before re-raising. The user sees confusing fallback chatter before the actual error.

Consider narrowing the catch to RuntimeError (initialization/health failures) and letting ValueError (invalid config) propagate immediately.

Suggested fix
         try:
             backend = BackendFactory._create_backend(executor, primary_config)

             # Health check
             if backend.health_check():
                 LOG.info(f"Successfully initialized {executor} backend")
                 return backend
             else:
                 raise RuntimeError(f"{executor} backend health check failed")

-        except Exception as e:
+        except RuntimeError as e:
             LOG.warning(f"Failed to initialize {executor} backend: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/factory.py` around lines 112 - 141, The
try/except around BackendFactory._create_backend is too broad and hides
ValueError from invalid executor names; change the except to catch only
RuntimeError so initialization/health failures are handled and fallback logic
(using fallback_executor and backend.health_check()) runs, but let ValueError
(and other configuration errors) propagate immediately without attempting
fallback—i.e., remove or avoid catching Exception from
BackendFactory._create_backend and only handle RuntimeError for the fallback
path.
nemo_skills/pipeline/backends/local.py (1)

232-267: Log follow mechanism relies on CPython GIL for thread safety.

The job.logs[container] list is written by _collect_logs threads (via append) and read by get_logs (via len and index access). This is safe under CPython's GIL but not formally guaranteed by the Python spec. Acceptable for a local dev/test backend, but worth noting if this is ever ported to other Python implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/local.py` around lines 232 - 267, The log
follow logic reads and iterates job.logs (in get_logs) while _collect_logs
appends to it from another thread, which relies on the CPython GIL; make access
explicit and thread-safe by adding a per-job lock (e.g., job.logs_lock) and
acquiring it whenever reading or writing job.logs: update get_logs (function
get_logs, references to job.logs and JobStatus.RUNNING) to acquire job.logs_lock
around calls to job.logs.get(), len(), and index access, and update the
log-producing code (_collect_logs or wherever append is done) to acquire the
same job.logs_lock before appending.
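A minimal sketch of the lock-based approach, using illustrative names (this Job class, append_log, and read_new_lines are not the actual local.py API) to show both sides acquiring the same per-job lock:

```python
import threading
from collections import defaultdict

class Job:
    """Illustrative job record; the real class in local.py differs."""
    def __init__(self):
        self.logs = defaultdict(list)      # container name -> list of log lines
        self.logs_lock = threading.Lock()  # guards all reads/writes of self.logs

def append_log(job, container, line):
    # Producer side (the _collect_logs thread): hold the lock while appending.
    with job.logs_lock:
        job.logs[container].append(line)

def read_new_lines(job, container, cursor):
    # Consumer side (get_logs): snapshot under the same lock and return the
    # new lines plus the updated cursor position, so len() and slicing never
    # race with a concurrent append.
    with job.logs_lock:
        lines = job.logs[container][cursor:]
        return lines, len(job.logs[container])
```

This keeps the follow loop correct on any Python implementation rather than relying on the CPython GIL.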
scripts/k8s-tests/image-build/build-and-load.sh (1)

128-129: Mutable :latest tag for Kaniko executor image.

gcr.io/kaniko-project/executor:latest is a mutable tag — successive runs may use different Kaniko versions, making builds non-reproducible and potentially introducing breaking changes.

Consider pinning to a specific digest or release tag (e.g., v1.23.2).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 128 - 129, The
Kaniko container image uses the mutable tag
'gcr.io/kaniko-project/executor:latest' (the "kaniko" container definition)
which can change between runs; replace that image reference with a fixed release
tag or immutable digest (for example 'gcr.io/kaniko-project/executor:v1.23.2' or
the image@sha256:... digest) to ensure reproducible builds and pin the Kaniko
version used by the build-and-load.sh pipeline.
nemo_skills/pipeline/backends/kubernetes.py (1)

700-716: Node selector heuristic relies on pool name containing "gpu"/"cpu".

The substring matching "gpu" in pool_name.lower() / "cpu" in pool_name.lower() is fragile — a pool named "bigcpu-with-gpu-label" would match both branches. If users don't follow this naming convention, GPU jobs could land on CPU pools or vice versa.

Consider documenting this convention clearly in the cluster config example, or allowing explicit pool-to-resource mapping in config.
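One way to sketch the explicit mapping option, with hypothetical names (select_node_pool and the explicit_map config key are illustrative, not the kubernetes.py API); the explicit map takes precedence and the substring heuristic remains only as a fallback:

```python
def select_node_pool(pools, needs_gpu, explicit_map=None):
    """Pick a node pool for a job.

    `pools` is a list of pool names; `explicit_map` optionally maps a pool
    name to "gpu" or "cpu", overriding the substring heuristic.
    """
    kind = "gpu" if needs_gpu else "cpu"
    # Explicit pool-to-resource mapping from config wins when present.
    if explicit_map:
        for pool in pools:
            if explicit_map.get(pool) == kind:
                return pool
    # Fallback substring heuristic: fragile for names like
    # "bigcpu-with-gpu-label", which contain both substrings.
    for pool in pools:
        if kind in pool.lower():
            return pool
    return None
```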

nemo_skills/pipeline/utils/declarative.py (2)

682-714: Dependency resolution logic is duplicated between _run_nemo_run and _run_kubernetes.

The dependency parsing and resolution block (lines 682–714) is structurally identical to lines 503–557 in _run_nemo_run, with only minor behavioral differences (e.g., K8s skips external string deps). This duplication increases maintenance burden — any future fix to dependency handling must be applied in both places.

Consider extracting a shared _resolve_dependencies(job_spec, job_name_to_handle, ...) helper that returns the resolved dependency objects, with the backend-specific behavior (external deps handling) parameterized by executor type.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 682 - 714, The
dependency parsing block duplicated in _run_kubernetes and _run_nemo_run should
be extracted into a single helper (e.g. _resolve_dependencies(job_spec,
job_name_to_handle, original_job_name, run_after, allow_external=True)) that
returns the dependency_handles list and logs; update both _run_kubernetes and
_run_nemo_run to call this helper, passing allow_external=False (or
executor="kubernetes") for _run_kubernetes so string/external deps are skipped
with a warning and allow_external=True (or executor="nemo") for _run_nemo_run to
preserve existing behavior; ensure the helper validates dict deps (raises the
same ValueError when name missing or job not processed) and preserves the
original logging messages.
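A sketch of the suggested extraction, under the assumption that deps are either strings (external) or dicts with a "name" key; the function name and signature are hypothetical, and the real helper would also carry the logging described above:

```python
def resolve_dependencies(job_spec_deps, job_name_to_handle, allow_external=True):
    """Resolve declared dependencies to submitted-job handles.

    `allow_external` parameterizes the backend-specific behavior: the
    nemo_run path keeps external string deps, the Kubernetes path skips them.
    """
    handles = []
    for dep in job_spec_deps:
        if isinstance(dep, str):
            # External (cross-pipeline) dependency: only some backends support
            # these; callers that don't should skip with a warning.
            if allow_external:
                handles.append(dep)
            continue
        name = dep.get("name")
        if not name:
            raise ValueError("Dependency entry is missing a 'name'")
        if name not in job_name_to_handle:
            raise ValueError(f"Dependency '{name}' has not been processed yet")
        handles.append(job_name_to_handle[name])
    return handles
```

Both _run_nemo_run and _run_kubernetes would then call this with the appropriate flag, so future fixes land in one place.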

790-807: Multi-node semantics: all containers run on all nodes with maximum num_nodes.

Taking the max num_nodes across groups (line 806) means that if one group requests 2 nodes and another requests 1, all containers (including those from the 1-node group) will run on both nodes in the Indexed Job. This is fine for distributed training containers but may be surprising for auxiliary services (sandbox, client) that don't need multi-node replication.

This is a reasonable K8s simplification given that Indexed Jobs use a single pod template, but worth documenting if mixed num_nodes across groups becomes a supported pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 790 - 807, The code
takes the maximum num_nodes across groups (variable num_nodes computed from
group.hardware.num_nodes) which causes all containers to run on every node in
the Kubernetes IndexedJob pod template; add an explicit comment/docstring near
the loop over groups and/or emit a runtime warning/log when groups request
different num_nodes values (compare each group.hardware.num_nodes against the
current max) so callers are informed that auxiliary groups will be replicated to
the max, and document this IndexedJob semantic in the declarative mapping
(referencing groups, num_nodes, and group.hardware.num_nodes).
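The runtime warning could look like this sketch, assuming groups expose .hardware.num_nodes as described above (the helper name is illustrative):

```python
import logging

LOG = logging.getLogger(__name__)

def effective_num_nodes(groups):
    """Node count for the single Indexed Job pod template.

    Mirrors the max-across-groups behavior and warns when groups disagree,
    so auxiliary single-node groups know they will be replicated.
    """
    counts = [g.hardware.num_nodes for g in groups]
    max_nodes = max(counts) if counts else 1
    if any(c != max_nodes for c in counts):
        LOG.warning(
            "Groups request differing num_nodes %s; all containers will be "
            "replicated across the maximum (%d) nodes of the Indexed Job.",
            counts, max_nodes,
        )
    return max_nodes
```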

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

🧹 Nitpick comments (22)
scripts/k8s-tests/image-build/kaniko-build.sh (1)

36-37: REPO_ROOT (and by extension SCRIPT_DIR) is unused — dead code.

REPO_ROOT is never referenced after its definition, and SCRIPT_DIR is only used to define REPO_ROOT. ShellCheck SC2034 confirms this.

✂️ Proposed fix
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 36 - 37, Remove
the dead variables SCRIPT_DIR and REPO_ROOT from kaniko-build.sh: these are
defined but never used (SCRIPT_DIR only serves to compute REPO_ROOT). Delete the
two lines that set SCRIPT_DIR and REPO_ROOT (the symbols SCRIPT_DIR and
REPO_ROOT) so there is no unused-variable noise and ShellCheck SC2034 is
resolved; if the script actually needs the repository path later, instead
compute and use it where required rather than defining them unused at the top.
nemo_skills/pipeline/utils/cluster.py (1)

63-73: Remove the duplicate Slurm format description from the docstring.

Lines 65–68 already enumerate the Slurm formats under the new "Supports both:" section; lines 70–72 repeat the same formats verbatim. Keep the SLURM link and drop the redundant text.

✂️ Proposed cleanup
     """
-    Parse a timeout string into a timedelta object.
-
-    Supports both:
-    - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
-      "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
-    - Simple duration format used in K8s configs: "6h", "30m", "45s", "2d"
-
-    Time format for SLURM: "minutes", "minutes:seconds", "hours:minutes:seconds",
-    "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"
-    https://slurm.schedmd.com/sbatch.html#OPT_time
+    Parse a timeout string into a timedelta.
+
+    Supports:
+    - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
+      "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
+      https://slurm.schedmd.com/sbatch.html#OPT_time
+    - Simple duration: "6h", "30m", "45s", "2d"
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/cluster.py` around lines 63 - 73, The docstring
for the timeout parsing utility (in the function/method that describes SLURM and
K8s formats—look for the parse_timeout or similar function in cluster.py)
repeats the SLURM format list twice; remove the redundant second repetition (the
lines that repeat "Time format for SLURM: ..." and the list that follows) while
keeping the initial "Supports both:" section and retaining the SLURM reference
link (https://slurm.schedmd.com/sbatch.html#OPT_time) so the docstring only
lists the formats once and remains accurate and concise.
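For reference, the dual-format parsing the docstring describes can be sketched as follows; this is a minimal illustration, not the actual cluster.py implementation, which may validate more strictly:

```python
import re
from datetime import timedelta

def parse_timeout(value):
    """Parse a Slurm time string or a simple K8s-style duration into a timedelta."""
    simple = re.fullmatch(r"(\d+)([smhd])", value)
    if simple:  # simple duration: "6h", "30m", "45s", "2d"
        unit_map = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
        return timedelta(**{unit_map[simple.group(2)]: int(simple.group(1))})

    days = 0
    if "-" in value:  # Slurm "days-hours[:minutes[:seconds]]"
        day_part, value = value.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in value.split(":")]
        hours, minutes, seconds = (parts + [0, 0])[:3]
    else:
        parts = [int(p) for p in value.split(":")]
        if len(parts) == 1:    # "minutes"
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:  # "minutes:seconds"
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                  # "hours:minutes:seconds"
            hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
```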
scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)

196-207: kubectl wait --for=condition=complete blocks for the full timeout when the job fails.

If a pod crashes immediately, the job transitions to the Failed condition (not Complete), and kubectl wait continues polling until ${TIMEOUT}s (default 900 s) before exiting non-zero. Watch both conditions (multiple --for flags act as OR) to get an early exit on failure (supported since Kubernetes 1.23):

♻️ Proposed improvement
-if ! kubectl wait --for=condition=complete --timeout="${TIMEOUT}s" "job/$JOB_NAME" -n "$NAMESPACE"; then
+if ! kubectl wait \
+       --for=condition=complete \
+       --for=condition=failed \
+       --timeout="${TIMEOUT}s" \
+       "job/$JOB_NAME" -n "$NAMESPACE"; then

Note: Multiple --for conditions act as OR and require Kubernetes ≥ 1.23. After the wait, check which condition was reached if you need to distinguish failure from timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh` around lines 196 - 207, The
script's kubectl wait call blocks until the full ${TIMEOUT} if the Job fails
because it only waits for condition=complete; update the wait invocation that
references JOB_NAME and TIMEOUT to include both conditions (e.g.,
--for=condition=complete --for=condition=failed) so the command returns as soon
as the Job either completes or fails (Kubernetes ≥1.23 required), and after the
wait check the Job status (inspect the job object or use kubectl get
job/$JOB_NAME -o jsonpath='{.status}') to distinguish failure vs timeout and
proceed to print pod logs and exit appropriately.
nemo_skills/pipeline/backends/base.py (1)

116-121: Consider validating that command is non-empty.

ContainerSpec.__post_init__ validates name and image but doesn't check command. An empty command list ([]) would produce an invalid container spec that would fail at submission time with a less clear error from the Kubernetes API.

♻️ Suggested addition
     def __post_init__(self):
         if not self.name:
             raise ValueError("Container name cannot be empty")
         if not self.image:
             raise ValueError("Container image cannot be empty")
+        if not self.command:
+            raise ValueError("Container command cannot be empty")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/base.py` around lines 116 - 121,
ContainerSpec.__post_init__ currently checks name and image but not command;
update __post_init__ to validate that the command attribute is present and
non-empty (e.g., not None and len(command) > 0 or command != []), and raise a
ValueError with a clear message like "Container command cannot be empty" when
validation fails; locate the ContainerSpec class and modify its __post_init__
method to include this check alongside the existing name and image checks.
nemo_skills/pipeline/utils/declarative.py (2)

428-456: Timestamp-based uniqueness has a small collision window.

int(time.time()) % 100000 wraps every ~27.7 hours, so two pipeline runs exactly 27.7h apart could collide. This is documented and acceptable for typical usage, but if multiple pipelines run concurrently within the same second, they'll also get the same suffix. Consider incorporating os.getpid() or a random component if concurrent pipeline launches are expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 428 - 456, The
_make_unique_job_name function currently uses time-based suffix int(time.time())
% 100000 which can collide for concurrent launches or runs ~27.7h apart; update
the suffix to include an additional uniqueness component (e.g., append the
process id from os.getpid() or a short random hex from secrets.token_hex)
combined with the timestamp so suffix becomes e.g. "-{timestamp}{pid_or_rand}".
Ensure you adjust suffix_len calculation and the max_base_len truncation logic
accordingly so the final name still respects max_length and strip trailing
hyphens from the truncated base_name as before.
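A sketch of the pid-augmented variant, assuming the same truncation behavior the prompt describes (the helper name and the 63-character default mirror the K8s DNS-label limit; details may differ from declarative.py):

```python
import os
import time

def make_unique_job_name(base_name, max_length=63):
    """Append a timestamp+pid suffix so concurrent launches don't collide.

    The timestamp alone wraps every ~27.7 hours and repeats within a second;
    mixing in the pid disambiguates concurrent processes on the same host.
    """
    suffix = f"-{int(time.time()) % 100000}-{os.getpid() % 10000}"
    max_base_len = max_length - len(suffix)
    # Truncate and strip trailing hyphens so the result stays a valid label.
    base = base_name[:max_base_len].rstrip("-")
    return f"{base}{suffix}"
```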

826-832: num_tasks mapped to cpus conflates processes-per-node with CPU cores.

HardwareConfig.num_tasks represents "Number of tasks (processes) per node" (Slurm-style MPI tasks), but ResourceSpec.cpus represents CPU core requests for Kubernetes scheduling. These are different concepts — a job with num_tasks=1 may still need multiple CPU cores, and a job with num_tasks=4 (for data loading) doesn't necessarily need exactly 4 CPU cores.

Consider using a separate cpus field on HardwareConfig or computing a reasonable default for Kubernetes (e.g., max(num_tasks, 4) for GPU workloads).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 826 - 832, The code
incorrectly maps HardwareConfig.num_tasks (Slurm-style "tasks per node") to
ResourceSpec.cpus, conflating processes with CPU core requests; update the
mapping in the block that constructs resources (where group.hardware or
HardwareConfig() is used and ResourceSpec(...) is created) to use a proper CPU
source: add and prefer a new HardwareConfig.cpus field if present, otherwise
compute a safer default (e.g., max(hardware.cpus if set else 0,
hardware.num_tasks, 4) for GPU workloads or another sensible heuristic), and set
ResourceSpec.cpus to that computed value instead of hardware.num_tasks so CPU
core requests reflect cores, not MPI tasks.
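The suggested default could be sketched like this; resolve_cpu_request and its arguments are illustrative names, not the actual HardwareConfig/ResourceSpec fields:

```python
def resolve_cpu_request(explicit_cpus, num_tasks, num_gpus):
    """Pick a CPU core request distinct from the MPI task count.

    Prefers an explicit cpus field when set; otherwise applies a heuristic
    floor so GPU jobs get enough cores for data loading even at num_tasks=1.
    """
    if explicit_cpus:        # explicit request from config wins
        return explicit_cpus
    if num_gpus > 0:         # GPU workloads: at least a few cores for loaders
        return max(num_tasks, 4)
    return max(num_tasks, 1) # CPU-only: at least one core
```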
scripts/k8s-tests/pipeline_smoke_test.py (2)

197-216: Dry-run path accesses private _convert_groups_to_job_spec method.

This is acceptable in a smoke test/validation script, but note that this couples the test to an internal API. If the method signature or name changes, this script will break silently (no import-time error). A brief comment noting this dependency would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 197 - 216, The dry-run
path calls the private pipeline._convert_groups_to_job_spec method (after
pipeline.run(dry_run=True)), which couples this smoke test to an internal API;
add a short inline comment immediately above the call to
_convert_groups_to_job_spec explaining that this is intentionally using a
private/internal method and warning maintainers that changes to that method name
or signature will break the test, and optionally mention why it's safe here
(validation-only) so future developers understand the dependency.

236-252: get_backend and JobStatus imported inside function body — verify lazy import necessity.

The import at line 236 is inside main() to avoid import-time failures when the Kubernetes backend isn't installed. This is reasonable for a script that may run in dry-run mode without the backend. However, wait_for_completion at line 242 uses a hardcoded timeout of 900s that isn't configurable from the CLI.

♻️ Consider making the wait timeout configurable
     parser.add_argument("--image", default=os.environ.get("PYTORCH_IMAGE", "nvcr.io/nvidia/pytorch:25.03-py3"))
+    parser.add_argument("--timeout", type=int, default=900, help="Wait timeout in seconds (default: 900)")
     args = parser.parse_args()

Then use args.timeout at line 242.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 236 - 252, Replace the
hardcoded wait timeout (900) passed to backend.wait_for_completion with a
CLI-configurable value: add a command-line argument (e.g.,
parser.add_argument('--timeout', type=int, default=900)) where args is built,
then call backend.wait_for_completion(handle, timeout=args.timeout); keep the
lazy imports (get_backend, JobStatus) inside the function as they are to avoid
import-time failures.
scripts/k8s-tests/validate_sft_e2e.sh (1)

303-311: Failure path doesn't save logs to the file before exiting.

When the job fails, the script prints the last 50 lines of pod logs to stdout (line 308) but doesn't save them to $LOG_FILE like the success path does (lines 314-317). This makes post-mortem debugging harder since the log file won't exist on failure.

♻️ Suggested fix
 if ! kubectl wait --for=condition=complete --timeout="${TIMEOUT}s" "job/$JOB_NAME" -n "$NAMESPACE"; then
     echo ""
     echo "FAILED — Job did not complete. Logs:"
     POD=$(kubectl get pods -l "job-name=$JOB_NAME" -n "$NAMESPACE" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
     if [ -n "$POD" ]; then
-        kubectl logs "$POD" -n "$NAMESPACE" --tail=50
+        LOG_FILE="${SCRIPT_DIR}/logs/sft-e2e-$(date +%Y%m%d-%H%M%S).log"
+        mkdir -p "$(dirname "$LOG_FILE")"
+        kubectl logs "$POD" -n "$NAMESPACE" > "$LOG_FILE" 2>&1 || true
+        echo "Full logs saved to: $LOG_FILE"
+        tail -50 "$LOG_FILE"
     fi
     exit 1
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 303 - 311, The failure
branch that handles a non-completing job should capture and append the pod logs
into the same $LOG_FILE used by the success path; update the block that runs
kubectl logs (the branch that runs kubectl wait --for=condition=complete ... for
job/$JOB_NAME and then fetches POD via kubectl get pods) so that it writes the
diagnostic output into $LOG_FILE (create the file if necessary) — e.g., redirect
the kubectl logs output and any echo messages into $LOG_FILE (and still print to
stdout if desired) so post-mortem artifacts are preserved for JOB_NAME and
NAMESPACE. Ensure you use the same $LOG_FILE variable and preserve the current
--tail=50 behavior when saving logs.
scripts/k8s-tests/smoke_test_single_node.sh (1)

202-213: kubectl wait --for=condition=complete won't distinguish job failure from timeout.

kubectl wait --for=condition=complete returns non-zero both when the job fails and when the wait times out. The differentiation at lines 207-213 only checks .status.failed but doesn't handle the case where the job is still running (timed-out wait) vs. actually failed. Consider also checking condition=failed explicitly for a clearer status report.

♻️ Suggested improvement
 # Wait for completion
-kubectl wait --for=condition=complete --timeout=${TIMEOUT_SECONDS}s \
-    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null && JOB_STATUS="succeeded" || JOB_STATUS="failed"
-
-# Check if it actually failed vs timed out
-if [ "$JOB_STATUS" = "failed" ]; then
-    # Check if job failed or is still running
-    FAILED=$(kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath='{.status.failed}' 2>/dev/null || echo "0")
-    if [ "${FAILED:-0}" = "0" ]; then
-        echo "Job timed out waiting for completion"
-    fi
-fi
+if kubectl wait --for=condition=complete --timeout=${TIMEOUT_SECONDS}s \
+    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null; then
+    JOB_STATUS="succeeded"
+elif kubectl wait --for=condition=failed --timeout=5s \
+    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null; then
+    JOB_STATUS="failed"
+else
+    JOB_STATUS="timed-out"
+    echo "Job timed out waiting for completion"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 202 - 213, The
current wait treats any non-zero exit as "failed" but only checks
.status.failed; update the post-wait logic to distinguish an actual job failure
vs a timeout/still-running job by querying the job status with kubectl get job
"$JOB_NAME" -n "$NAMESPACE" -o jsonpath and examining .status.failed and
.status.active (or .status.conditions[*].type for "Failed"/"Complete").
Concretely: after the kubectl wait block, if JOB_STATUS="failed" run a kubectl
get job and if .status.failed > 0 set a clear failure message, else if
.status.active > 0 report "Job timed out/still running", otherwise inspect
.status.conditions for a Failed condition to report failure; use the existing
JOB_NAME, JOB_STATUS and FAILED variables and the same namespace variable to
implement these checks.
scripts/k8s-tests/check_nccl_logs.py (1)

133-137: Parameterize the tuple return type hint.

tuple without parameters doesn't communicate the return shape. Since you're already importing from typing, this is a quick win for readability.

♻️ Suggested fix
 def validate_result(
     result: NCCLCheckResult,
     expected_nodes: Optional[int] = None,
     expected_gpus_per_node: Optional[int] = None,
-) -> tuple:
+) -> Tuple[bool, List[str]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/check_nccl_logs.py` around lines 133 - 137, The return type
for validate_result is currently unparameterized as tuple; update it to a
specific Tuple[...] annotation (e.g., Tuple[bool, Optional[str]] if the function
returns a success boolean and an optional message) by importing Tuple from
typing if necessary and replacing the signature def validate_result(...) ->
tuple: with def validate_result(...) -> Tuple[bool, Optional[str]]: (or the
exact Tuple[...] that matches the actual return values in validate_result).
scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)

43-73: Minor: NCCL_DEBUG=INFO set redundantly in both the script and container env.

Line 46 (export NCCL_DEBUG=INFO in the script) and line 98 ({name: NCCL_DEBUG, value: "INFO"} in the container env) both set the same variable. The container env is sufficient and the script export is redundant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` around lines 43 - 73,
The run_sft.sh script redundantly exports NCCL_DEBUG=INFO while the container
environment already sets NCCL_DEBUG; remove the export line from the run_sft.sh
content (the line "export NCCL_DEBUG=INFO") to avoid duplicate environment
configuration, keeping the container env entry ({name: NCCL_DEBUG, value:
"INFO"}) as the single source of truth.
docs/kubernetes-sft-runbook.md (1)

44-49: The sleep 15 approach for log capture is fragile.

The driver-check command uses sleep 15 before reading logs, which is a race condition — the pod might not have completed in time. Consider using kubectl wait instead:

Suggested improvement
-sleep 15 && kubectl logs driver-check && kubectl delete pod driver-check --ignore-not-found
+kubectl wait --for=condition=Ready pod/driver-check --timeout=60s && \
+  kubectl logs driver-check && kubectl delete pod driver-check --ignore-not-found
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-sft-runbook.md` around lines 44 - 49, Replace the fragile
"sleep 15" pattern with an explicit kubectl wait for the driver-check pod; after
running kubectl run driver-check, call kubectl wait
--for=condition=ContainersReady pod/driver-check --timeout=30s (or
--for=condition=Ready if you prefer) and then run kubectl logs driver-check and
kubectl delete pod driver-check --ignore-not-found to ensure logs are captured
reliably; update references to kubectl run driver-check, kubectl logs
driver-check, and kubectl delete pod driver-check accordingly.
scripts/k8s-tests/image-build/build-nemo-rl.yaml (1)

53-53: Consider pinning image tags for reproducibility.

Both alpine/git:latest (line 53) and gcr.io/kaniko-project/executor:latest (line 101) use mutable latest tags. For build reproducibility, consider pinning to specific versions (e.g., alpine/git:v2.45.2, gcr.io/kaniko-project/executor:v1.23.2).

Also applies to: 101-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 53, Replace mutable
"latest" image tags with fixed versioned tags for reproducibility: update the
image fields referencing "alpine/git:latest" and
"gcr.io/kaniko-project/executor:latest" to specific, pinned versions (e.g.,
"alpine/git:v2.45.2" and "gcr.io/kaniko-project/executor:v1.23.2") wherever
those image strings appear in the manifest so builds use immutable, reproducible
images.
scripts/k8s-tests/image-build/build-and-load.sh (2)

128-129: Kaniko image not version-pinned.

gcr.io/kaniko-project/executor:latest can change without notice, potentially breaking builds. Consider pinning to a specific digest or version tag for reproducibility.

Suggested fix
-    image: gcr.io/kaniko-project/executor:latest
+    image: gcr.io/kaniko-project/executor:v1.23.2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 128 - 129, The
kaniko executor image is using the mutable tag
"gcr.io/kaniko-project/executor:latest"; update the image reference in the build
job (the "kaniko" container definition) to a specific version tag or immutable
digest (for example a released version tag or an `@sha256`:... digest) to ensure
reproducible builds and prevent breaking changes from "latest". Locate the
"kaniko" container entry in build-and-load.sh and replace the image value with a
pinned tag or digest.

234-255: Loader pods run privileged with full host root mount — document the security implications.

The loader pod uses privileged: true and mounts / from the host as read-only. While necessary for ctr import, this is a significant security surface. Consider adding a brief inline comment documenting why this is needed and the security trade-off.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 234 - 255, The
loader pod created by the kubectl run command (named "load-$NODE") is set to
privileged: true and mounts the host root via volume "host-root" (mountPath
"/host"), which is a high-risk security surface; add a concise inline comment
above this pod definition explaining that privileged mode and the host-root
mount are required to allow running ctr import against the host containerd
socket (volume "containerd-sock") but they increase attack surface and should
only be used in trusted test clusters, and note any mitigation (readOnly
host-root, namespace -n "$NAMESPACE", and why those are insufficient); reference
the kubectl run block, the "loader" container, securityContext.privileged, and
the "host-root"/"containerd-sock" volumes in the comment.
docs/kubernetes-nemo-rl-sft-runbook.md (1)

75-84: Add language identifiers to fenced code blocks.

Three fenced code blocks (lines 75, 116, 137) lack a language identifier, which triggers markdownlint MD040. Since these show expected output, use ```text.

Suggested fix

Line 75:

-```
+```text

Line 116:

-```
+```text

Line 137:

-```
+```text

Also applies to: 116-127, 137-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/kubernetes-nemo-rl-sft-runbook.md` around lines 75 - 84, The three
fenced code blocks showing run output (the blocks containing "▶ Setting up
data... … === REAL NEMO-RL SFT COMPLETED ===" and the other two similar
expected-output blocks) are missing a language identifier and trigger
markdownlint MD040; update each opening triple-backtick to use ```text so the
blocks are explicitly plain text (e.g., replace ``` with ```text for the fenced
blocks that contain the expected output sequences).
scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml (1)

48-49: Redundant self-assignment of MASTER_ADDR.

MASTER_ADDR=${MASTER_ADDR} on line 49 just reassigns the environment variable to itself. The comment says "use the headless service DNS which resolves to node IPs", but the value already comes from the pod env (line 153–154). This line can be removed or replaced with a comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around
lines 48 - 49, The env entry MASTER_ADDR=${MASTER_ADDR} is a redundant
self-assignment; remove this line from the container/env block (or replace it
with a clarifying comment) so the pod uses the existing MASTER_ADDR provided
earlier, and ensure any intended headless-service DNS is explicitly assigned if
needed (search for MASTER_ADDR and the env block in the manifest to locate the
exact spot).
nemo_skills/pipeline/backends/slurm.py (1)

128-156: Stored task reference from closed context manager is unused — acceptable for now.

The task object (line 145) is created inside the with get_exp(...) block and stored in self._jobs. After the block exits, this reference may be stale. However, since the status methods (get_status, wait_for_completion, cancel_job) currently return placeholders and don't use the stored task, this is safe for the transitional implementation. When implementing the TODOs, avoid relying on this stored task reference — use sacct or re-open the experiment instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 156, The code
stores a `task` (from add_task) into `self._jobs` while inside the `with
get_exp(...)` context which may become stale after the context closes; update
future implementations of `get_status`, `wait_for_completion`, and `cancel_job`
to not rely on this stored `task` reference (symbols: task, get_exp, self._jobs)
— instead query Slurm directly (e.g., via `sacct`, `squeue`, or re-open the
experiment with `get_exp` when needed) or store only stable identifiers
(experiment name / Slurm job id) in `self._jobs` and use those to retrieve live
status/actions.
nemo_skills/pipeline/backends/kubernetes.py (1)

268-268: Consider strict=True on zip() for defensive safety.

zip(spec.containers, pod_spec.containers) should always have equal lengths since pod_spec.containers is built from spec.containers on lines 209–211. However, adding strict=True would catch any future divergence.

-        for container_spec, container in zip(spec.containers, pod_spec.containers):
+        for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/kubernetes.py` at line 268, The zip over
containers should be made strict to catch length mismatches: change the loop
using zip(spec.containers, pod_spec.containers) to call zip(spec.containers,
pod_spec.containers, strict=True) so any future divergence between
spec.containers and pod_spec.containers raises immediately; update the loop
where pod_spec and spec containers are iterated and ensure the runtime target
supports Python 3.10+ (or add a compatibility guard if not).
scripts/k8s-tests/run_sft_k8s_real.py (1)

30-102: Training script is embedded as a large multi-line string — fragile and hard to maintain.

The entire training script (lines 51–101) is constructed as a concatenated string. This is difficult to read, test independently, and modify. While acceptable for a test/validation script, consider extracting it to a separate file if this script grows further.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-tests/run_sft_k8s_real.py` around lines 30 - 102, The train_cmd
in main() embeds a large multi-line Python training script into a shell heredoc
(writing to /tmp/train_pipeline.py), which is fragile and hard to maintain;
extract that embedded script into a standalone Python file (e.g.,
sft_train_pipeline.py) and update train_cmd to run that file instead (keep the
same environment exports and torchrun invocation), or alternatively load the
script text from a separate resource at runtime and inject it into
/tmp/train_pipeline.py; update references to the heredoc and
/tmp/train_pipeline.py in train_cmd so main() uses the external script
(train_cmd variable) while preserving the torchrun flags (--nproc_per_node,
--nnodes, --node_rank, --master_addr, --master_port) and any environment
variable usages (NODE_RANK, MASTER_ADDR).
nemo_skills/pipeline/backends/__init__.py (1)

29-66: register_backend decorator is the public extension point but is not re-exported.

Users who want to register a custom backend must reach into nemo_skills.pipeline.backends.factory directly, bypassing the unified public surface. Since BackendFactory and get_backend are already re-exported, register_backend belongs alongside them.

♻️ Proposed change
-from nemo_skills.pipeline.backends.factory import BackendFactory, get_backend
+from nemo_skills.pipeline.backends.factory import BackendFactory, get_backend, register_backend

 __all__ = [
     ...
     # Factory
     "BackendFactory",
     "get_backend",
+    "register_backend",
     ...
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/backends/__init__.py` around lines 29 - 66, The public
API currently re-exports BackendFactory and get_backend but omits the
register_backend decorator; update nemo_skills.pipeline.backends.__init__ by
importing register_backend from nemo_skills.pipeline.backends.factory and adding
"register_backend" to the __all__ list so users can register custom backends via
the unified public surface alongside BackendFactory and get_backend.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster_configs/kubernetes/rbac.yaml`:
- Around line 62-65: The RBAC entry for batch/jobs in rbac.yaml grants the
"patch" verb (apiGroups: ["batch"], resources: ["jobs"], verbs includes "patch")
but the simplified example in docs/kubernetes-guide.md only lists "create,
delete, get, list, watch"; update the docs example to include "patch" or add a
brief note that the full RBAC (cluster_configs/kubernetes/rbac.yaml) includes
additional verbs such as "patch" so the guide and the manifest remain
consistent.

In `@cluster_configs/kubernetes/README.md`:
- Line 7: The README statement "Kubernetes cluster 1.24+ (for native job
dependencies)" is misleading because the pipeline uses declarative.py in
sequential mode rather than Kubernetes native job dependency features; update
the README to either remove the parenthetical or change it to state the actual
minimum Kubernetes version required (or that sequential submission is used) and
reference that declarative.py uses sequential mode so readers aren’t led to
expect native K8s job dependencies.

In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 131-141: The fallback path currently ignores a false return from
fallback_backend.health_check() so add explicit handling: after calling
BackendFactory._create_backend(...) check the boolean result and if False call
LOG.error with a descriptive message including fallback_executor and any
available state, then raise a specific exception (or re-raise the primary
exception wrapped with context) so the outer handler reports that the fallback
was attempted and failed; ensure you reference BackendFactory._create_backend,
fallback_backend.health_check(), LOG.error and the RuntimeError re-raise so the
new error path surfaces in logs and exceptions.
- Around line 103-108: The check after fetching cluster_config["executor"]
currently raises a misleading message when the key exists but the value is
falsy; update the guard around executor (the local variable) to validate the
value (e.g., ensure it's not None/empty and is a string) and raise a clearer
ValueError such as "cluster_config['executor'] must be a non-empty string" (or
similar) so callers know the value is invalid rather than the key missing; also
ensure this validation happens before calling executor.lower() to avoid
AttributeError.
- Around line 113-141: The current broad except around
BackendFactory._create_backend swallows ValueError for unknown executors and
re-raises RuntimeError; change the error handling so ValueError from
BackendFactory._create_backend (or any explicit "unknown executor" error) is
allowed to propagate unchanged: either remove ValueError from the general except
or add an early except ValueError: raise to re-raise it immediately before
attempting fallback logic and before converting other exceptions into
RuntimeError; keep the existing fallback flow for non-ValueError exceptions (use
BackendFactory._create_backend, backend.health_check(), fallback_executor, and
the final RuntimeError only for other failures).

In `@nemo_skills/pipeline/backends/local.py`:
- Around line 138-145: The for-loop in the method using enumerate is introducing
an unused loop variable i; change the loop to either drop enumerate (for (proc,
container_spec) in zip(processes, spec.containers):) or rename i to _ so the
unused variable warning is resolved; update the loop that creates threads
calling self._collect_logs and appends to job._log_threads accordingly.
- Around line 191-197: The catch-all except in _collect_logs (and similarly in
health_check) is too broad and may hide programming errors; restrict the handler
to the specific I/O/subprocess exceptions expected instead. Update _collect_logs
to only catch IOError/OSError (and BrokenPipeError if relevant) when iterating
proc.stdout and log the error with container_name and exception details; update
health_check to catch only subprocess.SubprocessError,
subprocess.CalledProcessError, and OSError where subprocess calls are made.
Ensure you reference the same variables (job, container_name, proc, LOG) and
remove the bare `except Exception` blocks so unexpected errors propagate.
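
A narrowed handler for the health-check path might look like this sketch; `probe_cmd` and the function boundary are assumptions, not the file's actual code:

```python
import subprocess


def health_check(probe_cmd: list[str]) -> bool:
    """Run a probe command, catching only the subprocess/OS errors we
    actually expect; programming errors still propagate."""
    try:
        subprocess.run(probe_cmd, check=True, capture_output=True)
        return True
    except (subprocess.SubprocessError, OSError):
        # Covers CalledProcessError, TimeoutExpired, and missing binaries
        # (FileNotFoundError is an OSError subclass).
        return False
```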

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Around line 99-100: The Dockerfile uses the wrong source path casing: replace
the COPY instruction that references "skills/" with "Skills/" so it matches the
init container build context (change the COPY skills/ /opt/NeMo-Skills/ line to
use COPY Skills/ /opt/NeMo-Skills/); update the COPY line in
Dockerfile.nemo-rl-k8s so the build context prepared by build-nemo-rl.yaml
(which clones to /context/Skills/) is copied correctly.

In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 186-202: The copy pod can leak if kubectl wait or kubectl cp
fails; inside the build_image function add a trap that always deletes the copy
pod (referencing the pod name pattern "copy-${name}") on EXIT or ERR so kubectl
delete pod -n "$NAMESPACE" --ignore-not-found is executed even when kubectl wait
or kubectl cp fail; ensure the trap cleans up using the same pod identifier and
namespace variables and is set immediately after creating the pod (before
kubectl wait) so cleanup runs on any early exit.

In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml`:
- Around line 125-132: The DNS wait loop that runs getent hosts
sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local may exit after
60 retries without indicating failure, then proceeds to call torchrun; modify
the script so after the for loop you check whether the hostname resolved (e.g.,
test last getent result or attempt a final getent) and if it failed, print a
clear error including the hostname and exit non‑zero instead of invoking
torchrun --nproc_per_node=2 --nnodes=2 (which uses NODE_RANK, MASTER_ADDR and
/scripts/train.py); if the host resolved, continue to run torchrun as before.
- Around line 100-151: The pod spec for the sft-multi-node job is missing a
securityContext which causes scanners to flag running as root/allowing privilege
escalation; add a minimal pod-level securityContext (e.g., runAsNonRoot: true,
runAsUser: 1000) and a container-level securityContext inside the trainer
container (e.g., allowPrivilegeEscalation: false, readOnlyRootFilesystem: true,
capabilities: drop: [ "ALL" ]) to the spec referenced by the pod template (look
for spec -> containers -> name: trainer and the surrounding pod spec) so the
manifest enforces non-root execution and disables privilege escalation while
keeping the test runnable.

In `@scripts/k8s-tests/sft_smoke_test.py`:
- Around line 38-45: Remove the ineffective "noqa: E402" directives from the
import statements — specifically the import of nemo_run and the from-import of
nemo_skills.pipeline.utils.declarative (the names Command, CommandGroup,
HardwareConfig, Pipeline) — since RUF100 indicates E402 isn't enabled; simply
delete the ",  # noqa: E402" portions so the imports remain unchanged but
without the unused noqa markers.

---

Duplicate comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Add language specifiers to the three fenced code blocks
that contain plain log/ASCII output: the block showing "Job name
'Qwen2.5_Math_7B' is not Kubernetes-compliant..." should begin with ```text, the
block containing "Pipeline has job dependencies but sequential=False..." should
begin with ```text, and the ASCII diagram that starts with
"┌─────────────────────────" should also begin with ```text; update each opening
fence accordingly so the blocks are treated as plain text/ASCII in the rendered
docs.

In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 975-1020: The get_logs function currently only reads logs from
pods.items[0], dropping logs from other pods in multi-node jobs; update get_logs
(in nemo_skills.pipeline.backends.kubernetes.KubernetesBackend) to either accept
an optional pod_index (int) or pod_name parameter and use it to select the
correct pod, or iterate over all pods.items to yield logs from each pod (for
follow=True you can stream each pod separately or document that follow streams a
single pod by index). Replace references to pods.items[0] with logic that: 1)
validates the requested index/pod name against pods.items and 2) loops over
pods.items (or picks the chosen pod) before calling read_namespaced_pod_log,
ensuring both the follow and non-follow branches handle multiple pods correctly
and include the pod identity in yielded lines for clarity.

In `@nemo_skills/pipeline/nemo_rl/sft.py`:
- Around line 557-580: The Kubernetes branch in the SFT runner currently drops
Slurm-specific params without feedback; update the conditional that routes to
_run_sft_kubernetes to detect if any Slurm-only arguments (qos, time_min,
exclusive, sbatch_kwargs, reuse_code, reuse_code_exp, _reuse_exp,
_task_dependencies) were provided and either log a warning via
LOG.warning (LOG.warn is deprecated) explaining these params are ignored when executor ==
"kubernetes" or raise a ValueError (follow project convention) instead of
silently ignoring them; reference the dispatcher code that checks
cluster_config.get("executor") and the _run_sft_kubernetes call to add this
validation and messaging before returning.
- Around line 614-639: The Slurm path add_task calls that perform CPU-only
conversion/averaging (e.g., the add_task that runs get_checkpoint_convert_cmd
with task_name f"{expname}-convert-final-ckpt" and the other add_task calls that
invoke get_checkpoint_convert_cmd or the averaging command) are incorrectly
passing num_gpus=num_gpus; change those to num_gpus=0 so the Slurm tasks mirror
the K8s path. Locate the add_task invocations that run checkpoint
conversion/averaging (identify by get_checkpoint_convert_cmd,
get_average_checkpoints_cmd or task_names containing "convert" or "average") and
set num_gpus=0, leaving other parameters unchanged.
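
The Slurm-param guard suggested for the Kubernetes branch could be sketched as follows; the function and parameter names are illustrative (the real parameter list and logging setup live in sft.py):

```python
import logging

LOG = logging.getLogger(__name__)


def warn_ignored_slurm_params(executor: str, **slurm_only) -> list[str]:
    """Report Slurm-only arguments that the Kubernetes path would drop."""
    # Treat None/False as "not provided" defaults.
    ignored = sorted(k for k, v in slurm_only.items() if v not in (None, False))
    if executor == "kubernetes" and ignored:
        LOG.warning(
            "executor=kubernetes ignores Slurm-only arguments: %s",
            ", ".join(ignored),
        )
        return ignored
    return []
```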

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 693-696: The current loop over job_dependencies silently logs and
skips string dependencies (the dep variable) which can break execution order on
Kubernetes; replace the LOG.warning branch in the loop that inspects
job_dependencies (where dep is checked with isinstance(dep, str)) so it raises a
clear exception (e.g., ValueError or a custom exception) instead of skipping,
including the dependency name and guidance that external string dependencies are
unsupported on Kubernetes; ensure the change surfaces in the same function
handling job_dependencies so callers immediately fail-fast when a string
dependency is provided.

In `@scripts/k8s-tests/image-build/build-and-load.sh`:
- Around line 19-26: Add a trap-based cleanup similar to the validation scripts:
implement a cleanup function (e.g., named cleanup) in build-and-load.sh that
deletes any intermediate pods created by this script (targeting pod names/labels
such as build-prep, build-nemo-rl-local, cache-copy and load-*) and any
temporary resources, then register it with trap 'cleanup' on EXIT INT TERM;
update places where pods are created (the build/run sections) to ensure they
either set identifying labels or predictable names so the cleanup can reliably
find and delete them.

In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Around line 38-39: Replace the insecure repo addition and deprecated key
import: stop using the echo ... /etc/apt/sources.list.d/nvidia-devtools.list
command that writes an http:// URL and stop using apt-key adv --fetch-keys ...
7fa2af80.pub; instead install the cuda-keyring package and add the NVIDIA
devtools repository entry with an https:// URL that includes a signed-by=
reference to the keyring file (so the sources list entry references the
installed cuda-keyring key), removing any apt-key usage and legacy key fetches.

In `@scripts/k8s-tests/image-build/import-image.yaml`:
- Around line 88-89: The DaemonSet import script currently ends with "sleep 10",
which causes the pod to exit after 10s and Kubernetes to restart it indefinitely
(violating the pass criteria "no restarts"); update the import container to use
"sleep infinity" instead of "sleep 10" so the pod stays alive after the import,
and ensure you add a cleanup step to delete the DaemonSet after verification (or
convert this flow to a per-node Job pattern) so the import does not run
repeatedly.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 101-110: The script incorrectly treats non-zero RAY_EXIT from the
blocking Ray worker as fatal; update the worker start/exit handling around the
ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
invocation so the non-zero exit is suppressed per runbook (e.g., add the
shell-safe suppression like appending || true to the ray start command or remove
the RAY_EXIT check/exit path), and ensure the echo "Worker done (Ray head shut
down)." still runs; reference RAY_EXIT, MASTER_ADDR, NUM_GPUS and the ray start
--block command when making the change.

In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 51-52: The pipeline currently silences pip install errors by
redirecting stderr to /dev/null in the train_cmd string; update the train_cmd
(the string assembled in train_cmd) to remove the "2>/dev/null" redirection so
pip install errors are preserved in container logs (or alternatively redirect
stderr to a log file or tee it) and ensure the command still fails on non-zero
exit so diagnostic output remains available for debugging.
- Around line 167-182: The loop over result currently only collects logs and
fails on JobStatus.FAILED, letting indeterminate or cancelled statuses (UNKNOWN,
CANCELLED) pass as success; update the logic in the for loop that calls
backend.wait_for_completion(handle, ...) so that any status not equal to
JobStatus.SUCCEEDED is treated as a non-success: always collect logs via
backend.get_logs(handle) for SUCCEEDED/FAILED/UNKNOWN/CANCELLED, call
backend.cleanup(handle) as now, and exit with non-zero (sys.exit(1)) for any
status != JobStatus.SUCCEEDED (i.e., include JobStatus.UNKNOWN and
JobStatus.CANCELLED alongside FAILED). Ensure you reference the same symbols:
result, backend.wait_for_completion, JobStatus, backend.get_logs,
backend.cleanup, and sys.exit.

In `@scripts/k8s-tests/sft_smoke_test.py`:
- Around line 213-222: wait_for_completion can return non-terminal statuses
(RUNNING/PENDING/UNKNOWN) on timeout but the loop currently treats them as
success; update the logic after calling backend.wait_for_completion(handle,
timeout=1800) to treat any status not in (JobStatus.SUCCEEDED, JobStatus.FAILED)
as a failure: always print and collect logs via backend.get_logs(handle) for
non-SUCCEEDED outcomes and call sys.exit(1) when status != JobStatus.SUCCEEDED
(i.e., for FAILED or timeout/non-terminal statuses). Ensure you reference the
existing symbols backend.wait_for_completion, JobStatus, backend.get_logs and
sys.exit in the updated branch so timeouts no longer pass silently.
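
The terminal-status rule both smoke tests need reduces to a tiny helper. A sketch with an assumed JobStatus enum shape (the real enum lives in the backends package):

```python
from enum import Enum


class JobStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    RUNNING = "running"
    PENDING = "pending"
    UNKNOWN = "unknown"


def exit_code_for(status: JobStatus) -> int:
    """Only SUCCEEDED counts as success; FAILED and any non-terminal
    status returned after a timeout all exit non-zero."""
    return 0 if status is JobStatus.SUCCEEDED else 1
```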

---

Nitpick comments:
In `@docs/kubernetes-nemo-rl-sft-runbook.md`:
- Around line 75-84: The three fenced code blocks showing run output (the blocks
containing "▶ Setting up data... … === REAL NEMO-RL SFT COMPLETED ===" and the
other two similar expected-output blocks) are missing a language identifier and
trigger markdownlint MD040; update each opening triple-backtick to use ```text
so the blocks are explicitly plain text (e.g., replace ``` with ```text for the
fenced blocks that contain the expected output sequences).

In `@docs/kubernetes-sft-runbook.md`:
- Around line 44-49: Replace the fragile "sleep 15" pattern with an explicit
kubectl wait for the driver-check pod; after running kubectl run driver-check,
call kubectl wait --for=condition=ContainersReady pod/driver-check --timeout=30s
(or --for=condition=Ready if you prefer) and then run kubectl logs driver-check
and kubectl delete pod driver-check --ignore-not-found to ensure logs are
captured reliably; update references to kubectl run driver-check, kubectl logs
driver-check, and kubectl delete pod driver-check accordingly.

In `@nemo_skills/pipeline/backends/__init__.py`:
- Around line 29-66: The public API currently re-exports BackendFactory and
get_backend but omits the register_backend decorator; update
nemo_skills.pipeline.backends.__init__ by importing register_backend from
nemo_skills.pipeline.backends.factory and adding "register_backend" to the
__all__ list so users can register custom backends via the unified public
surface alongside BackendFactory and get_backend.

In `@nemo_skills/pipeline/backends/base.py`:
- Around line 116-121: ContainerSpec.__post_init__ currently checks name and
image but not command; update __post_init__ to validate that the command
attribute is present and non-empty (e.g., not None and len(command) > 0 or
command != []), and raise a ValueError with a clear message like "Container
command cannot be empty" when validation fails; locate the ContainerSpec class
and modify its __post_init__ method to include this check alongside the existing
name and image checks.
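
The suggested check could look like this trimmed sketch; the real ContainerSpec carries more fields:

```python
from dataclasses import dataclass, field


@dataclass
class ContainerSpec:
    """Trimmed sketch of the dataclass with the added command validation."""
    name: str
    image: str
    command: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.name:
            raise ValueError("Container name cannot be empty")
        if not self.image:
            raise ValueError("Container image cannot be empty")
        if not self.command:
            raise ValueError("Container command cannot be empty")
```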

In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Line 268: The zip over containers should be made strict to catch length
mismatches: change the loop using zip(spec.containers, pod_spec.containers) to
call zip(spec.containers, pod_spec.containers, strict=True) so any future
divergence between spec.containers and pod_spec.containers raises immediately;
update the loop where pod_spec and spec containers are iterated and ensure the
runtime target supports Python 3.10+ (or add a compatibility guard if not).

In `@nemo_skills/pipeline/backends/slurm.py`:
- Around line 128-156: The code stores a `task` (from add_task) into
`self._jobs` while inside the `with get_exp(...)` context which may become stale
after the context closes; update future implementations of `get_status`,
`wait_for_completion`, and `cancel_job` to not rely on this stored `task`
reference (symbols: task, get_exp, self._jobs) — instead query Slurm directly
(e.g., via `sacct`, `squeue`, or re-open the experiment with `get_exp` when
needed) or store only stable identifiers (experiment name / Slurm job id) in
`self._jobs` and use those to retrieve live status/actions.

In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 63-73: The docstring for the timeout parsing utility (in the
function/method that describes SLURM and K8s formats—look for the parse_timeout
or similar function in cluster.py) repeats the SLURM format list twice; remove
the redundant second repetition (the lines that repeat "Time format for SLURM:
..." and the list that follows) while keeping the initial "Supports both:"
section and retaining the SLURM reference link
(https://slurm.schedmd.com/sbatch.html#OPT_time) so the docstring only lists the
formats once and remains accurate and concise.

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 428-456: The _make_unique_job_name function currently uses
time-based suffix int(time.time()) % 100000 which can collide for concurrent
launches or runs ~27.7h apart; update the suffix to include an additional
uniqueness component (e.g., append the process id from os.getpid() or a short
random hex from secrets.token_hex) combined with the timestamp so suffix becomes
e.g. "-{timestamp}{pid_or_rand}". Ensure you adjust suffix_len calculation and
the max_base_len truncation logic accordingly so the final name still respects
max_length and strip trailing hyphens from the truncated base_name as before.
- Around line 826-832: The code incorrectly maps HardwareConfig.num_tasks
(Slurm-style "tasks per node") to ResourceSpec.cpus, conflating processes with
CPU core requests; update the mapping in the block that constructs resources
(where group.hardware or HardwareConfig() is used and ResourceSpec(...) is
created) to use a proper CPU source: add and prefer a new HardwareConfig.cpus
field if present, otherwise compute a safer default (e.g., max(hardware.cpus if
set else 0, hardware.num_tasks, 4) for GPU workloads or another sensible
heuristic), and set ResourceSpec.cpus to that computed value instead of
hardware.num_tasks so CPU core requests reflect cores, not MPI tasks.
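
The collision-resistant suffix suggested for _make_unique_job_name, sketched (the real function also handles Kubernetes name sanitization, which is omitted here):

```python
import secrets
import time


def make_unique_job_name(base_name: str, max_length: int = 63) -> str:
    """Timestamp plus random hex so concurrent launches (or runs ~27.7h
    apart, when int(time.time()) % 100000 wraps) cannot collide."""
    suffix = f"-{int(time.time()) % 100000}{secrets.token_hex(4)}"
    max_base_len = max_length - len(suffix)
    # Truncate the base and strip trailing hyphens before appending.
    return base_name[:max_base_len].rstrip("-") + suffix
```

The 63-character default matches the Kubernetes DNS label limit on object names.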

In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 133-137: The return type for validate_result is currently
unparameterized as tuple; update it to a specific Tuple[...] annotation (e.g.,
Tuple[bool, Optional[str]] if the function returns a success boolean and an
optional message) by importing Tuple from typing if necessary and replacing the
signature def validate_result(...) -> tuple: with def validate_result(...) ->
Tuple[bool, Optional[str]]: (or the exact Tuple[...] that matches the actual
return values in validate_result).
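
A parameterized annotation for that signature; the (bool, message) shape used here is an assumption about the real function's return values:

```python
from typing import Optional, Tuple


def validate_result(found_allreduce: bool, log_text: str) -> Tuple[bool, Optional[str]]:
    """Return (success, error message); None message on success."""
    if found_allreduce:
        return True, None
    return False, f"no all-reduce marker found in {len(log_text)} bytes of logs"
```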

In `@scripts/k8s-tests/image-build/build-and-load.sh`:
- Around line 128-129: The kaniko executor image is using the mutable tag
"gcr.io/kaniko-project/executor:latest"; update the image reference in the build
job (the "kaniko" container definition) to a specific version tag or immutable
digest (for example a released version tag or an `@sha256`:... digest) to ensure
reproducible builds and prevent breaking changes from "latest". Locate the
"kaniko" container entry in build-and-load.sh and replace the image value with a
pinned tag or digest.
- Around line 234-255: The loader pod created by the kubectl run command (named
"load-$NODE") is set to privileged: true and mounts the host root via volume
"host-root" (mountPath "/host"), which is a high-risk security surface; add a
concise inline comment above this pod definition explaining that privileged mode
and the host-root mount are required to allow running ctr import against the
host containerd socket (volume "containerd-sock") but they increase attack
surface and should only be used in trusted test clusters, and note any
mitigation (readOnly host-root, namespace -n "$NAMESPACE", and why those are
insufficient); reference the kubectl run block, the "loader" container,
securityContext.privileged, and the "host-root"/"containerd-sock" volumes in the
comment.

In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml`:
- Line 53: Replace mutable "latest" image tags with fixed versioned tags for
reproducibility: update the image fields referencing "alpine/git:latest" and
"gcr.io/kaniko-project/executor:latest" to specific, pinned versions (e.g.,
"alpine/git:v2.45.2" and "gcr.io/kaniko-project/executor:v1.23.2") wherever
those image strings appear in the manifest so builds use immutable, reproducible
images.

In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 36-37: Remove the dead variables SCRIPT_DIR and REPO_ROOT from
kaniko-build.sh: these are defined but never used (SCRIPT_DIR only serves to
compute REPO_ROOT). Delete the two lines that set SCRIPT_DIR and REPO_ROOT (the
symbols SCRIPT_DIR and REPO_ROOT) so there is no unused-variable noise and
ShellCheck SC2034 is resolved; if the script actually needs the repository path
later, instead compute and use it where required rather than defining them
unused at the top.

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 48-49: The env entry MASTER_ADDR=${MASTER_ADDR} is a redundant
self-assignment; remove this line from the container/env block (or replace it
with a clarifying comment) so the pod uses the existing MASTER_ADDR provided
earlier, and ensure any intended headless-service DNS is explicitly assigned if
needed (search for MASTER_ADDR and the env block in the manifest to locate the
exact spot).

In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml`:
- Around line 43-73: The run_sft.sh script redundantly exports NCCL_DEBUG=INFO
while the container environment already sets NCCL_DEBUG; remove the export line
from the run_sft.sh content (the line "export NCCL_DEBUG=INFO") to avoid
duplicate environment configuration, keeping the container env entry ({name:
NCCL_DEBUG, value: "INFO"}) as the single source of truth.

In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 197-216: The dry-run path calls the private
pipeline._convert_groups_to_job_spec method (after pipeline.run(dry_run=True)),
which couples this smoke test to an internal API; add a short inline comment
immediately above the call to _convert_groups_to_job_spec explaining that this
is intentionally using a private/internal method and warning maintainers that
changes to that method name or signature will break the test, and optionally
mention why it's safe here (validation-only) so future developers understand the
dependency.
- Around line 236-252: Replace the hardcoded wait timeout (900) passed to
backend.wait_for_completion with a CLI-configurable value: add a command-line
argument (e.g., parser.add_argument('--timeout', type=int, default=900)) where
args is built, then call backend.wait_for_completion(handle,
timeout=args.timeout); keep the lazy imports (get_backend, JobStatus) inside the
function as they are to avoid import-time failures.

In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 30-102: The train_cmd in main() embeds a large multi-line Python
training script into a shell heredoc (writing to /tmp/train_pipeline.py), which
is fragile and hard to maintain; extract that embedded script into a standalone
Python file (e.g., sft_train_pipeline.py) and update train_cmd to run that file
instead (keep the same environment exports and torchrun invocation), or
alternatively load the script text from a separate resource at runtime and
inject it into /tmp/train_pipeline.py; update references to the heredoc and
/tmp/train_pipeline.py in train_cmd so main() uses the external script
(train_cmd variable) while preserving the torchrun flags (--nproc_per_node,
--nnodes, --node_rank, --master_addr, --master_port) and any environment
variable usages (NODE_RANK, MASTER_ADDR).

In `@scripts/k8s-tests/smoke_test_single_node.sh`:
- Around line 202-213: The current wait treats any non-zero exit as "failed" but
only checks .status.failed; update the post-wait logic to distinguish an actual
job failure vs a timeout/still-running job by querying the job status with
kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath and examining
.status.failed and .status.active (or .status.conditions[*].type for
"Failed"/"Complete"). Concretely: after the kubectl wait block, if
JOB_STATUS="failed" run a kubectl get job and if .status.failed > 0 set a clear
failure message, else if .status.active > 0 report "Job timed out/still
running", otherwise inspect .status.conditions for a Failed condition to report
failure; use the existing JOB_NAME, JOB_STATUS and FAILED variables and the same
namespace variable to implement these checks.

In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh`:
- Around lines 196-207: The script's kubectl wait call blocks for the full
${TIMEOUT} if the Job fails, because it only waits for condition=complete. Note
that kubectl wait accepts a single --for condition (repeating the flag keeps
only the last value), so to return as soon as the Job either completes or
fails, run two waits in parallel for the same JOB_NAME and TIMEOUT, one with
--for=condition=complete and one with --for=condition=failed, and proceed when
the first one returns. After the wait, check the Job status (inspect the job
object or use kubectl get job/$JOB_NAME -o jsonpath='{.status}') to distinguish
failure from timeout, then print pod logs and exit appropriately.
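
A minimal sketch of the parallel-wait pattern (the function name is an
assumption; it requires bash for `wait -n` and is only defined here, not
executed, since it needs kubectl and a live cluster):

```shell
# Sketch: return as soon as the Job completes OR fails by racing two kubectl
# waits, then report which terminal condition is true. Assumes bash >= 4.3
# for `wait -n`. Defined only; not invoked here (needs a cluster).
wait_job_done() {
  local job="$1" ns="$2" timeout="$3"
  kubectl wait --for=condition=complete "job/${job}" -n "${ns}" --timeout="${timeout}" &
  local complete_pid=$!
  kubectl wait --for=condition=failed "job/${job}" -n "${ns}" --timeout="${timeout}" &
  local failed_pid=$!
  wait -n                                   # returns when either wait exits
  kill "${complete_pid}" "${failed_pid}" 2>/dev/null || true
  # Terminal condition(s) with status "True": Complete, Failed, or empty on timeout.
  kubectl get "job/${job}" -n "${ns}" \
    -o jsonpath='{.status.conditions[?(@.status=="True")].type}'
}
```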

In `@scripts/k8s-tests/validate_sft_e2e.sh`:
- Around lines 303-311: The failure branch that handles a non-completing job
should capture and append the pod logs to the same $LOG_FILE used by the
success path. Update the block that runs kubectl wait --for=condition=complete
... for job/$JOB_NAME and then fetches POD via kubectl get pods so that it
writes the diagnostic output into $LOG_FILE (creating the file if necessary):
redirect the kubectl logs output and any echo messages into $LOG_FILE (still
printing to stdout if desired) so post-mortem artifacts are preserved for
JOB_NAME and NAMESPACE. Use the same $LOG_FILE variable and preserve the
current --tail=50 behavior when saving logs.
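
A small sketch of the tee-based capture (LOG_FILE here is a temp-file stand-in
for the script's existing variable; the kubectl line is left as a comment since
it needs a cluster):

```shell
# Sketch: write failure diagnostics to stdout AND append them to $LOG_FILE.
LOG_FILE="$(mktemp)"                      # stand-in for the script's $LOG_FILE
log_diag() {
  echo "$1" | tee -a "${LOG_FILE}"        # print and persist
}
log_diag "ERROR: job ${JOB_NAME:-unknown-job} in ${NAMESPACE:-default} did not complete"
# The pod logs in the failure branch would be captured the same way (assumed form):
#   kubectl logs "$POD" -n "$NAMESPACE" --tail=50 2>&1 | tee -a "$LOG_FILE"
```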

- fix backend validation/error handling and resource mapping edge cases

- harden k8s manifests/scripts (securityContext, DNS fail-fast, cleanup, pinned images)

- update runbooks/examples to 25.04 and remove dgx-29 references in docs

- full pre-commit passed; real 2-node x 2-GPU MNMG validation passed

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- add missing docstrings in changed pipeline backends/utils and k8s test scripts

- update nemo-rl k8s Dockerfile to run as non-root user in final image

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- remove hardcoded nemo-skills/nemo-rl:latest in build-and-load.sh

- make immutable tag builds/loads deterministic across nodes

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
@esnvidia
Author

Validation rerun with immutable image tag (no shortcut path)

  • Updated scripts/k8s-tests/image-build/build-and-load.sh so local Kaniko build uses IMAGE_TAG (removed hardcoded
    :latest).
  • Rebuilt and loaded image with immutable tag:
    • nemo-skills/nemo-rl:pr1271-20260222-235229
    • Imported on: dgx-29, dgx-30, dgx-31
    • Validation pod succeeded (nemo_skills OK: 0.7.0)
  • Real MNMG SFT pipeline check (2 nodes x 2 GPUs) with that exact tag:
    • PYTHONPATH=. .venv/bin/python scripts/k8s-tests/run_sft_k8s_real.py --num-nodes 2 --num-gpus 2 --image nemo-skills/nemo-rl:pr1271-20260222-235229
    • Job: sft-train-622846688
    • Result: succeeded
    • Pass signal: REAL SFT K8S PIPELINE: ALL JOBS COMPLETED
  • pre-commit run --all-files: all hooks passed.

