Add Kubernetes multi-node SFT backend validation assets and docs #1271

esnvidia wants to merge 13 commits into NVIDIA-NeMo:main from
Conversation
📝 Walkthrough

Adds a ComputeBackend abstraction and factory plus Kubernetes, Slurm, and Local backend implementations; integrates Kubernetes routing into the declarative pipeline; introduces extensive Kubernetes docs, cluster manifests, image-build and test tooling, pre-commit K8s validation hooks, and related unit tests and utilities.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as Pipeline / User
    participant Factory as BackendFactory
    participant K8s as KubernetesBackend
    participant API as Kubernetes API
    participant DNS as DNS / Headless Service
    participant Pod as Job Pod
    User->>Factory: get_backend(cluster_config)
    Factory-->>K8s: initialize(cluster_config)
    K8s->>API: load kubeconfig / health_check
    API-->>K8s: OK
    Factory-->>User: backend instance
    User->>K8s: submit_job(JobSpec)
    alt multi-node
        K8s->>API: create Headless Service
        API-->>K8s: Service created
    end
    K8s->>API: create Job / Indexed Job
    API-->>K8s: Job created
    K8s-->>User: JobHandle
    loop monitor
        User->>K8s: get_status / wait_for_completion
        K8s->>API: query Job / Pod status
        API-->>K8s: status
        K8s-->>User: JobStatus
    end
    Pod->>DNS: resolve MASTER_ADDR (headless service)
    DNS-->>Pod: rank-0 IP
    Pod->>Pod: init distributed group (NCCL)
    User->>K8s: get_logs(handle, follow=True)
    K8s->>API: stream pod logs
    API-->>K8s: log lines
    K8s-->>User: log iterator
    User->>K8s: cleanup(handle)
    K8s->>API: delete Job
    alt multi-node
        K8s->>API: delete Headless Service
    end
    API-->>K8s: resources deleted
```
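The submit/cleanup lifecycle in the diagram can be sketched as a toy model. This is an illustrative stand-in only: the class, method, and dataclass names below mirror the diagram's labels, not the real `KubernetesBackend` API, and the "API calls" are just recorded tuples.

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    name: str
    num_nodes: int = 1

@dataclass
class JobHandle:
    job_name: str
    extra_resources: list = field(default_factory=list)

class KubernetesBackendSketch:
    """Toy stand-in for a Kubernetes backend; records would-be API calls."""

    def __init__(self):
        self.api_calls = []

    def submit_job(self, spec: JobSpec) -> JobHandle:
        handle = JobHandle(job_name=spec.name)
        if spec.num_nodes > 1:
            # Multi-node: a Headless Service gives each pod a stable DNS
            # name so the rank-0 pod can serve as MASTER_ADDR.
            self.api_calls.append(("create", "headless-service"))
            handle.extra_resources.append("headless-service")
        kind = "indexed-job" if spec.num_nodes > 1 else "job"
        self.api_calls.append(("create", kind))
        return handle

    def cleanup(self, handle: JobHandle) -> None:
        self.api_calls.append(("delete", "job"))
        for res in handle.extra_resources:
            self.api_calls.append(("delete", res))

backend = KubernetesBackendSketch()
handle = backend.submit_job(JobSpec(name="sft-test", num_nodes=2))
backend.cleanup(handle)
print(backend.api_calls)
```

The key design point the diagram encodes is that the Headless Service is created only for multi-node jobs and torn down alongside the Job.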
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (18)
scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s (1)

11-12: ⚠️ Potential issue | 🟡 Minor

`NEMO_SKILLS_PATH` build arg documented but never declared or used. The header comment at line 12 mentions `NEMO_SKILLS_PATH` as a build arg, but no `ARG NEMO_SKILLS_PATH` is declared. Line 100 hardcodes `COPY skills/ /opt/NeMo-Skills/`. Either remove the misleading comment or wire up the ARG.

Suggested fix (remove misleading comment)

```diff
 # Build args:
 #   NEMO_RL_COMMIT: NeMo-RL commit to build from (default: pinned)
-#   NEMO_SKILLS_PATH: path to local NeMo-Skills code in build context (default: /skills)
```

Also applies to: 99-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s` around lines 11-12: the header documents a build arg `NEMO_SKILLS_PATH` that isn't declared or used; either remove the misleading comment or wire it up by declaring `ARG NEMO_SKILLS_PATH=/skills` before use and replacing `COPY skills/ /opt/NeMo-Skills/` with `COPY ${NEMO_SKILLS_PATH} /opt/NeMo-Skills/` so the build arg is recognized.

scripts/k8s-tests/validate_sft_e2e.sh (1)
177-178: ⚠️ Potential issue | 🟡 Minor

Checkpoint numbering mismatch between comments and print statements. The inline comments and the `print()` labels are out of sync:

| Comment (line) | Says | Print output |
|---|---|---|
| 177 | "Validation checkpoint 3" | Checkpoint 4: SFT training |
| 208 | "Validation checkpoint 4" | Checkpoint 5: NCCL all-reduce |

This will confuse anyone correlating log output with the source.

Suggested fix

```diff
-# === Validation checkpoint 3: SFT-style training ===
-print(f"\n=== Checkpoint 4: SFT training (Rank {rank}) ===")
+# === Validation checkpoint 4: SFT-style training ===
+print(f"\n=== Checkpoint 4: SFT training (Rank {rank}) ===")
```

```diff
-# === Validation checkpoint 4: NCCL all-reduce ===
-print(f"\n=== Checkpoint 5: NCCL all-reduce (Rank {rank}) ===")
+# === Validation checkpoint 5: NCCL all-reduce ===
+print(f"\n=== Checkpoint 5: NCCL all-reduce (Rank {rank}) ===")
```

Also applies to: 208-209
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/validate_sft_e2e.sh` around lines 177-178: the inline checkpoint comments and the `print()` labels are out of sync; update either the comments ("Validation checkpoint 3"/"Validation checkpoint 4") or the corresponding print messages so they match, and apply the same fix to both the SFT training block and the subsequent NCCL all-reduce block.

scripts/k8s-tests/sft_smoke_test.py (1)
200-203: ⚠️ Potential issue | 🟡 Minor

Missing `None` check on the `pipeline.run()` result before iterating. If `pipeline.run()` returns `None` (e.g., backend error), `result.items()` on line 202 will raise `AttributeError`. The sibling script `pipeline_smoke_test.py` (line 221) guards against this.

Suggested fix

```diff
 result = pipeline.run(dry_run=False)
+if result is None:
+    print("ERROR: Pipeline returned None (unexpected)")
+    sys.exit(1)
 print(f"\nSubmitted {len(result)} job(s)")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/sft_smoke_test.py` around lines 200-203: the code calls `pipeline.run(dry_run=False)` and immediately iterates `result.items()`, but `pipeline.run` may return `None`, causing an `AttributeError`; add a `None` check after `result = pipeline.run(...)` (e.g., log an error and exit) before the print and loop that reference `result.items()`.

scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)
249-270: ⚠️ Potential issue | 🟡 Minor

Cleanup is skipped when validation fails. The `exit 1` at line 262 prevents the cleanup block (lines 265–269) from executing, leaving the Job, Service, and ConfigMap behind. Consider using a `trap` for cleanup to ensure resources are always released.

Suggested fix

Add a cleanup trap near the top of the script (after variable setup):

```diff
+cleanup() {
+  echo "Cleaning up..."
+  kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+  kubectl delete svc "$SVC" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+  kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+}
+trap cleanup EXIT
```

Then remove the explicit cleanup block at lines 265–269 and the `echo "Done."` (the trap handles it). The `exit 1` at line 262 will now trigger cleanup automatically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/validate_sft_e2e_multinode.sh` around lines 249-270: the script currently exits early on failure and skips resource cleanup; define a `cleanup()` function that deletes `$JOB_NAME`, `$SVC`, and the ConfigMap in `$NAMESPACE`, register it with `trap cleanup EXIT` (and optionally ERR) near the top after variables are set, and remove the explicit cleanup block at the end since the trap will handle resource deletion when `exit 1` is triggered.

scripts/k8s-tests/validate_sft_e2e.sh (1)
350-351: ⚠️ Potential issue | 🟡 Minor

Cleanup is incomplete — the ConfigMap is leaked, and cleanup is skipped on failure.

- The ConfigMap `sft-e2e-script` (created at line 236) is never deleted.
- Multiple `exit 1` paths (lines 303, 347) bypass the cleanup block entirely.

Use a `trap` and include the ConfigMap deletion.

Suggested fix — add near top of script after variable setup

```diff
+cleanup() {
+  kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+  kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+}
+trap cleanup EXIT
```

Then replace lines 350–351:

```diff
-# Cleanup
-kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/validate_sft_e2e.sh` around lines 350-351: add a robust cleanup via a `trap` so the Job and the ConfigMap `sft-e2e-script` are always removed even if the script exits early; implement a `cleanup()` function that deletes the Job (`$JOB_NAME`) and the ConfigMap in `$NAMESPACE` (use `--ignore-not-found` and redirect stderr as needed), register it with `trap cleanup EXIT INT TERM` near the top after variables are set, and remove any standalone `kubectl delete` lines that early exits would bypass.

scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)
69: ⚠️ Potential issue | 🟡 Minor

Hardcoded ConfigMap name can cause collisions between concurrent runs. The job name uses a unique timestamp suffix (`mn-sft-test-$(date +%s | tail -c 6)`), but the ConfigMap name is always `sft-e2e-script`. If two instances of this script run concurrently in the same namespace, one will overwrite the other's ConfigMap.

Suggested fix — derive ConfigMap name from job name

```diff
-kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
-kubectl create configmap sft-e2e-script -n "$NAMESPACE" --from-literal=train.py='
+CM_NAME="${JOB_NAME}-script"
+kubectl delete configmap "$CM_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+kubectl create configmap "$CM_NAME" -n "$NAMESPACE" --from-literal=train.py='
```

And update the volume reference at line 181:

```diff
-      - name: script
-        configMap: {name: sft-e2e-script}
+      - name: script
+        configMap: {name: ${CM_NAME}}
```

And cleanup at line 269:

```diff
-kubectl delete configmap sft-e2e-script -n "$NAMESPACE" --ignore-not-found 2>/dev/null
+kubectl delete configmap "$CM_NAME" -n "$NAMESPACE" --ignore-not-found 2>/dev/null
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/validate_sft_e2e_multinode.sh` at line 69: derive a unique ConfigMap name from the job name instead of the hardcoded `sft-e2e-script` (e.g., `CM_NAME="${JOB_NAME}-script"`), use it in `kubectl create configmap`, update the volume definition that currently points at `sft-e2e-script`, and update the cleanup command to delete `"$CM_NAME"` with `--ignore-not-found` so concurrent runs don't collide.

nemo_skills/pipeline/backends/base.py (1)
61-69: ⚠️ Potential issue | 🟡 Minor

Missing cross-validation: `memory_request_gb` > `memory_limit_gb`. When both `memory_request_gb` and `memory_limit_gb` are specified, the request should not exceed the limit. Kubernetes will reject a pod spec where the request exceeds the limit, so validating here gives a clearer error message.

Proposed fix

```diff
     if self.memory_limit_gb is not None and self.memory_limit_gb <= 0:
         raise ValueError(f"memory_limit_gb must be positive, got {self.memory_limit_gb}")
+    if (
+        self.memory_request_gb is not None
+        and self.memory_limit_gb is not None
+        and self.memory_request_gb > self.memory_limit_gb
+    ):
+        raise ValueError(
+            f"memory_request_gb ({self.memory_request_gb}) cannot exceed "
+            f"memory_limit_gb ({self.memory_limit_gb})"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/base.py` around lines 61-69: add a cross-validation in `__post_init__` so `memory_request_gb` does not exceed `memory_limit_gb`; after the existing positivity checks, if both fields are not `None` and `memory_request_gb > memory_limit_gb`, raise a `ValueError` with a clear message such as "memory_request_gb (X) cannot exceed memory_limit_gb (Y)".

docs/kubernetes-guide.md (1)
562-570: ⚠️ Potential issue | 🟡 Minor

Remove `cluster_config=cluster_config` from the `GenerationClientScript` example. `GenerationClientScript` does not have a `cluster_config` parameter; only `ServerScript` and `SandboxScript` accept it. Passing it here will cause a `TypeError` at runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/kubernetes-guide.md` around lines 562-570: the `GenerationClientScript` instantiation passes a non-existent `cluster_config` parameter, which will raise a `TypeError`; remove the `cluster_config=cluster_config` argument (leave `cluster_config` only on `ServerScript` or `SandboxScript`, where it is supported) so the call uses only valid parameters such as `output_dir`, `servers`, `model_names`, `server_types`, and `extra_arguments`.

docs/kubernetes-status.md (1)
39-47: ⚠️ Potential issue | 🟡 Minor

Contradictory placement: multi-node training is listed as "Not Implemented" but its status shows ✅ validated. Line 43 places multi-node training under "Not Implemented" yet shows "✅ Backend + cluster validated". The rest of this PR (runbook, manifests, `_run_sft_kubernetes()`) demonstrates a working Indexed Job + Headless Service approach. Consider moving this to "Degraded Functionality" with a note that CLI integration (`ns nemo_rl sft --num_nodes`) is pending, or updating the section heading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/kubernetes-status.md` around lines 39-47: the "Not Implemented" table incorrectly lists multi-node training despite the code (Indexed Job + Headless Service, runbook, and `_run_sft_kubernetes()`) showing it validated; move the multi-node training row into "Degraded Functionality" (or change the section heading) and add a short note that CLI integration via `ns nemo_rl sft --num_nodes` is still pending while the backend/cluster approach is validated.

docs/kubernetes-status.md (1)
54-61: ⚠️ Potential issue | 🟡 Minor

Stale reference: "Multi-node training needs MPI Operator" contradicts the validated Indexed Job approach. Line 56 states the MPI Operator is needed, but the PR validates multi-node via Indexed Job + Headless Service without it; line 85 repeats the claim. These references appear to be from an earlier design phase and should be updated to reflect the current architecture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/kubernetes-status.md` around lines 54-61: stale text claims multi-node training needs the MPI Operator, but the implementation uses Indexed Job + Headless Service; update the row tied to `ns nemo_rl sft --num_nodes=8` and the repeated claim near line 85 to state the validated approach ("supported via Indexed Job + Headless Service, no MPI Operator required") and remove or rephrase the earlier-design references.

scripts/k8s-tests/run_sft_k8s_real.py (1)
37-42: ⚠️ Potential issue | 🟡 Minor

`--skip-conversion` is a dead argument: always True, never referenced. `action="store_true"` with `default=True` makes the flag permanently True with no way to disable it. More importantly, `args.skip_conversion` is never used in `main()`. Either remove this argument or integrate it into the pipeline construction to conditionally skip the conversion job.

Suggested fix (remove if unused)

```diff
-    parser.add_argument(
-        "--skip-conversion",
-        action="store_true",
-        default=True,
-        help="Skip checkpoint conversion job (default: True for testing)",
-    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/run_sft_k8s_real.py` around lines 37-42: the `--skip-conversion` argument is dead (`action="store_true"` with `default=True` makes it always true, and `args.skip_conversion` is never used); either remove the declaration or wire `args.skip_conversion` into `main()` so the conversion job is conditionally created or submitted based on the flag.

nemo_skills/pipeline/backends/slurm.py (1)
120: ⚠️ Potential issue | 🟡 Minor

`" ".join(command)` can break commands with arguments containing spaces. If `main_container.command` contains arguments with spaces (e.g., `["python", "-c", "print('hello world')"]`), joining with a space and passing the result as a single string to `add_task` will lose the argument boundaries. Consider using `shlex.join()` for proper shell quoting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/slurm.py` at line 120: `cmd = " ".join(main_container.command)` breaks arguments that contain spaces; build `cmd` with `shlex.join(main_container.command)` (importing `shlex` at the top) so arguments are shell-quoted correctly before `cmd` is passed to `add_task`, keeping `main_container.command` as a list so argument boundaries are preserved.

scripts/k8s-tests/check_nccl_logs.py (1)
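The difference is easy to demonstrate with the standard library's `shlex`; this is a self-contained sketch (POSIX shell assumed), not the slurm.py code:

```python
import shlex
import subprocess
import sys

command = [sys.executable, "-c", "print('hello world')"]

# Naive join loses argument boundaries once the shell re-splits the string:
# the shell sees print('hello and world') as two separate words.
naive = " ".join(command)

# shlex.join quotes each argument so the shell reconstructs argv exactly.
quoted = shlex.join(command)

result = subprocess.run(quoted, shell=True, capture_output=True, text=True)
print(result.stdout.strip())  # hello world
```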
95-98: ⚠️ Potential issue | 🟡 Minor

`"IB" in line` is prone to false positives. The substring `"IB"` matches many unrelated words in log output (e.g., "LIB", "INHIBIT", "POSSIBLE", "VISIBLE"). This could cause `ib_detected` to be set incorrectly. Consider using a word-boundary regex or a more specific pattern.

Proposed fix

```diff
-        if "IB" in line and ("NCCL INFO" in line or "transport" in line.lower()):
+        if re.search(r'\bIB\b|InfiniBand|ibverbs', line) and ("NCCL INFO" in line or "transport" in line.lower()):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/check_nccl_logs.py` around lines 95-98: the plain substring check `"IB" in line` yields false positives; update the detection that sets `result.ib_detected` and appends to `result.transport_used` to use a word-boundary or more specific regex (e.g., `r'\bIB\b'`, "InfiniBand", "ibverbs"), applying case handling carefully while preserving the existing NCCL INFO/transport logic so only true "IB" tokens trigger detection.

scripts/k8s-tests/smoke_test_multi_node.sh (1)
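The word-boundary fix can be checked in isolation. The log lines below are made-up examples in the style of NCCL output, not real captures:

```python
import re

# Word-boundary pattern from the suggested fix.
ib_pattern = re.compile(r"\bIB\b|InfiniBand|ibverbs")

lines = [
    "node0: NCCL INFO NET/IB : Using interface mlx5_0",     # true positive
    "node0: NCCL INFO Loaded libnccl-net.so from LIB path",  # '"IB" in line' would wrongly match
]

for line in lines:
    # Compare regex detection against the naive substring check.
    print(bool(ib_pattern.search(line)), "IB" in line)
```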
131-139: ⚠️ Potential issue | 🟡 Minor

DNS wait loop silently continues on resolution failure. If the master address never resolves, the loop completes all 60 iterations and execution falls through to `torchrun`, which then fails with a confusing connection error. Compare with the multinode manifest (`real-nemo-rl-sft-multinode-test.yaml` lines 66-69), which correctly adds a post-loop check and `exit 1`.

Proposed fix

```diff
 # Wait for DNS resolution of master node
 echo "Waiting for master DNS resolution..."
 for i in \$(seq 1 60); do
   if getent hosts ${MASTER_ADDR} > /dev/null 2>&1; then
     echo "Master resolved: \$(getent hosts ${MASTER_ADDR})"
     break
   fi
   sleep 2
 done
+if ! getent hosts ${MASTER_ADDR} > /dev/null 2>&1; then
+  echo "FAIL: Master DNS not resolvable after 120s"
+  exit 1
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/smoke_test_multi_node.sh` around lines 131-139: the DNS wait loop using `getent hosts ${MASTER_ADDR}` may silently fail after 60 iterations and proceed to run torchrun; after the loop, verify resolution succeeded (re-run `getent` or track a success flag) and, if unresolved, print a descriptive error mentioning `MASTER_ADDR` and `exit 1` so torchrun is not invoked with a bad master address.

scripts/k8s-tests/check_nccl_logs.py (1)
100-103: ⚠️ Potential issue | 🟡 Minor

Regex alternation precedence issue. `r"NET.*Socket|TCP"` matches either `NET.*Socket` or just `TCP` anywhere, because `|` has the lowest precedence. A bare `TCP` match (e.g., in a log line mentioning "TCP_NODELAY" or any TCP reference) would set `tcp_fallback = True` incorrectly.

Proposed fix

```diff
-        if re.search(r"NET.*Socket|TCP", line) and "NCCL" in line:
+        if re.search(r"NET.*(?:Socket|TCP)", line) and "NCCL" in line:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/check_nccl_logs.py` around lines 100-103: the regex `r"NET.*Socket|TCP"` is too permissive because the alternation applies only to the left term; group the alternation (e.g., `r"NET.*(?:Socket|TCP)"`) or require word boundaries around `TCP` so incidental substrings like "TCP_NODELAY" do not set `tcp_fallback` or append to `transport_used`, and only the intended "NET...Socket"/"NET...TCP" matches do.

scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml (1)
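The precedence difference shows up directly (again with a made-up log line in NCCL's style):

```python
import re

loose = re.compile(r"NET.*Socket|TCP")      # '|' binds last: matches bare "TCP" anywhere
tight = re.compile(r"NET.*(?:Socket|TCP)")  # "TCP" only counts when preceded by "NET"

line = "NCCL INFO setting TCP_NODELAY on control connection"
print(bool(loose.search(line)), bool(tight.search(line)))  # True False
```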
101-105: ⚠️ Potential issue | 🟡 Minor

`|| true` on the Ray worker masks genuine connection failures. If the worker fails to connect to the Ray head (e.g., wrong address, network issue), the pod still exits 0 and the Job reports success. Consider adding a pre-check (e.g., verify the Ray head is reachable) or logging the exit code before the `|| true` so failures are at least observable in logs.

Suggested improvement

```diff
 else
   echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
-  ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block || true
+  ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
+  RAY_EXIT=$?
+  if [ $RAY_EXIT -ne 0 ]; then
+    echo "WARN: Ray worker exited with code $RAY_EXIT (expected when head shuts down)"
+  fi
   echo "Worker done (Ray head shut down)."
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around lines 101-105: the Ray worker start command masks failures by appending `|| true` to `ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block`; remove the unconditional `|| true`, pre-check that the head at `${MASTER_ADDR}:6379` is reachable (e.g., a TCP wait loop) before invoking `ray start`, and if `ray start` fails, capture and log its exit code (and exit non-zero) so connection failures are visible in logs and fail the Job.

nemo_skills/pipeline/backends/local.py (1)
198-213: ⚠️ Potential issue | 🟡 Minor

The timeout check uses `if timeout`, which treats `timeout=0` as "no timeout". `if timeout and (time.time() - start_time) > timeout` evaluates to `False` when `timeout=0`. Per `ComputeBackend.wait_for_completion`, `timeout=None` means "wait indefinitely". Using `if timeout is not None` would correctly distinguish 0 from None, though 0-second timeouts are admittedly an edge case.

Proposed fix

```diff
-            if timeout and (time.time() - start_time) > timeout:
+            if timeout is not None and (time.time() - start_time) > timeout:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/local.py` around lines 198-213: the timeout check in `wait_for_completion` uses `if timeout`, which treats `timeout=0` as no timeout; change the condition to `if timeout is not None and (time.time() - start_time) > timeout` so a 0-second timeout is honored (returning the current status after the first loop) while `None` still means wait indefinitely.

nemo_skills/pipeline/backends/kubernetes.py (1)
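The falsy-zero pitfall in isolation (hypothetical helper names, not the local.py code):

```python
def expired_truthy(timeout, elapsed):
    # Buggy variant: timeout=0 is falsy, so the check never fires.
    return bool(timeout and elapsed > timeout)

def expired_explicit(timeout, elapsed):
    # Fixed variant: only None means "wait indefinitely".
    return timeout is not None and elapsed > timeout

print(expired_truthy(0, 5), expired_explicit(0, 5))        # False True
print(expired_truthy(None, 5), expired_explicit(None, 5))  # False False
```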
759-779: ⚠️ Potential issue | 🟡 Minor

`_parse_timeout` raises an unhandled `ValueError` on invalid input. If `timeout_str` is something like `"abc"`, the final `int(timeout_str)` at line 779 raises a bare `ValueError` with no context about what went wrong. Similarly, `"1.5h"` would fail because `int("1.5")` raises.

Proposed fix

```diff
-    return int(timeout_str)
+    try:
+        return int(timeout_str)
+    except ValueError:
+        raise ValueError(f"Invalid timeout format: '{timeout_str}'. Use '6h', '30m', '300s', or integer seconds.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/kubernetes.py` around lines 759-779: `_parse_timeout` can raise an unhelpful `ValueError` (e.g., on "abc" or "1.5h"); accept fractional values for unit suffixes by parsing with `float` before converting to seconds, wrap the int/float conversions for unit suffixes and any HH:MM:SS parsing in `try/except` to catch `ValueError`, and raise a new `ValueError` whose message includes the original `timeout_str`; validate the part count and numeric content of the HH:MM:SS and MM:SS branches and raise similarly informative errors there.
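The real `_parse_timeout` isn't shown in this review, so the behavior the fix asks for can only be sketched. The standalone function below is hypothetical: it handles the suffixes and error cases named above, but its name and structure do not come from the actual module.

```python
import re

def parse_timeout(timeout_str: str) -> int:
    """Parse '6h', '30m', '300s', or plain seconds into integer seconds.

    Hypothetical sketch of the suggested behavior, not the actual
    _parse_timeout implementation.
    """
    units = {"h": 3600, "m": 60, "s": 1}
    text = timeout_str.strip().lower()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([hms]?)", text)
    if not match:
        raise ValueError(
            f"Invalid timeout format: '{timeout_str}'. "
            "Use '6h', '30m', '300s', or integer seconds."
        )
    value, unit = match.groups()
    # float() first so fractional inputs like '1.5h' work.
    return int(float(value) * units.get(unit, 1))

print(parse_timeout("1.5h"), parse_timeout("30m"), parse_timeout("300"))  # 5400 1800 300
```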
Addressed the review feedback and pushed fixes across backend, pipeline, scripts, and docs. What was fixed:
Validation run:
Logs were generated under
8d9211e to b766b9a
Actionable comments posted: 18
🧹 Nitpick comments (6)
docs/kubernetes-status.md (1)
57-59: Hardcoded source-line references will become stale. References like `declarative.py:641-647` and `declarative.py:692-694` will drift as code changes. Consider replacing them with symbol/function names or GitHub permalink anchors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/kubernetes-status.md` around lines 57-59: replace the hardcoded file:line references with stable identifiers — point to the function or class names in `declarative.py` that emit those runtime warnings (the one enforcing job-dependency sequencing and the one logging skipped external dependencies), or use GitHub permalinks to a specific commit, and update the affected table cells ("Job dependencies force sequential", "SSH tunneling not available", "External dependencies skipped") so the docs remain correct as code lines shift.

nemo_skills/pipeline/backends/kubernetes.py (1)
264-274: Add `strict=True` to `zip()` for container list alignment safety. `spec.containers` and `pod_spec.containers` should always be the same length since the latter is built from the former (lines 205-207). Adding `strict=True` acts as a safety net against future bugs that could break this invariant.

Proposed fix

```diff
-        for container_spec, container in zip(spec.containers, pod_spec.containers):
+        for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/kubernetes.py` around lines 264-274: use `zip(spec.containers, pod_spec.containers, strict=True)` in the loop so any length mismatch raises immediately; this affects the block that calls `self._build_container_mounts(container_spec.mounts, existing_volumes)`, extends `pod_spec.volumes` with `extra_volumes`, appends `extra_mounts` to `container_mounts`, and assigns `container.volume_mounts`.

scripts/k8s-tests/sft_smoke_test.py (1)
156-169: Both single-node and multi-node torchrun commands use a heredoc for script injection — consider a shared builder. The command construction for single-node (lines 164-168) and multi-node (lines 157-162) share most of their structure (NCCL env exports, heredoc, torchrun invocation). The only differences are the `--nnodes`, `--node_rank`, and `--master_addr` flags. A small helper could reduce the duplication, though this is acceptable for a smoke test script.
Verify each finding against the current code and only fix it if needed. In `scripts/k8s-tests/sft_smoke_test.py` around lines 156-169: refactor the duplicated torchrun command construction into a shared base that includes the NCCL exports, the heredoc injection of `SFT_TRAIN_SCRIPT`, and the common torchrun args (e.g., `--nproc_per_node` from `args.gpus` and `--master_port`), then conditionally append the multi-node flags (`--nnodes={args.nodes}`, `--node_rank=${NODE_RANK:-0}`, `--master_addr=${MASTER_ADDR:-localhost}`) when `args.nodes > 1`, so the only differences between single-node and multi-node are added conditionally rather than duplicating the full heredoc + torchrun string.

nemo_skills/pipeline/backends/local.py (1)
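A shared builder along the lines suggested could look like this. The function name and signature are hypothetical (the real script builds the command inline); the flag names follow torchrun's CLI, and the `NODE_RANK`/`MASTER_ADDR` env-var defaults are assumed to be set by the pod spec:

```python
def build_torchrun_cmd(script: str, gpus: int, nodes: int = 1,
                       master_port: int = 29500) -> str:
    """Hypothetical shared builder for the smoke test's torchrun invocation."""
    args = [f"--nproc_per_node={gpus}", f"--master_port={master_port}"]
    if nodes > 1:
        # Multi-node flags; env vars are expanded by the pod's shell at runtime.
        args += [
            f"--nnodes={nodes}",
            "--node_rank=${NODE_RANK:-0}",
            "--master_addr=${MASTER_ADDR:-localhost}",
        ]
    return f"torchrun {' '.join(args)} {script}"

print(build_torchrun_cmd("/tmp/train.py", gpus=8, nodes=2))
```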
240-258: Minor thread-safety concern in `get_logs` when reading from a shared list. Lines 242-244 iterate over `job.logs[container]` while `_collect_logs` (running on a daemon thread) concurrently appends to the same list. CPython's GIL makes this safe in practice, and the `follow` path (lines 247-258) properly uses index-based access to avoid issues. The initial yield loop (lines 242-244) could similarly use index-based access for consistency.
Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/backends/local.py` around lines 240-258: the initial log-yield loop in `get_logs` iterates `job.logs[container]` while `_collect_logs` may append concurrently; replace that for-loop with the index-based pattern used in the follow path (`current_logs = job.logs.get(container, [])`, then `while yielded < len(current_logs): yield current_logs[yielded]; yielded += 1`) so iteration is consistent and avoids concurrent-iteration issues.

nemo_skills/pipeline/backends/factory.py (1)
152-174: Registry-checked backends can shadow lazy-loaded built-in backends. If someone registers a backend with the name `"slurm"`, `"kubernetes"`, `"local"`, or `"none"` via `register_backend`, it will silently override the built-in lazy-import path (line 153 takes precedence). This may be intentional for extensibility, but it is worth documenting or validating.
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/factory.py` around lines 152 - 174, Registry entries currently override built-in lazy-loaded backends; to prevent silent shadowing, treat the built-in names {"slurm","kubernetes","local","none"} as reserved: update the register_backend function to validate the provided name against this reserved set and raise a ValueError (or reject) if someone attempts to register one of them, and also add a short comment in the factory function where _BACKEND_REGISTRY is checked to clarify the reserved-name behavior; reference register_backend, _BACKEND_REGISTRY, and the executor variable to locate where to add the validation and comment.nemo_skills/pipeline/utils/declarative.py (1)
260-260: `was_modified` condition can be simplified.

`name != original.lower() or original != original.lower()` is equivalent to `name != original` since `name` is entirely derived from `original` through the transformations above.

♻️ Proposed simplification

```diff
-    was_modified = name != original.lower() or original != original.lower()
+    was_modified = name != original
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` at line 260, The condition computing was_modified is overly complex; replace the expression "was_modified = name != original.lower() or original != original.lower()" with the simpler equality check "was_modified = name != original" because name is derived from original via the prior transformations; update this in the function/method where variables name, original, and was_modified are defined (search for the was_modified assignment in declarative.py) to remove redundant checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Add language specifiers to the three fenced code blocks
that currently lack them by changing their opening fences to ```text;
specifically update the blocks containing the log lines "Job name
'Qwen2.5_Math_7B' is not Kubernetes-compliant. Sanitized to 'qwen2-5-math-7b'."
and "Job 'Qwen2.5_Math_7B' will be submitted as 'qwen2-5-math-7b-12345'", and
the block that begins with the ASCII diagram line
"┌─────────────────────────..." (and the other similar code blocks referenced in
the comment) so they are fenced as ```text (or remove the fence for the ASCII
diagram if you prefer plain text).
In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 91-92: Replace the permissive lookup
cluster_config.get("executor") with direct access cluster_config["executor"] in
the KubernetesBackend validation so a missing executor raises KeyError
immediately; update the conditional in the function/method where
KubernetesBackend validates cluster_config (the block using
cluster_config.get("executor") and raising ValueError) to access
cluster_config["executor"] and keep the ValueError check for non-"kubernetes"
values.
- Around line 496-502: The try/except around int(gpu_count) currently swallows
TypeError/ValueError and leaves no trace, causing RDMA resources to be injected
for containers with malformed gpu_count; update the exception handler in the
gpu_count check (the block referencing gpu_count in the Kubernetes backend that
injects RDMA resources) to log a warning or error including the offending
gpu_count value and context (container/pod identifier if available) and then
continue (or skip injection) instead of silently passing, so malformed values
are visible in logs and RDMA injection is skipped for invalid counts.
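A minimal sketch of the logged-skip behavior the prompt describes — `should_inject_rdma` and the exact log wording are illustrative, not the backend's actual API:

```python
import logging

LOG = logging.getLogger("kubernetes_backend")


def should_inject_rdma(gpu_count, pod_name="<unknown>"):
    """Return True only when gpu_count parses to a positive integer.

    Malformed values are logged (with the offending value and pod context)
    instead of being silently swallowed, and RDMA injection is skipped.
    """
    try:
        return int(gpu_count) > 0
    except (TypeError, ValueError):
        LOG.warning(
            "Invalid gpu_count %r for pod %s; skipping RDMA resource injection",
            gpu_count,
            pod_name,
        )
        return False
```

This keeps the happy path unchanged while making bad configs visible in logs rather than producing a pod with unintended RDMA resources.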
In `@nemo_skills/pipeline/backends/slurm.py`:
- Line 136: The expression "main_container.resources.gpus or None" treats 0 as
falsy and converts CPU-only (0 GPUs) to None, which lets add_task (exp.py:497)
default to 1 GPU; change the assignment so 0 remains 0 and only actual None
stays None—e.g., replace num_gpus=main_container.resources.gpus or None with an
explicit None-check like num_gpus = main_container.resources.gpus if
main_container.resources.gpus is not None else None so CPU-only jobs are not
promoted to GPU requests.
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 809-819: The callable branch checking callable(script.inline) is
dead because _prepare_command (via prepare_for_execution and script.set_inline)
always leaves script.inline as a string; remove the unreachable branch and
simplify to use script.inline directly in the block that follows
_prepare_command; update the logic in the function containing this diff
(referencing _prepare_command, script.inline and _rewrite_local_paths) so it
only reads cmd_str = script.inline and optionally add a defensive
type-check/assert if you want to guard against future regressions, ensuring no
behavioral change for the Kubernetes path.
- Line 929: The log currently slices the list (container.command[:50]) which
never truncates the joined command; instead join the command list into a single
string (e.g., full_command = ' '.join(container.command)), then truncate
characters from that string and add an ellipsis only when truncated; update the
LOG.info call that references container.command so it logs something like the
first N characters of full_command (and appends "..." only if len(full_command)
> N) to ensure actual truncation of the command string.
- Around line 743-748: The current sequential branch only aborts when status ==
JobStatus.FAILED; change it to treat any non-SUCCEEDED terminal state as
failure: after status = backend.wait_for_completion(handle) check if status !=
JobStatus.SUCCEEDED and then raise RuntimeError with a descriptive message
including job_name and status.value. Update the LOG.info/LOG.error calls around
backend.wait_for_completion(handle) accordingly so the outcome is logged and
dependent jobs are not allowed to proceed when status is not SUCCEEDED.
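The suggested "any non-SUCCEEDED terminal state is a failure" check could look like this sketch; `JobStatus` here is a stand-in enum, not the pipeline's real class:

```python
from enum import Enum


class JobStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    UNKNOWN = "unknown"


def check_terminal_status(job_name, status):
    """Raise unless the terminal status is SUCCEEDED, so dependent
    jobs never proceed after a cancelled/unknown outcome either."""
    if status != JobStatus.SUCCEEDED:
        raise RuntimeError(
            f"Job '{job_name}' finished with status '{status.value}'; "
            "aborting dependent jobs"
        )
    return status
```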
- Around line 844-846: The code currently sets ports = [script.port] whenever
script has a "port" attribute, which can add invalid values like None or 0 into
the ContainerSpec; update the guard around the assignment (where ports and
script are referenced) to only append script.port when it is truthy/valid (e.g.,
non-None and >0) before passing ports into ContainerSpec so ContainerSpec
receives only valid port numbers.
- Around line 696-700: Replace the safe lookup dep.get("name") with direct
access dep["name"] where the code expects the key to exist: update the
dict-handling branch inside the dependency parsing (the elif block that sets
dep_name from dep) and the corresponding occurrence inside the _run_nemo_run
function so they use dep["name"] (keeping the subsequent if not dep_name check
as-is); this ensures a missing "name" raises a KeyError immediately instead of
returning None.
In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 59-60: The NCCL WARN messages are being appended to result.errors
but should populate the declared warnings list and not trigger a FAIL; update
the parsing logic where "NCCL WARN" (currently appended to result.errors) to
append to result.warnings instead, and adjust validate_result to only treat
result.errors as a hard FAIL (ignore or report result.warnings as non-fatal
info). Locate the dataclass fields errors and warnings and the function
validate_result and change the two/three occurrences that add NCCL WARN to use
result.warnings, and ensure validate_result checks result.errors (not warnings)
to decide a FAIL.
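A sketch of the suggested split between fatal errors and non-fatal warnings; the `NCCL ERROR` match pattern is an assumed example here, not necessarily what check_nccl_logs.py actually greps for:

```python
from dataclasses import dataclass, field


@dataclass
class LogCheckResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def parse_nccl_line(result, line):
    # WARN lines are informational and must not fail the check.
    if "NCCL WARN" in line:
        result.warnings.append(line)
    # Error pattern is illustrative; adjust to the real log format.
    elif "NCCL ERROR" in line:
        result.errors.append(line)


def validate_result(result):
    # Only hard errors decide FAIL; warnings are reported as info.
    return "FAIL" if result.errors else "PASS"
```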
In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Line 31: The Dockerfile currently leaves the final image running as root (due
to the Stage 2 "USER root" persisting into Stage 4) and uses deprecated "apt-key
adv"; update the Dockerfile to explicitly drop privileges in the final stage by
creating a least-privilege user (e.g., add user nemo, chown needed dirs, and set
USER nemo) or, if root is genuinely required for GPU verification, add an
explicit comment justifying it and restrict scope to only the verification
steps; also replace any "apt-key adv" usage with the signed-by pattern: fetch
the GPG into /usr/share/keyrings/<name>-archive-keyring.gpg and reference it in
the apt source entry (apply the same changes for the other occurrences flagged
in lines 84-104).
- Around line 38-43: The Dockerfile adds the NVIDIA devtools repo over HTTP,
uses apt-key to fetch the wrong/obsolete CUDA key (7fa2af80.pub) from the
compute/cuda path, and should instead install the modern cuda-keyring and add
the devtools repo via HTTPS with signed-by; replace the apt-key fetch and
plain-HTTP source creation with installing the cuda-keyring (e.g., download and
dpkg -i cuda-keyring.deb) and then write the repo line to
/etc/apt/sources.list.d/nvidia-devtools.list as "deb
[signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg]
https://developer.download.nvidia.com/devtools/repos/ubuntu2404/$(dpkg
--print-architecture) /", followed by apt-get update and the nsight-systems-cli
install, and remove the apt-key adv call and the old key filename.
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 145-170: Add a minimal pod/container securityContext to the
trainer container to disable privilege escalation and run non-root where
possible: in the pod spec that sets hostNetwork: true and defines the containers
list (container name "trainer", image nemo-skills/nemo-rl-full, command
["/scripts/run_sft_mn.sh"]), add a securityContext block on the container with
allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally specify
runAsUser if the image supports a non-root UID), and if needed add an equivalent
pod-level securityContext to enforce fsGroup/runAsUser constraints so GPU/CUDA
and Ray networking still work; ensure these changes do not conflict with GPU
device access and test the "trainer" container startup.
In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 92-129: The MULTI_NODE_TRAIN_CMD string uses torchrun with a -c
inline code flag which torchrun doesn't support; change the implementation to
write the same Python snippet to a temporary script file (like the single-node
template does, e.g. /tmp/smoke_train.py via a heredoc or echo >> file) and pass
that script path to torchrun instead of using -c, ensuring the env vars and
torchrun args (--nproc_per_node, --nnodes, --node_rank, --master_addr,
--master_port) remain unchanged and the script contains the existing distributed
setup (dist.init_process_group, DDP, all_reduce, etc.).
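The temp-script approach the prompt suggests (mirroring the single-node template) can be sketched like this — `write_train_script` and the path `/tmp/smoke_train.py` are illustrative, and the snippet body is truncated to a placeholder:

```python
import os
import tempfile


def write_train_script(snippet, path=None):
    """Write the inline Python snippet to a script file and return its
    path, so torchrun can be given a file argument (torchrun has no -c
    inline-code flag)."""
    if path is None:
        fd, path = tempfile.mkstemp(suffix=".py", prefix="smoke_train_")
        os.close(fd)
    with open(path, "w") as f:
        f.write(snippet)
    return path


def build_torchrun_cmd(script_path, nproc_per_node, nnodes, master_port=29500):
    # env vars NODE_RANK / MASTER_ADDR are expanded by the shell at runtime
    return (
        f"torchrun --nproc_per_node={nproc_per_node} --nnodes={nnodes} "
        f"--node_rank=${{NODE_RANK:-0}} --master_addr=${{MASTER_ADDR:-localhost}} "
        f"--master_port={master_port} {script_path}"
    )
```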
In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 174-178: Replace the broad "except Exception" around the log
streaming loop that calls backend.get_logs(handle) with a specific exception
handler (e.g., kubernetes.client.ApiException) or remove the try/except so
unexpected errors propagate; import ApiException from kubernetes.client and
change the block to "except ApiException as e:" and print/log that error,
allowing any other exceptions (AttributeError, TypeError, etc.) to bubble up
instead of being silently swallowed.
- Line 52: Update the pip install invocation in the string that currently
contains "pip install -q accelerate 2>/dev/null &&" so that install stderr is
not discarded; specifically remove the "2>/dev/null" redirection (or replace it
with a safe stderr-preserving redirect to container logs) so any pip error
output from the install step is visible in the container logs.
In `@scripts/k8s-tests/smoke_test_multi_node.sh`:
- Around line 272-277: The interactive read prompt block (read -p ...; if [[
$REPLY =~ ^[Yy]$ ]]; then kubectl delete job "$JOB_NAME" -n "$NAMESPACE"
--ignore-not-found; kubectl delete service "$SERVICE_NAME" -n "$NAMESPACE"
--ignore-not-found; echo "Resources deleted."; fi) will hang in CI; remove this
entire prompt and conditional deletion and rely on the existing cleanup trap to
perform kubectl deletions for JOB_NAME and SERVICE_NAME in NAMESPACE so the
script is non-interactive and safe for automated pipelines.
- Around line 20-30: The script lacks a cleanup trap so the created Kubernetes
Job and Service (JOB_NAME, SERVICE_NAME) can leak on early exit; add a cleanup
function that deletes the Job and Service (using JOB_NAME and SERVICE_NAME) and
register it with trap cleanup EXIT INT TERM immediately after the variable setup
block (after SCRIPT_DIR assignment) so any premature exit will run cleanup and
remove the resources.
---
Duplicate comments:
In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 103-108: The current falsy check after direct indexing is
misleading because cluster_config["executor"] already exists; replace the vague
guard with an explicit validation that the value is a non-empty string: after
assigning executor = cluster_config["executor"], validate with something like if
not isinstance(executor, str) or not executor.strip(): raise
ValueError("cluster_config['executor'] must be a non-empty string"), then call
executor = executor.lower(); this makes the intent clear and prevents calling
.lower() on invalid types or empty strings.
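The explicit validation described above might look like the following sketch (`get_executor` is an illustrative wrapper name):

```python
def get_executor(cluster_config):
    """Return the normalized executor name.

    Direct indexing raises KeyError when 'executor' is missing; the
    explicit type/empty check prevents calling .lower() on invalid
    types or empty strings.
    """
    executor = cluster_config["executor"]
    if not isinstance(executor, str) or not executor.strip():
        raise ValueError("cluster_config['executor'] must be a non-empty string")
    return executor.lower()
```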
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 470-477: The code now uses direct dict access for executor in two
places (self.cluster_config["executor"] and cluster_config["executor"]); ensure
both locations consistently use direct access and add an explicit, descriptive
KeyError if the key is missing rather than silently defaulting—update the logic
around the conditional that calls _run_kubernetes and _run_nemo_run to check for
"executor" presence and raise a clear error (e.g., "cluster config missing
'executor'") so callers of self._run_kubernetes and self._run_nemo_run get
actionable feedback.
- Around line 879-917: The _parse_timeout function should raise explicit
ValueError for unrecognized formats (addressing the prior silent fall-through);
ensure the implementation in _parse_timeout preserves exception chaining (use
"raise ... from e" for ValueError from int parsing), returns the 6-hour default
when timeout_str is empty, and covers all supported formats ('3600', '6h',
'30m', '06:00:00'); add unit tests for edge cases (empty string, numeric
seconds, 'h'/'m' suffixes, 'HH:MM:SS' and 'MM:SS', invalid strings) to validate
the new behavior and error messages.
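One possible shape for `_parse_timeout` covering the listed formats, with explicit errors and exception chaining — a sketch, not the repository's implementation:

```python
def parse_timeout(timeout_str, default_seconds=6 * 3600):
    """Parse '3600', '6h', '30m', 'HH:MM:SS', or 'MM:SS' into seconds.

    Empty input returns the 6-hour default; anything unrecognized
    raises ValueError with the original cause chained.
    """
    if not timeout_str:
        return default_seconds
    s = timeout_str.strip().lower()
    try:
        if s.endswith("h"):
            return int(s[:-1]) * 3600
        if s.endswith("m"):
            return int(s[:-1]) * 60
        if ":" in s:
            parts = [int(p) for p in s.split(":")]
            if len(parts) == 3:
                h, m, sec = parts
                return h * 3600 + m * 60 + sec
            if len(parts) == 2:
                m, sec = parts
                return m * 60 + sec
            raise ValueError("too many ':' segments")
        return int(s)
    except ValueError as e:
        raise ValueError(f"Unrecognized timeout format: {timeout_str!r}") from e
```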
---
Nitpick comments:
In `@docs/kubernetes-status.md`:
- Around line 57-59: Replace the hardcoded file:line references in the table
with stable identifiers: point to the actual function or class names that emit
those runtime warnings in declarative.py (e.g., the function that enforces
job-dependency sequencing and the function that logs skipped external
dependencies) or use a GitHub permalink to a specific commit/anchor rather than
`declarative.py:641-647`/`692-694`; update the three table cells ("Job
dependencies force sequential", "SSH tunneling not available", "External
dependencies skipped") to reference those symbol names or permalink anchors so
the docs remain correct as code lines shift.
In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 152-174: Registry entries currently override built-in lazy-loaded
backends; to prevent silent shadowing, treat the built-in names
{"slurm","kubernetes","local","none"} as reserved: update the register_backend
function to validate the provided name against this reserved set and raise a
ValueError (or reject) if someone attempts to register one of them, and also add
a short comment in the factory function where _BACKEND_REGISTRY is checked to
clarify the reserved-name behavior; reference register_backend,
_BACKEND_REGISTRY, and the executor variable to locate where to add the
validation and comment.
In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 264-274: The loop that pairs spec.containers with
pod_spec.containers should use zip(..., strict=True) to enforce equal-length
alignment; update the loop in the Kubernetes backend where you currently have
for container_spec, container in zip(spec.containers, pod_spec.containers): to
use zip(spec.containers, pod_spec.containers, strict=True) so any mismatch is
raised immediately—this affects the block that calls
self._build_container_mounts(container_spec.mounts, existing_volumes), extends
pod_spec.volumes with extra_volumes, appends extra_mounts to container_mounts,
and assigns container.volume_mounts.
In `@nemo_skills/pipeline/backends/local.py`:
- Around line 240-258: The initial log-yield loop in get_logs reads
job.logs[container] with a for-loop while _collect_logs may append concurrently;
replace that for-loop with the same index-based pattern used in the follow path:
grab current_logs = job.logs.get(container, []) and use while yielded <
len(current_logs): yield current_logs[yielded]; yielded += 1 so iteration is
consistent and avoids concurrent-iteration issues with _collect_logs.
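The index-based pattern recommended here, in isolation — `drain_logs` is an illustrative helper, not the backend's method:

```python
def drain_logs(log_list, yielded=0):
    """Yield log lines by index.

    len(log_list) is re-evaluated every iteration, so lines appended
    by a writer thread mid-iteration are still picked up, and there is
    no for-loop over a list that is mutating underneath it.
    """
    while yielded < len(log_list):
        yield log_list[yielded]
        yielded += 1
```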
In `@nemo_skills/pipeline/utils/declarative.py`:
- Line 260: The condition computing was_modified is overly complex; replace the
expression "was_modified = name != original.lower() or original !=
original.lower()" with the simpler equality check "was_modified = name !=
original" because name is derived from original via the prior transformations;
update this in the function/method where variables name, original, and
was_modified are defined (search for the was_modified assignment in
declarative.py) to remove redundant checks.
In `@scripts/k8s-tests/sft_smoke_test.py`:
- Around line 156-169: Refactor the duplicated torchrun command construction by
building a shared base command that includes the NCCL exports, heredoc injection
of SFT_TRAIN_SCRIPT and the common torchrun args (e.g., --nproc_per_node from
args.gpus and --master_port), then conditionally append the multi-node flags
(--nnodes={args.nodes}, --node_rank=${{NODE_RANK:-0}},
--master_addr=${{MASTER_ADDR:-localhost}}) when args.nodes > 1; update the
variable cmd to use this builder so the only differences between single-node and
multi-node are added conditionally rather than duplicating the full heredoc +
torchrun string.
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (17)
scripts/k8s-tests/smoke_test_single_node.sh-119-124 (1)
119-124: ⚠️ Potential issue | 🟡 Minor

Exact float comparison in all-reduce assertion — use approximate comparison.

Line 124 uses `assert tensor.item() == expected` for comparing the all-reduce result. While this works for small world sizes (sum of small integers is exact in float32), the equivalent check in `smoke_test_multi_node.sh` (Line 196) correctly uses `abs(...) < 1e-5`. Be consistent and prefer the safer approximate comparison.

🔧 Proposed fix

```diff
-assert tensor.item() == expected, f"All-reduce failed: got {tensor.item()}, expected {expected}"
+assert abs(tensor.item() - expected) < 1e-5, f"All-reduce failed: got {tensor.item()}, expected {expected}"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 119 - 124, The assertion uses an exact float equality check on tensor.item() which can be unstable; change the all-reduce verification after dist.all_reduce (variables: tensor, rank, world_size) to use an approximate comparison like abs(tensor.item() - expected) < 1e-5 (matching smoke_test_multi_node.sh) and keep the same descriptive failure message including the actual and expected values; update the assertion line accordingly so it validates within a small tolerance instead of using ==.

scripts/k8s-tests/manifests/multi-node-sft-test.yaml-125-132 (1)
125-132: ⚠️ Potential issue | 🟡 Minor

DNS wait loop has no post-loop failure check — torchrun will proceed even if DNS never resolved.

The loop at Lines 125–129 retries DNS resolution up to 60 times but doesn't exit on failure. Compare with the equivalent in `smoke_test_multi_node.sh` (lines 140–143) which explicitly checks after the loop and exits with an error message. If DNS never resolves, `torchrun` will fail with a less obvious connection error.

🔧 Proposed fix

```diff
 for i in $(seq 1 60); do
   getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local && break
   echo "  retry $i..."
   sleep 2
 done
+if ! getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local > /dev/null 2>&1; then
+  echo "FAIL: Master DNS not resolvable after 120s"
+  exit 1
+fi
 torchrun --nproc_per_node=2 --nnodes=2 \
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 125 - 132, The DNS retry loop before invoking torchrun currently lacks a post-loop failure check, so torchrun may run even if getent never succeeded; after the for loop that runs getent hosts sft-multi-node-0... add an explicit check that the last getent succeeded (or test resolvability of sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local) and if not, print an error message and exit non-zero (mirror the behavior in smoke_test_multi_node.sh), ensuring torchrun --nproc_per_node=2 --nnodes=2 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=29500 /scripts/train.py only runs when DNS resolution succeeded.

scripts/k8s-tests/smoke_test_multi_node.sh-108-114 (1)
108-114: ⚠️ Potential issue | 🟡 Minor

`WORLD_SIZE` env var is set to `NUM_NODES`, not the actual world size.

Line 114 sets `WORLD_SIZE` to `"${NUM_NODES}"`, which is the number of nodes, not the total number of ranks (NUM_NODES * NUM_GPUS). While `torchrun` overrides `WORLD_SIZE` for the spawned Python process, and `WORLD_SIZE_TOTAL` is separately set at Line 213, this env var is misleading and could cause confusion if anything reads it before `torchrun` starts.

🔧 Proposed fix

```diff
 - name: WORLD_SIZE
-  value: "${NUM_NODES}"
+  value: "${WORLD_SIZE}"
```

Where `WORLD_SIZE` (already computed at Line 43 as `NUM_NODES * NUM_GPUS`) would be the correct total. Alternatively, rename this env var to `NUM_NODES` to avoid confusion.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 108 - 114, The env var WORLD_SIZE in the container manifest is incorrectly assigned NUM_NODES; update the assignment so WORLD_SIZE uses the precomputed total-rank value (the script variable that multiplies NUM_NODES * NUM_GPUS, referenced as the computed WORLD_SIZE/WORLD_SIZE_TOTAL) instead of NUM_NODES, or alternatively rename the manifest key to NUM_NODES to reflect it actually holds node count; change the env entry for name: WORLD_SIZE to reference the total-rank variable (or rename the env var to NUM_NODES) so the intent is unambiguous.

scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml-103-108 (1)
103-108: ⚠️ Potential issue | 🟡 Minor

Dead error-handling code: `set -e` causes exit before `RAY_EXIT=$?` is reached on failure.

Line 40 sets `set -e`, so if `ray start --block` (Line 103) exits with a non-zero status, the script terminates immediately. The `RAY_EXIT=$?` assignment on Line 104 and the subsequent check (Lines 105–108) are only reached when the exit code is 0, making the error branch unreachable.

Either disable `set -e` around this command or rely on `set -e` to propagate the failure directly:

🔧 Option A: rely on set -e (simplify)

```diff
 echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
 ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
-RAY_EXIT=$?
-if [ $RAY_EXIT -ne 0 ]; then
-  echo "FAIL: Ray worker exited with code $RAY_EXIT"
-  exit $RAY_EXIT
-fi
 echo "Worker done (Ray head shut down)."
```

🔧 Option B: capture exit code (disable set -e locally)

```diff
 echo "Starting Ray worker node (joining head at ${MASTER_ADDR}:6379)..."
+set +e
 ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
 RAY_EXIT=$?
+set -e
 if [ $RAY_EXIT -ne 0 ]; then
   echo "FAIL: Ray worker exited with code $RAY_EXIT"
   exit $RAY_EXIT
 fi
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around lines 103 - 108, The current error-handling after running "ray start --address=\"${MASTER_ADDR}:6379\" --num-gpus=${NUM_GPUS} --block" is dead because "set -e" causes the script to exit before RAY_EXIT is captured; fix by temporarily disabling errexit around that command: before invoking the ray start line, do a local "set +e", run the command, capture the exit code into RAY_EXIT, then restore "set -e" and perform the existing echo+exit check, ensuring the RAY_EXIT variable and check (RAY_EXIT=$? and the if [ $RAY_EXIT -ne 0 ] block) are reachable; alternatively you can remove the manual RAY_EXIT check and rely on global "set -e" to propagate failures (choose one approach and update the block around the "ray start" call and the RAY_EXIT usage accordingly).

scripts/k8s-tests/image-build/import-image.yaml-46-89 (1)
46-89: ⚠️ Potential issue | 🟡 Minor

DaemonSet pod will restart continuously after the `sleep 10` completes.

Because DaemonSet pods are always restarted by Kubernetes, the `sleep 10; exit 0` pattern at the end of the script causes the pod to enter a restart loop. Each restart re-runs the full import sequence. While `ctr images import` is idempotent, the pods will show as `CrashLoopBackOff` (or rapid-restart) in `kubectl get pods`, which is misleading given the stated "pass criteria: no restarts."

Consider replacing `sleep 10` with `sleep infinity` so pods remain in `Running` state after a successful import and are only terminated when the DaemonSet is explicitly deleted:

```diff
-# Keep running briefly so logs are capturable
-sleep 10
+echo "=== Import succeeded. Sleeping until DaemonSet is deleted. ==="
+sleep infinity
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 46 - 89, The DaemonSet pod restarts after "sleep 10" causing restart loops; change the command script in the container's command block to stop exiting and instead block indefinitely (e.g., replace the final "sleep 10" with "sleep infinity" or an equivalent long-running wait) so the container stays in Running after successful import; update the script that uses CTR (the shell block that runs "ctr --address /run/containerd/containerd.sock -n k8s.io images import /raid/nemo-rl.tar" and then "sleep 10") to use an indefinite sleep/tail to prevent Kubernetes from restarting the pod.

cluster_configs/kubernetes/rbac.yaml-77-95 (1)
77-95: ⚠️ Potential issue | 🟡 Minor

Services `create`/`delete` is required for multi-node jobs but increases blast radius; document for operators.

The implementation calls `create_namespaced_service()` to create headless services for multi-node job discovery (in `submit_job()`) and `delete_namespaced_service()` to clean them up (in `cleanup()`), so the RBAC verbs are correct and necessary. However, Trivy KSV-0056 is valid: a compromised pod using this service account could create or remove arbitrary services in the namespace. This is intentional for the headless service lifecycle but operators should understand the threat model when applying this RBAC.

Also note: `configmaps` is not used in the codebase and can be removed from the role entirely. No `patch` verb is needed since ConfigMaps are never created, updated, or modified by the Kubernetes backend.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cluster_configs/kubernetes/rbac.yaml` around lines 77 - 95, The Services permission (verbs "create","delete") is indeed required by submit_job() (it calls create_namespaced_service()) and cleanup() (it calls delete_namespaced_service()), but it increases blast radius—update documentation/operator notes to explicitly call out this threat model and why the service-account needs create/delete for headless service lifecycle; also remove the unused ConfigMaps permission (the "configmaps" rule) from the Role since the codebase does not use or patch ConfigMaps, and keep the existing Secrets and Events read-only verbs as-is.

docs/kubernetes-nemo-rl-sft-runbook.md-75-75 (1)
75-75: ⚠️ Potential issue | 🟡 Minor

Add language specifiers to the three unlabelled fenced code blocks (MD040).

Lines 75, 116, and 137 open fenced blocks without a language tag, triggering markdownlint MD040. Use `text` for log output and the ASCII diagram.

🔧 Proposed fix

````diff
-```
+```text
 ▶ Setting up data...
````

````diff
-```
+```text
 Node 0 (Ray head, dgx-30):
````

````diff
-```
+```text
 ✓ Ray cluster initialized with 2 nodes
````

Also applies to: 116-116, 137-137

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-nemo-rl-sft-runbook.md` at line 75, Three fenced code blocks that begin with "▶ Setting up data...", "Node 0 (Ray head, dgx-30):", and "✓ Ray cluster initialized with 2 nodes" are missing language specifiers, triggering MD040; update each opening fence from ``` to ```text so they become ```text ... ``` to mark them as plain text/console output (search for those exact block starts to locate the three fences and replace the fence only).

scripts/k8s-tests/image-build/build-nemo-rl.yaml-53-53 (1)
53-53: ⚠️ Potential issue | 🟡 Minor

Pin Kaniko and init-container images to specific versions.

Both `alpine/git:latest` and `gcr.io/kaniko-project/executor:latest` are unpinned. For a build-artifact job whose explicit purpose is reproducibility, `:latest` is contradictory — a silent upstream change will produce a different build product without any code change.

🔧 Proposed fix

```diff
-  image: alpine/git:latest
+  image: alpine/git:2.45.2
```

```diff
-  image: gcr.io/kaniko-project/executor:latest
+  image: gcr.io/kaniko-project/executor:v1.23.0
```

Also applies to: 101-101

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 53, Replace the unpinned images with immutable versioned tags or digests: change "image: alpine/git:latest" to a specific alpine/git tag or digest (e.g., "alpine/git:2.34.1" or "alpine/git@sha256:...") and replace "gcr.io/kaniko-project/executor:latest" with a fixed kaniko release tag or digest (e.g., "gcr.io/kaniko-project/executor:v1.10.0" or the image@sha256:...). Update any other occurrences (noted around lines 101) so the build uses pinned versions for reproducibility.

scripts/k8s-tests/image-build/kaniko-build.sh-37-37 (1)
37-37: ⚠️ Potential issue | 🟡 Minor

Remove unused `REPO_ROOT` variable (SC2034).

`REPO_ROOT` is computed but never referenced anywhere in the script.

🔧 Proposed fix

```diff
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 37, The variable REPO_ROOT is computed but never used (SC2034); remove the unused assignment REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" from kaniko-build.sh (referencing SCRIPT_DIR if still needed elsewhere) so the script no longer defines an unused variable—delete that line or replace it with a used expression if you intended to reference the repository root elsewhere.

cluster_configs/example-kubernetes.yaml-120-126 (1)
120-126: ⚠️ Potential issue | 🟡 Minor

Misplaced "Fallback backend" comment creates confusing documentation.

Line 120 reads `# Fallback backend if Kubernetes is unavailable` but the immediately following active configuration belongs to `image_pull_policy` (line 123). The `fallback_executor` configuration actually appears at line 126. Reorder the comments to match their respective settings.

🔧 Proposed fix

```diff
-# Fallback backend if Kubernetes is unavailable
 # Image pull policy (default: IfNotPresent)
 # Set to "Never" when using locally-loaded images via ctr import
 # image_pull_policy: Never

+# Fallback backend if Kubernetes is unavailable
 # Useful for development/testing when K8s cluster is not accessible
 # fallback_executor: local
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cluster_configs/example-kubernetes.yaml` around lines 120 - 126, The comment "Fallback backend if Kubernetes is unavailable" is misplaced above the image_pull_policy setting; move or swap the comment so it sits immediately above the fallback_executor setting and ensure the image_pull_policy comment remains directly above image_pull_policy. Specifically, locate the comments and keys image_pull_policy and fallback_executor in the snippet and reorder them so each descriptive comment precedes its matching config key (place the fallback backend comment directly above fallback_executor and keep the image pull policy comment directly above image_pull_policy).

cluster_configs/example-kubernetes.yaml-148-155 (1)
148-155: ⚠️ Potential issue | 🟡 Minor

Missing `watch` verb for `services` in the inline RBAC example.

The inline RBAC comment at line 154 lists `["create", "delete", "get", "list"]` for the `services` resource, but `docs/kubernetes-guide.md` (line 118) includes `"watch"` as well. Headless Service lifecycle management during multi-node jobs benefits from `watch` access to detect creation/deletion races. The inconsistency will confuse users who copy from this file.

🔧 Proposed fix

```diff
-# - apiGroups: [""]
-#   resources: ["services"]
-#   verbs: ["create", "delete", "get", "list"]  # Required for multi-node headless services
+# - apiGroups: [""]
+#   resources: ["services"]
+#   verbs: ["create", "delete", "get", "list", "watch"]  # Required for multi-node headless services
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cluster_configs/example-kubernetes.yaml` around lines 148-155, the inline RBAC example is missing the "watch" verb on the services resource; update the commented block that defines the resources for "services" (the line showing resources: ["services"] and verbs: ["create", "delete", "get", "list"]) to include "watch" in the verbs array so it matches docs/kubernetes-guide.md and allows headless Service lifecycle watches during multi-node jobs.

scripts/k8s-tests/image-build/build-and-load.sh-19-29 (1)
19-29: ⚠️ Potential issue | 🟡 Minor

No `trap`-based cleanup — helper pods may linger on failure.

The script creates multiple ephemeral pods (`build-prep`, `build-nemo-rl-local`, `cache-copy`, `load-$NODE`) but has no `trap ... EXIT` handler. If the script exits early due to `set -e`, these pods remain in the cluster. Compare with `validate_sft_e2e.sh` (line 61) and `validate_sft_e2e_multinode.sh` (lines 54-58), which both use trap-based cleanup.

Consider adding a trap that deletes known helper pods on exit.
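A minimal sketch of such a trap, assuming the helper-pod names listed above and a `NAMESPACE` variable (both hypothetical here, not the script's actual layout):

```shell
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="${NAMESPACE:-default}"
# Assumed helper-pod names, mirroring the ones the script creates.
HELPER_PODS=(build-prep build-nemo-rl-local cache-copy)

cleanup() {
  # Runs on every exit path, including early exits under `set -e`.
  for pod in "${HELPER_PODS[@]}"; do
    kubectl delete pod "$pod" -n "$NAMESPACE" --ignore-not-found >/dev/null 2>&1 || true
  done
}
trap cleanup EXIT
```

Registering the trap immediately after the variables are set guarantees the pods are deleted even if a later `kubectl wait` or `kubectl cp` step fails.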
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 19-29, the script lacks a trap-based cleanup so helper pods (build-prep, build-nemo-rl-local, cache-copy, load-$NODE) can linger on failure; add a cleanup function (e.g., cleanup()) that runs on EXIT via trap cleanup EXIT and uses kubectl -n "$NAMESPACE" delete pod for those known pod names (use patterns or the variables that construct load-$NODE) and any temp resources, then call that trap near the top (after NAMESPACE/CACHE_PATH/IMAGE_TAG are set) so any early exit from the script (set -e) will remove the ephemeral pods.

nemo_skills/pipeline/backends/kubernetes.py-264-274 (1)
264-274: ⚠️ Potential issue | 🟡 Minor

`zip()` without `strict=True` silently ignores mismatched container lists.

If `spec.containers` and `pod_spec.containers` (which is the `containers` list built above) have different lengths — for instance, if `_build_container` filters or skips a spec — the loop would silently process only the shorter list, leaving some containers without their mounts.

Proposed fix

```diff
-        for container_spec, container in zip(spec.containers, pod_spec.containers):
+        for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/kubernetes.py` around lines 264-274, the loop pairing spec.containers and pod_spec.containers uses zip without strict=True so mismatched lengths are silently ignored; update the code that iterates over container_spec and container to either validate that len(spec.containers) == len(pod_spec.containers) and raise a clear error before calling _build_container_mounts, or replace zip(...) with zip(..., strict=True) to force an exception on mismatch; ensure the change lives alongside the existing container_mounts handling and volume_mounts assignment so any mismatch is detected instead of silently skipped.

scripts/k8s-tests/pipeline_smoke_test.py-246-250 (1)
246-250: ⚠️ Potential issue | 🟡 Minor

Broad `except Exception` swallows unexpected log-retrieval errors.

Per coding guidelines, don't catch exceptions not normally expected to be raised. `get_logs()` already handles `ApiException` internally. Let other exceptions propagate.

Proposed fix

```diff
-    try:
-        for line in backend.get_logs(handle):
-            print(line, end="" if line.endswith("\n") else "\n")
-    except Exception as e:
-        print(f"Failed to get logs: {e}")
+    for line in backend.get_logs(handle):
+        print(line, end="" if line.endswith("\n") else "\n")
```

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 246-250, the broad "except Exception as e" around the log retrieval loop should be removed so unexpected errors from backend.get_logs(handle) can propagate; edit the block containing backend.get_logs(handle) and the print loop (the try/except around that loop) to directly iterate and print lines without catching Exception, since get_logs already handles ApiException internally.

scripts/k8s-tests/sft_smoke_test.py-213-222 (1)
213-222: ⚠️ Potential issue | 🟡 Minor

Missing `backend.cleanup(handle)` after job completion.

Unlike `run_sft_k8s_real.py` (line 181), this script doesn't call `backend.cleanup(handle)` after waiting for completion. For multi-node jobs, this means the headless service will not be cleaned up.

Proposed fix

```diff
     if status in (JobStatus.SUCCEEDED, JobStatus.FAILED):
         print("\n--- Logs ---")
         for line in backend.get_logs(handle):
             print(line, end="" if line.endswith("\n") else "\n")
+        backend.cleanup(handle)

     if status == JobStatus.FAILED:
         sys.exit(1)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/sft_smoke_test.py` around lines 213-222, the loop handling job completion never calls backend.cleanup(handle), so headless services for multi-node jobs are left running; after calling backend.wait_for_completion(handle, timeout=1800) and after printing logs (the block that iterates backend.get_logs(handle)), invoke backend.cleanup(handle) for every finished job (both SUCCEEDED and FAILED) and ensure cleanup is executed before exiting on failure (i.e., call backend.cleanup(handle) prior to sys.exit(1) in the JobStatus.FAILED branch).

scripts/k8s-tests/pipeline_smoke_test.py-239-254 (1)
239-254: ⚠️ Potential issue | 🟡 Minor

Missing `backend.cleanup(handle)` after job completion.

Same issue as in `sft_smoke_test.py` — headless services for multi-node jobs won't be cleaned up. `run_sft_k8s_real.py` does call `cleanup()`.

Proposed fix

```diff
         if status == JobStatus.FAILED:
             print(f"\nERROR: Job '{name}' failed")
             sys.exit(1)
+
+        backend.cleanup(handle)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 239-254, add a cleanup call for each job handle after completion: after using backend.wait_for_completion(handle) and after printing logs from backend.get_logs(handle) (or on exception), ensure backend.cleanup(handle) is invoked so headless services are removed; update the loop over result items (variables name and handle) to call backend.cleanup(handle) in a finally block (or otherwise guaranteed path) and keep existing failure logic that sys.exit(1) when status == JobStatus.FAILED.

nemo_skills/pipeline/utils/declarative.py-446-456 (1)
446-456: ⚠️ Potential issue | 🟡 Minor

Same-second job submissions will collide due to timestamp-only suffix.

`int(time.time()) % 100000` produces the same value for all calls within the same second. If two pipelines (or a retry) submit jobs in the same second, they get identical names, causing a Kubernetes conflict. Consider mixing in a random component or an atomic counter.

Proposed fix

```diff
 import time
+import random

-    suffix = f"-{int(time.time()) % 100000:05d}"
+    suffix = f"-{int(time.time()) % 100000:05d}-{random.randint(0, 999):03d}"
     suffix_len = len(suffix)  # 6 characters
```

(Adjust the `suffix_len` comment and `max_length` math accordingly.)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 446-456, the suffix generation (variables suffix, suffix_len, base_name, max_base_len, max_length) uses int(time.time()) % 100000, which collides for same-second submissions; modify the suffix to include a high-resolution or random component (e.g., use time.time_ns() % N or append a short random hex from uuid4()/secrets.randbits()) and update suffix_len accordingly, then recalculate max_base_len = max_length - suffix_len and adjust truncation logic for base_name; ensure the comment about suffix length is corrected to reflect the new suffix size.
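A runnable sketch of the suggested suffix, standing alone and assuming nothing about the surrounding `declarative.py` code beyond the suffix format itself:

```python
import random
import re
import time

def job_suffix() -> str:
    # 5 timestamp digits plus 3 random digits: two submissions in the same
    # second still get distinct names with high probability.
    return f"-{int(time.time()) % 100000:05d}-{random.randint(0, 999):03d}"

suffix = job_suffix()
assert re.fullmatch(r"-\d{5}-\d{3}", suffix)
assert len(suffix) == 10  # the max_base_len math must now budget for 10 chars
```

Note the suffix grows from 6 to 10 characters, so any name-truncation logic keyed off the old length has to be updated in the same change.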
🧹 Nitpick comments (11)
scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)
46-46: Nit: `NCCL_DEBUG=INFO` is set in both the shell script (line 46) and the container env (line 98).

The script's `export NCCL_DEBUG=INFO` overrides the env var from the Job spec anyway, so the container-level env entry is redundant. Consider keeping it in only one place.

Also applies to: 97-98

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` at line 46, the NCCL_DEBUG setting is duplicated: the shell script contains "export NCCL_DEBUG=INFO" and the Kubernetes Job manifest also sets an env var NCCL_DEBUG in the container spec; remove the redundancy by keeping NCCL_DEBUG=INFO in only one place, either deleting the shell script export or removing the container env entry for NCCL_DEBUG in the Job spec, so the remaining location is the single source of truth for NCCL_DEBUG.

scripts/k8s-tests/image-build/import-image.yaml (1)
42-108: Privileged security context is intentional — document the trust boundary.

Trivy and Checkov flag `privileged: true`, `hostPID: true`, and the lack of `readOnlyRootFilesystem`. These are required by design:

- `privileged: true` — needed to access `/run/containerd/containerd.sock` from the container.
- `hostPID: true` — required for the `nsenter -t 1` fallback path.
- Writable root FS — needed for `apk add` at runtime.

One hardening improvement that does not require relaxed privileges: mount only the socket file rather than the entire containerd directory:

```diff
-      - name: containerd-sock
-        hostPath:
-          path: /run/containerd
+      - name: containerd-sock
+        hostPath:
+          path: /run/containerd/containerd.sock
+          type: Socket
```

This limits the host surface exposed to the container from the whole `/run/containerd/` directory to just the socket. The `--address` flag in the `ctr` invocation already points to the full path, so no other changes are needed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 42-108, change the containerd volume to mount only the socket file instead of the whole directory: update the volume named containerd-sock (and its corresponding volumeMount) so hostPath.path points to /run/containerd/containerd.sock and the mountPath is /run/containerd/containerd.sock, and set hostPath.type to Socket if supported; leave the ctr invocation (--address /run/containerd/containerd.sock) and the volumeMount name containerd-sock unchanged so the existing ctr usage continues to work.

nemo_skills/pipeline/backends/local.py (1)
130-137: Rename unused loop variable `i` to `_`.

`i` from `enumerate()` is never referenced inside the loop body. Ruff B007 flags this correctly.

✏️ Proposed fix

```diff
-        for i, (proc, container_spec) in enumerate(zip(processes, spec.containers)):
+        for _, (proc, container_spec) in enumerate(zip(processes, spec.containers)):
```

(Since the index is never used, dropping `enumerate()` entirely and writing `for proc, container_spec in zip(processes, spec.containers):` would be even cleaner.)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/local.py` around lines 130-137, the loop in the local backend uses enumerate(...) but never uses the index variable `i`; rename `i` to `_` in the for statement (for _, (proc, container_spec) in enumerate(zip(processes, spec.containers)):) to silence the unused-variable warning (Ruff B007) in the method that starts log-collector threads and appends to job._log_threads in nemo_skills.pipeline.backends.local (the block creating threading.Thread and calling self._collect_logs).

cluster_configs/example-kubernetes.yaml (1)
88-92: `NCCL_DEBUG=INFO` in the default `env_vars` will flood logs on every run.

`NCCL_DEBUG=INFO` logs a status line for every collective call; at training batch frequency, this produces megabytes of noise per job. Either comment it out (example-only) or downgrade to `WARN`.

🔧 Proposed fix

```diff
 env_vars:
   - HF_HOME=/models/hf-cache
-  - NCCL_DEBUG=INFO
-  # - CUDA_VISIBLE_DEVICES=all
+  # - NCCL_DEBUG=INFO  # Uncomment to debug NCCL connectivity issues
+  # - CUDA_VISIBLE_DEVICES=all
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cluster_configs/example-kubernetes.yaml` around lines 88-92, the default env_vars block currently sets NCCL_DEBUG=INFO, which will generate excessive log noise; change this in the example by either commenting out NCCL_DEBUG or lowering it to NCCL_DEBUG=WARN in the env_vars list (refer to the env_vars entry and the NCCL_DEBUG key) so the example config doesn't flood logs on every run.

scripts/k8s-tests/image-build/kaniko-build.sh (1)
186-202: Copy-helper pod leaks on failure due to missing cleanup trap.

With `set -euo pipefail`, any failure after the `kubectl run "copy-${name}"` call (e.g., `kubectl wait` timeout or `kubectl cp` error) will exit the function immediately, leaving the helper pod running for its full 600-second sleep. Adding a `trap` ensures cleanup:

🔧 Proposed fix

```diff
+    trap "kubectl delete pod 'copy-${name}' -n '$NAMESPACE' --ignore-not-found" EXIT
     kubectl run "copy-${name}" --image=busybox:1.36 --restart=Never \
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 186-202, the helper pod "copy-${name}" can be leaked if any command fails after kubectl run; add a cleanup trap immediately after creating the pod to always delete it on EXIT/ERR (use kubectl delete pod "copy-${name}" -n "$NAMESPACE" --ignore-not-found) so failures in kubectl wait or kubectl cp don't leave the sleep pod running; ensure the trap is set before calling kubectl wait and that the existing explicit delete at the end remains or is idempotent with the trap to avoid double-delete errors.

scripts/k8s-tests/image-build/build-and-load.sh (2)
129-129: Kaniko image pinned to `:latest` — non-reproducible builds.

`gcr.io/kaniko-project/executor:latest` can change between runs, potentially causing silent build behavior changes. Pin to a specific digest or version tag for reproducibility.

```diff
-    image: gcr.io/kaniko-project/executor:latest
+    image: gcr.io/kaniko-project/executor:v1.23.2
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` at line 129, the Kaniko executor image is pinned to the floating tag "gcr.io/kaniko-project/executor:latest", which makes builds non-reproducible; update the image reference used in the image declaration to a fixed version or digest (e.g., a specific tag like vX.Y.Z or an immutable digest `@sha256`:...) so the build uses an exact, reproducible Kaniko binary across runs.

234-255: Privileged container with host root mount — expected for containerd import but worth documenting.

The loader pod runs as `privileged: true` with the entire host filesystem mounted at `/host`. This is required for `ctr import` but represents a significant security surface. The script header documents the acceptance criteria well, but consider adding a brief inline comment noting why privileged access is needed here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 234-255, the loader pod is started with privileged: true and mounts the entire host at the volume "host-root" (mounted to /host) to allow containerd operations; add a concise inline comment next to the kubectl run/--overrides block (near the "loader" container, "securityContext": {"privileged": true} and the "host-root" volume/volumeMount) explaining that privileged access and the host-root mount are required for running ctr import against the node's containerd socket, and that this is an intentional tradeoff for the image import step.

scripts/k8s-tests/validate_sft_e2e.sh (1)
326-341: Pass-criteria checks are solid but the `nemo_skills` grep is very broad.

`grep -q "nemo_skills" "$LOG_FILE"` will match any occurrence of the string in the log, including error messages like `"FAIL: nemo_skills core import error"`. Since the "E2E SFT VALIDATION PASSED" check on line 326 already confirms success, this secondary check is redundant but harmless. Consider tightening it to match the specific success line if you want higher-fidelity validation.

```diff
-if ! grep -q "nemo_skills" "$LOG_FILE"; then
+if ! grep -q "nemo_skills version:" "$LOG_FILE"; then
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 326-341, the grep for "nemo_skills" is too broad and can match error lines; tighten the check by replacing the broad grep -q "nemo_skills" "$LOG_FILE" with a pattern that matches the specific success message emitted on successful import (or remove the check, since "E2E SFT VALIDATION PASSED" already guarantees success); update the conditional that sets PASS=false to use the new success-specific pattern and reference LOG_FILE and the existing "E2E SFT VALIDATION PASSED" check to keep behavior consistent.

nemo_skills/pipeline/backends/factory.py (1)
152-174: Registry check before built-in backends could shadow built-in names.

If someone registers a custom backend under the name `"kubernetes"`, it would take precedence over the built-in lazy import path (lines 153-154 run before lines 162-165). This is likely intentional for extensibility, but it means a malicious or buggy plugin could silently replace the real `KubernetesBackend`. Consider documenting this precedence or logging a warning when a registered backend shadows a built-in name.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/factory.py` around lines 152-174, the current factory returns _BACKEND_REGISTRY[executor] before considering built-ins, which lets a registered name shadow built-ins like "kubernetes", "slurm", "local"/"none" (symbols: _BACKEND_REGISTRY, executor, KubernetesBackend, SlurmBackend, LocalBackend). Modify the branch that checks _BACKEND_REGISTRY to detect when executor equals one of the built-in names and emit a warning (use logging.getLogger(__name__).warning or warnings.warn) indicating a registry entry is shadowing a built-in, then proceed to return the registry backend; alternatively, if you prefer built-ins to win, flip the order to check built-in names first and fall back to _BACKEND_REGISTRY, making the change around the existing registry check in factory.py.

nemo_skills/pipeline/backends/slurm.py (2)
97-99: Placeholder methods are fine for a transitional wrapper, but consider raising `NotImplementedError` instead of silently returning misleading values.

`get_status` returns `RUNNING` unconditionally, `wait_for_completion` delegates to `get_status` (so also returns `RUNNING`), and `cancel_job` returns `False`. A caller relying on these methods will get incorrect data silently. Raising `NotImplementedError` would make it explicit that these are unimplemented, matching the documented "TRANSITIONAL WRAPPER" status.

Also applies to: 156-169, 171-178, 180-188, 190-203, 205-207, 209-216
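An illustrative sketch of the suggested pattern (class and method names are taken from the review comment; this is not the actual module):

```python
class SlurmBackend:
    """Transitional wrapper: job submission works; monitoring does not yet."""

    def get_status(self, handle):
        # Fail loudly instead of returning a misleading RUNNING status.
        raise NotImplementedError("transitional wrapper: get_status not implemented")

    def cancel_job(self, handle) -> bool:
        raise NotImplementedError("transitional wrapper: cancel_job not implemented")

backend = SlurmBackend()
try:
    backend.get_status(handle=None)
except NotImplementedError as e:
    print(f"caller sees an explicit error: {e}")
```

The caller gets an immediate, self-describing failure rather than a fabricated status it might act on.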
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/slurm.py` around lines 97 - 99, Several methods in the Slurm backend are placeholders that return misleading default values; update each placeholder method (e.g., property name, get_status, wait_for_completion, cancel_job and the other unimplemented methods referenced in the review) to raise NotImplementedError instead of returning fixed values like "RUNNING" or False so callers know they are unimplemented; locate and replace the bodies of methods named get_status, wait_for_completion, cancel_job and any other transitional-wrapper methods in the SlurmBackend class to raise NotImplementedError with a brief message (e.g., "transitional wrapper: not implemented") to make the behavior explicit.
128-153: Context-managed experiment stores `task` that may reference a closed experiment.

`get_exp` is used as a context manager (line 129). After the `with` block exits, the experiment may be finalized. However, `task` (stored in `self._jobs` at line 143) was created within that context and may hold a reference to the now-closed experiment. Since `get_status`, `wait_for_completion`, and `cancel_job` are all placeholders that don't actually use `task`, this isn't immediately exploitable — but it will become a problem once those methods are implemented.

Consider documenting this constraint or restructuring so the experiment is reopened when needed.
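The hazard generalizes beyond nemo_run; a minimal self-contained illustration with toy classes (not the real `get_exp` API):

```python
class Experiment:
    """Toy stand-in for a context-managed experiment."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.closed = True  # finalized when the with-block exits

    def add_task(self):
        return Task(self)

class Task:
    def __init__(self, exp):
        self.exp = exp  # holds a reference to its parent experiment

    def status(self):
        if self.exp.closed:
            raise RuntimeError("experiment already finalized")
        return "RUNNING"

with Experiment() as exp:
    task = exp.add_task()
    assert task.status() == "RUNNING"  # fine inside the context

# Outside the context, the stored task points at a closed experiment:
try:
    task.status()
except RuntimeError as e:
    print(f"latent bug surfaces once status is really implemented: {e}")
```

Persisting only serializable identifiers (experiment name, task id) and reopening the experiment on demand avoids holding a reference across the context boundary.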
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 153, The code stores a Task object created inside the get_exp context (via add_task) into self._jobs (job_info["task"]) which may reference a finalized/closed experiment; instead, stop storing the live Task from inside the with-block and either (a) persist only serializable identifiers/metadata (e.g., experiment name, task id, container image, cmd) in job_info so later methods can re-open the experiment via get_exp(spec.name, self.config) and reconstruct or look up the Task, or (b) ensure any future uses of the stored task (in get_status, wait_for_completion, cancel_job) re-enter get_exp to reopen the experiment before touching the Task; update job_info creation around add_task/get_exp and corresponding consumer methods (get_status, wait_for_completion, cancel_job) to follow one of these approaches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/backends/integration.py`:
- Around line 269-302: run_job_and_wait currently submits a job (via get_backend
-> submit_job) and waits for completion (backend.wait_for_completion) but never
calls backend.cleanup(handle) and only returns JobStatus, leaking resources and
preventing caller cleanup; fix by either invoking backend.cleanup(handle) after
wait_for_completion (ensure cleanup runs even on exceptions/timeouts) or change
the function signature to return both (status, handle) so the caller can call
backend.cleanup(handle); update references to run_job_and_wait callers to handle
the new return value if you choose the second option.
In `@nemo_skills/pipeline/backends/local.py`:
- Around line 95-98: The constructor (__init__) silently swallows a missing
required key by using cluster_config.get("executor") when determining
use_docker; change this to use direct access (cluster_config["executor"] ==
"local") so a KeyError is raised if "executor" is absent, ensuring a loud
fail-fast behavior; update the assignment to self.use_docker in the
LocalBackend/__init__ (referencing cluster_config and use_docker) and, if
desired, wrap the access in a try/except that raises a clearer, custom error
message before populating self._jobs.
In `@scripts/k8s-tests/smoke_test_single_node.sh`:
- Around line 236-243: The interactive read prompt blocks CI; remove the prompt
and conditional and perform non-interactive cleanup instead: delete the read -p
... echo and the if [[ $REPLY =~ ... ]]; then ... fi block and either call
kubectl delete job "$JOB_NAME" -n "$NAMESPACE" --ignore-not-found
unconditionally (recommended when a cleanup trap exists) or gate deletion behind
an explicit env var (e.g., CLEANUP=true) so CI runs won’t hang; update
references to JOB_NAME and NAMESPACE and keep the existing kubectl delete
command.
- Around line 20-28: The script lacks a cleanup trap so the Kubernetes Job
(JOB_NAME) can be left running on early exits; add a cleanup function (e.g.,
cleanup_job) that deletes the Job using kubectl delete job "$JOB_NAME" -n
"$NAMESPACE" (and optionally waits) and register it with trap 'cleanup_job' EXIT
or trap 'cleanup_job ERR EXIT' after JOB_NAME and NAMESPACE are set; ensure the
trap is installed early (right after JOB_NAME/TIMEOUT_SECONDS/SCRIPT_DIR are
defined) so any premature exit will call cleanup_job.
In `@scripts/k8s-tests/validate_k8s_manifests.sh`:
- Around line 30-40: The current fallback unconditionally switches to kubectl
--validate=false which can hide real schema errors; change the loop that runs
kubectl apply --dry-run=client --validate=true -f "$file" to capture its
stderr/output and exit code into variables, and only retry with --validate=false
when the captured output matches known schema-unavailability or connectivity
patterns (e.g., "unable to retrieve the complete list", "failed to download",
"no matches for kind", "unable to connect", "connection refused", "x509",
"context deadline exceeded"); if strict validation failed for any other reason
(indicative of a real schema error like misspelled fields) print the captured
error and exit non-zero instead of falling back. Ensure you reference the
existing kubectl apply invocation and the files[] loop when implementing this
conditional retry.
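A hedged sketch of that conditional retry (the function name and error patterns are assumptions for illustration, not the script's own):

```shell
# Retry with --validate=false only when strict validation failed for
# connectivity or schema-download reasons; real schema errors stay fatal.
validate_manifest() {
  local file="$1" out rc
  out=$(kubectl apply --dry-run=client --validate=true -f "$file" 2>&1) && return 0
  rc=$?
  if grep -Eqi 'unable to connect|connection refused|x509|context deadline exceeded|failed to download|unable to retrieve' <<<"$out"; then
    # Schema unavailable: fall back to syntax-only validation.
    kubectl apply --dry-run=client --validate=false -f "$file"
  else
    # Genuine validation failure (e.g., misspelled field): surface it.
    printf '%s\n' "$out" >&2
    return "$rc"
  fi
}
```

Capturing stderr into `out` before deciding is what distinguishes "the schema could not be fetched" from "the manifest is wrong", so only the former relaxes validation.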
---
Duplicate comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Several fenced code blocks (e.g., the log snippet
beginning "Job name 'Qwen2.5_Math_7B' is not Kubernetes-compliant. Sanitized to
'qwen2-5-math-7b'." and the ASCII diagram block) are missing language
specifiers; update each open fenced block at the shown locations (lines around
the quoted log, the block spanning 701-707, and the ASCII diagram 715-758) to
use a language tag of text (```text) so they comply with MD040 and render
correctly. Ensure you add ```text at the start of each fenced block and keep the
closing ``` unchanged.
In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 91-92: Replace the optional lookup with direct dictionary access:
in the Kubernetes backend check using cluster_config.get("executor") (in the
KubernetesBackend/related initialization code) change it to
cluster_config["executor"] != "kubernetes" so a missing "executor" key raises a
KeyError and fails fast; keep the existing ValueError for the wrong value case,
ensuring the check uses the direct access expression.
- Around line 497-502: The except block currently swallows TypeError/ValueError
when parsing gpu_count, which lets malformed values (e.g., "8x") fall through
and cause RDMA injection; change the handler to treat parse failures as invalid
counts by logging a warning that includes the offending gpu_count and the
container identifier (e.g., container name or id) and then continue to skip RDMA
resource injection. Update the except clause that catches TypeError/ValueError
around the int(gpu_count) check (referencing gpu_count and the current
container/RDMA injection logic) to emit a clear log message and perform continue
instead of pass.
In `@nemo_skills/pipeline/backends/slurm.py`:
- Around line 136-139: Replace the falsy-or and hardcoded node count: set
num_gpus to the actual GPU value without treating 0 as None by using an explicit
None check on main_container.resources.gpus (e.g., num_gpus =
main_container.resources.gpus if main_container.resources.gpus is not None else
None) and set num_nodes from the job spec instead of hardcoding 1 by using
spec.num_nodes with an explicit None fallback (e.g., num_nodes = spec.num_nodes
if spec.num_nodes is not None else 1) in the call that builds the Slurm task
(the call that currently passes num_gpus=main_container.resources.gpus or None
and num_nodes=1).
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 743-748: The sequential branch currently only raises on
JobStatus.FAILED, allowing other non-SUCCEEDED terminal states to continue;
update the check after backend.wait_for_completion(handle) (references:
sequential, backend.wait_for_completion, status, JobStatus, job_name) to treat
any terminal state other than JobStatus.SUCCEEDED as a fatal error—i.e., if
status != JobStatus.SUCCEEDED then raise RuntimeError with the job_name and
status included in the message so CANCELLED/UNKNOWN/timeouts also abort the
pipeline.
- Around line 696-700: The code uses dep.get("name") when the name is required,
which masks missing-key errors; replace dep.get("name") with direct access
dep["name"] in the dict-handling branches (the occurrences handling dict
dependency in declarative.py, including the branch that raises ValueError and
the identical pattern earlier around the other referenced spot) so a KeyError is
raised immediately if "name" is absent and fail-fast behavior is preserved; keep
the existing ValueError check/raise logic unchanged if you still want a custom
message after accessing the key.
- Around line 843-846: The code sets ports = [script.port] without validation,
which can pass None or 0 into ContainerSpec.ports; change the assignment to
validate script.port (e.g., ensure it's an int and > 0) and only append it if
valid (otherwise leave ports empty or filter out invalid values). Update the
block that currently checks hasattr(script, "port") and sets ports to use a
guard like isinstance(script.port, int) and script.port > 0 (or coerce/raise if
desired) so ContainerSpec receives only valid port numbers.
- Around line 809-819: The branch checking callable(script.inline) is dead
because _prepare_command already resolves callables via prepare_for_execution;
remove that conditional and simplify the logic in the block after
_prepare_command to directly set cmd_str = script.inline (and drop the
tuple-handling path), ensuring any downstream code uses the resolved string
returned by _prepare_command.
- Line 929: The log currently slices the list container.command[:50] which
doesn't truncate the joined command string; change logging to join the command
into a string first and then truncate characters (e.g., cmd_str = "
".join(container.command); msg = cmd_str[:50] + "..." if len(cmd_str) > 50 else
cmd_str) and use LOG.info to print that truncated msg (refer to LOG and
container.command in declarative.py).
In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 124-127: The script currently appends both "NCCL WARN" and "NCCL
ERROR" to result.errors causing benign WARNs to be treated as failures; change
the parsing in the loop so lines containing "NCCL WARN" are appended to
result.warnings (the existing warnings list declared on the result object) and
only "NCCL ERROR" go to result.errors, and then update validate_result (the
validate_result function that reports PASS/FAIL) to include and print warnings
without marking the run as FAIL (i.e., surface result.warnings as warnings in
the output but only treat entries in result.errors as failures).
In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Line 31: The Dockerfile currently leaves USER root set in the base stage and
never overrides it, so the final release stage still runs as root; update the
Dockerfile to create a non-root user (e.g., add group/user via groupadd/useradd
or use existing app user) in an appropriate stage (base or release) and switch
to that user before the final image is produced (replace or override the
existing USER root in the release stage), ensuring processes in the release
stage run as the non-root account instead of root.
- Around line 38-43: Replace the plain-HTTP repo and deprecated apt-key usage:
instead of echoing an http:// URL and using apt-key adv to fetch 7fa2af80.pub,
download the NVIDIA repository public key over HTTPS, dearmor it into a keyring
(e.g. /usr/share/keyrings/nvidia-devtools-archive-keyring.gpg) using curl | gpg
--dearmor, and write the repo entry to
/etc/apt/sources.list.d/nvidia-devtools.list with the
"signed-by=/usr/share/keyrings/nvidia-devtools-archive-keyring.gpg" option; then
run apt-get update and install nsight-systems-cli as before (remove the apt-key
adv call and the old 7fa2af80.pub reference).
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 145-165: The pod spec uses hostNetwork: true for Ray; harden the
trainer container by adding a container-level securityContext to prevent
privilege escalation: set allowPrivilegeEscalation: false (and optionally drop
capabilities/runAsNonRoot if compatible) on the container named "trainer" (the
container block with image nemo-skills/nemo-rl-full and command
"/scripts/run_sft_mn.sh"); ensure this change does not conflict with GPU access
or Ray by testing, and keep the change minimal to avoid altering networking
behavior.
In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 92-129: The MULTI_NODE_TRAIN_CMD currently embeds Python via an
unsupported -c flag for torchrun; replace the inline -c approach with the same
pattern used by the single-node template: write the Python block to a temp
script (e.g., /tmp/smoke_train.py) before invoking torchrun and then call
torchrun with that script path (keep environment exports, torchrun args
--nproc_per_node/--nnodes/--node_rank/--master_addr/--master_port intact);
update MULTI_NODE_TRAIN_CMD to reference the temp file and ensure whatever code
path assembles commands writes the Python payload to the same temp file name as
_SINGLE_NODE_TRAIN_TEMPLATE does.
In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 174-178: The broad except in the log streaming around
backend.get_logs(handle) hides unexpected errors; replace the blanket "except
Exception as e" with either no catch (let errors like AttributeError/TypeError
propagate) or catch only the expected kubernetes ApiException (e.g., catch
kubernetes.client.rest.ApiException) so only known K8s API errors are handled
and other bugs surface; update the try/except around the loop using
backend.get_logs(handle) (and the print(...) logic) accordingly.
- Line 52: The pip install command in scripts/k8s-tests/run_sft_k8s_real.py
hides stderr by using "pip install -q accelerate 2>/dev/null &&"; remove the
"2>/dev/null" redirection so installation errors are visible in container logs
(e.g., use "pip install -q accelerate &&" or explicitly log stderr), ensuring
failures still propagate via the existing && chain.
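The narrowed exception handling suggested in the first comment above could look like this sketch. `stream_logs` and the fake backend are illustrative; `kubernetes.client.rest.ApiException` is the real import path in the official Python client, with a local stand-in so the snippet runs without the package installed:

```python
try:
    from kubernetes.client.rest import ApiException  # real path in the k8s client
except ImportError:
    class ApiException(Exception):
        """Stand-in so this sketch runs without the kubernetes package."""


def stream_logs(backend, handle) -> list:
    """Collect log lines, swallowing only known K8s API errors.

    AttributeError/TypeError and other real bugs still propagate.
    """
    collected = []
    try:
        for line in backend.get_logs(handle):
            collected.append(line)
            print(line)
    except ApiException as exc:
        print(f"log stream ended with K8s API error: {exc}")
    return collected
```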
In `@scripts/k8s-tests/smoke_test_multi_node.sh`:
- Around line 270-278: The interactive cleanup block using read -p that prompts
for deletion (referencing JOB_NAME, SERVICE_NAME and NAMESPACE) must be removed
because it blocks in non-interactive CI; delete the entire prompt and
conditional (the read, echo, and if [[ $REPLY =~ ... ]] ... fi) and rely on the
existing cleanup/trap logic instead so the script performs non-interactive
cleanup automatically.
- Around line 20-30: Add a cleanup function and register a trap immediately
after the variable setup so the K8s Job and Service are deleted on early exit;
implement a function (e.g., cleanup) that checks and calls kubectl delete job
"$JOB_NAME" --namespace "$NAMESPACE" and kubectl delete svc "$SERVICE_NAME"
--namespace "$NAMESPACE" (ignore not-found errors), then register it with trap
'cleanup' EXIT ERR INT TERM so any failure or termination removes the created
resources; ensure JOB_NAME and SERVICE_NAME are referenced exactly as defined in
the script.
---
Nitpick comments:
In `@cluster_configs/example-kubernetes.yaml`:
- Around line 88-92: The default env_vars block currently sets NCCL_DEBUG=INFO
which will generate excessive log noise; change this in the example by either
commenting out NCCL_DEBUG or lowering it to NCCL_DEBUG=WARN in the env_vars list
(refer to the env_vars entry and the NCCL_DEBUG key) so the example config
doesn't flood logs on every run.
In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 152-174: The current factory returns _BACKEND_REGISTRY[executor]
before considering built-ins, which lets a registered name shadow built-ins like
"kubernetes", "slurm", "local"/"none" (symbols: _BACKEND_REGISTRY, executor,
KubernetesBackend, SlurmBackend, LocalBackend). Modify the branch that checks
_BACKEND_REGISTRY to detect when executor equals one of the built-in names and
emit a warning (use logging.getLogger(__name__).warning or warnings.warn)
indicating a registry entry is shadowing a built-in, then proceed to return the
registry backend; alternatively, if you prefer built-ins to win, flip the order
to check built-in names first and fall back to _BACKEND_REGISTRY, making the
change around the existing registry check in factory.py.
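One way to implement the warn-on-shadowing option is sketched below — names like `resolve_backend` are illustrative, and the string placeholder only marks where the real factory would return the built-in backend class:

```python
import logging

LOG = logging.getLogger("backend_factory_sketch")
BUILTIN_BACKENDS = {"kubernetes", "slurm", "local", "none"}
BACKEND_REGISTRY: dict = {}


def register_backend(name: str, backend_cls) -> None:
    BACKEND_REGISTRY[name] = backend_cls


def resolve_backend(executor: str):
    """Registry entries still win, but shadowing a built-in now warns."""
    if executor in BACKEND_REGISTRY:
        if executor in BUILTIN_BACKENDS:
            LOG.warning("registered backend %r shadows a built-in backend", executor)
        return BACKEND_REGISTRY[executor]
    if executor in BUILTIN_BACKENDS:
        return f"builtin:{executor}"  # stands in for the real class lookup
    raise ValueError(f"unknown executor: {executor}")
```

Flipping the two `if` blocks gives the alternative where built-ins always win.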
In `@nemo_skills/pipeline/backends/local.py`:
- Around line 130-137: The loop in the local backend uses enumerate(...) but
never uses the index variable `i`; rename `i` to `_` in the for statement (for
_, (proc, container_spec) in enumerate(zip(processes, spec.containers)): ) to
silence the unused-variable warning (Ruff B007) in the method that starts
log-collector threads and appends to job._log_threads in
nemo_skills.pipeline.backends.local (the block creating threading.Thread and
calling self._collect_logs).
In `@nemo_skills/pipeline/backends/slurm.py`:
- Around line 97-99: Several methods in the Slurm backend are placeholders that
return misleading default values; update each placeholder method (e.g., property
name, get_status, wait_for_completion, cancel_job and the other unimplemented
methods referenced in the review) to raise NotImplementedError instead of
returning fixed values like "RUNNING" or False so callers know they are
unimplemented; locate and replace the bodies of methods named get_status,
wait_for_completion, cancel_job and any other transitional-wrapper methods in
the SlurmBackend class to raise NotImplementedError with a brief message (e.g.,
"transitional wrapper: not implemented") to make the behavior explicit.
- Around line 128-153: The code stores a Task object created inside the get_exp
context (via add_task) into self._jobs (job_info["task"]) which may reference a
finalized/closed experiment; instead, stop storing the live Task from inside the
with-block and either (a) persist only serializable identifiers/metadata (e.g.,
experiment name, task id, container image, cmd) in job_info so later methods can
re-open the experiment via get_exp(spec.name, self.config) and reconstruct or
look up the Task, or (b) ensure any future uses of the stored task (in
get_status, wait_for_completion, cancel_job) re-enter get_exp to reopen the
experiment before touching the Task; update job_info creation around
add_task/get_exp and corresponding consumer methods (get_status,
wait_for_completion, cancel_job) to follow one of these approaches.
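The explicit-placeholder approach from the first nitpick above could look like this sketch (class and method names mirror the review comment, not the full SlurmBackend interface):

```python
class SlurmBackendSketch:
    """Transitional wrapper: placeholder methods raise instead of
    returning misleading defaults like "RUNNING" or False."""

    def get_status(self, handle) -> str:
        raise NotImplementedError("transitional wrapper: not implemented")

    def wait_for_completion(self, handle, timeout=None) -> bool:
        raise NotImplementedError("transitional wrapper: not implemented")

    def cancel_job(self, handle) -> bool:
        raise NotImplementedError("transitional wrapper: not implemented")
```

Callers then fail loudly (e.g., `pytest.raises(NotImplementedError)` in tests) instead of silently acting on a fake "RUNNING" status.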
In `@scripts/k8s-tests/image-build/build-and-load.sh`:
- Line 129: The Kaniko executor image is pinned to the floating tag
"gcr.io/kaniko-project/executor:latest" which makes builds non-reproducible;
update the image reference used in the image declaration (currently "image:
gcr.io/kaniko-project/executor:latest") to a fixed version or digest (e.g., a
specific tag like vX.Y.Z or an immutable digest `@sha256:...`) so the build uses
an exact, reproducible Kaniko binary across runs.
- Around line 234-255: The loader pod is started with privileged: true and
mounts the entire host at the volume "host-root" (mounted to /host) to allow
containerd operations; add a concise inline comment next to the kubectl
run/--overrides block (near the "loader" container, "securityContext":
{"privileged": true} and the "host-root" volume/volumeMount) explaining that
privileged access and the host-root mount are required for running ctr import
against the node's containerd socket, and that this is an intentional tradeoff
for the image import step.
In `@scripts/k8s-tests/image-build/import-image.yaml`:
- Around line 42-108: Change the containerd volume to mount only the socket file
instead of the whole directory: update the volume named containerd-sock (and its
corresponding volumeMount) so hostPath.path points to
/run/containerd/containerd.sock and the mountPath is
/run/containerd/containerd.sock, and set hostPath.type to Socket if supported;
leave the ctr invocation (--address /run/containerd/containerd.sock) and the
volumeMount name containerd-sock unchanged so the existing CTR usage continues
to work.
In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 186-202: The helper pod "copy-${name}" can be leaked if any
command fails after kubectl run; add a cleanup trap immediately after creating
the pod to always delete it on EXIT/ERR (use kubectl delete pod "copy-${name}"
-n "$NAMESPACE" --ignore-not-found) so failures in kubectl wait or kubectl cp
don’t leave the sleep pod running; ensure the trap is set before calling kubectl
wait and that the existing explicit delete at the end remains or is idempotent
with the trap to avoid double-delete errors.
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml`:
- Line 46: The NCCL_DEBUG setting is duplicated: the shell script contains
"export NCCL_DEBUG=INFO" and the Kubernetes Job manifest also sets an env var
NCCL_DEBUG in the container spec; remove the redundancy by keeping
NCCL_DEBUG=INFO in only one place—either delete the shell script export or
remove the container env entry for NCCL_DEBUG in the Job spec—ensuring the
remaining location (shell export or container env) is the single source of truth
for NCCL_DEBUG.
In `@scripts/k8s-tests/validate_sft_e2e.sh`:
- Around line 326-341: The grep for "nemo_skills" is too broad and can match
error lines; tighten the check in validate_sft_e2e.sh by replacing the broad
grep -q "nemo_skills" "$LOG_FILE" with a pattern that matches the specific
success message emitted on successful import (or remove the check since "E2E SFT
VALIDATION PASSED" already guarantees success); update the conditional that sets
PASS=false to use the new success-specific pattern and reference LOG_FILE and
the existing "E2E SFT VALIDATION PASSED" check to keep behavior consistent.
093ff4c to d331162
Implement a new ComputeBackend abstraction layer to support Kubernetes as an alternative to Slurm for job execution. New files: - nemo_skills/pipeline/backends/base.py: Abstract interface and data classes (ComputeBackend, JobSpec, ContainerSpec, ResourceSpec, JobHandle, JobStatus) - nemo_skills/pipeline/backends/kubernetes.py: Native K8s Job API backend with multi-container Pod support for server+client inference pattern - nemo_skills/pipeline/backends/local.py: Local/Docker execution backend - nemo_skills/pipeline/backends/slurm.py: Transitional wrapper around existing nemo-run integration (limited functionality, see docstring) - nemo_skills/pipeline/backends/factory.py: Backend factory with fallback - nemo_skills/pipeline/backends/integration.py: Helper utilities and architecture documentation for bridging with nemo-run - cluster_configs/example-kubernetes.yaml: Example K8s cluster config with setup instructions for namespace, RBAC, PVCs, GPU scheduling - tests/test_backends.py: Comprehensive unit tests (52 passing) The Kubernetes backend maps Slurm heterogeneous jobs to multi-container Pods where containers communicate via localhost, enabling the existing server+client inference pattern to work on K8s clusters. Note: The nemo-run based get_executor() remains the recommended interface for Slurm workloads. See integration.py for migration guidance. Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Update BaseJobScript.hostname_ref() to return 'localhost' for Kubernetes backend, since containers in the same Pod share the network namespace. Changes: - scripts.py: Add 'backend' field to BaseJobScript, update hostname_ref() to check backend type and return 'localhost' for Kubernetes - declarative.py: Set backend field on scripts when running pipelines - test_declarative_pipeline.py: Add tests for Kubernetes hostname behavior This enables the server+client inference pattern to work on Kubernetes where multi-container Pods communicate via localhost instead of Slurm's SLURM_MASTER_NODE environment variable. Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Phase 3 enterprise features: - cluster_configs/kubernetes/rbac.yaml: ServiceAccount, Role, RoleBinding templates for NeMo-Skills job management permissions - cluster_configs/kubernetes/storage.yaml: PVC templates for models, data, results, and HF cache with cloud-specific examples - cluster_configs/kubernetes/image-pull-secret.yaml: Instructions for creating image pull secrets for NGC and other private registries - cluster_configs/kubernetes/README.md: Setup guide and troubleshooting Observability improvements: - kubernetes.py: Add structured logging with extra context fields (job_name, namespace, job_id, uid) for better log aggregation Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Documentation: - docs/kubernetes-migration.md: Step-by-step guide for migrating from Slurm to Kubernetes including config conversion examples, storage setup, verification steps, and troubleshooting Validation utilities: - validate_kubernetes_config(): Validates K8s cluster config - validate_slurm_config(): Validates Slurm cluster config - validate_cluster_config(): Auto-detects executor and validates These functions return a list of error messages, enabling pre-flight checks before job submission. Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Add 10 tests for cluster configuration validation utilities: - Kubernetes config validation (valid, full, missing namespace, missing containers, invalid storage) - Slurm config validation (valid, missing account, missing partition) - Generic validate_cluster_config routing tests Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Implement backend routing in Pipeline.run(): - Route to _run_kubernetes() for executor='kubernetes' - Route to _run_nemo_run() for Slurm/local (existing behavior) The _run_kubernetes() method: - Converts CommandGroups to JobSpecs for KubernetesBackend - Maps Commands to ContainerSpecs (multi-container Pods) - Handles job dependencies between pipeline stages - Supports sequential execution mode - Includes dry_run for validation without submission Add 6 tests for Pipeline Kubernetes integration: - test_pipeline_routes_to_kubernetes - test_pipeline_converts_command_group_to_job_spec - test_pipeline_multi_container_conversion - test_parse_timeout_formats - test_slurm_still_uses_nemo_run - test_kubernetes_does_not_use_nemo_run Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Refine Kubernetes backend with production-ready memory scheduling (auto-calculated requests without hard limits), K8s-compliant job name sanitization with unique suffixes, and auto-sequential mode for pipelines with job dependencies. Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- add Kubernetes backend multi-node support wiring and SFT routing updates - add k8s test scripts, manifests, image-build tooling, and runbooks - update cluster configs/RBAC/docs for multi-node service requirements - expand backend and pipeline utility tests - add pre-commit policy for multi-document k8s YAML and manifest validation hook Signed-off-by: Emanuel Scoullos <escoullos@pleiades-mgmt-01.cm.cluster> Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- replace invalid ResourceSpec memory_gb usages with memory_request_gb in backend integration helpers - enforce required executor key access (fail fast) in backend factory, slurm backend, and declarative pipeline paths - fix Kubernetes multi-node status handling so jobs only succeed when complete conditions/completions are met - support shell variants (/bin/bash, sh, /bin/sh) for NODE_RANK/RANK export injection in distributed commands - incorporate ContainerSpec.mounts into Kubernetes manifests and avoid shared mutable volume_mount lists across containers - harden timeout parsing and error behavior in Kubernetes/declarative helpers - fix Local backend Docker GPU arg quoting and timeout semantics - improve Slurm backend command quoting (shlex.join), required ssh_tunnel key access, and avoid storing stale context-managed experiment handles - update k8s validation/smoke scripts (dist init, torchrun nnodes/node_rank wiring, DNS failure checks, trap-based cleanup, unique ConfigMap names) - tighten NCCL log parser transport regexes and clean up stale/incorrect docs/examples Validation: - pre-commit: passed (`./.venv/bin/pre-commit run --all-files`) - scoped backend tests: passed (`env PYTHONPATH=. ./.venv/bin/pytest -q tests/test_backends.py ...`) - live K8s scripts: - `scripts/k8s-tests/validate_sft_e2e.sh --gpus 1 --timeout 900` passed - `scripts/k8s-tests/validate_sft_e2e_multinode.sh --nodes 2 --gpus 2 --timeout 1200` passed Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
- fix backend executor/fallback validation and run_job_and_wait cleanup - harden declarative/slurm/kubernetes behavior and NCCL warning handling - make single/multi-node smoke scripts CI-safe with automatic cleanup - tighten kubectl manifest-validation fallback rules Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
d331162 to 866fe61
Actionable comments posted: 10
🧹 Nitpick comments (34)
nemo_skills/pipeline/backends/base.py (1)
92-121: Consider validating that `command` is non-empty.
`name` and `image` are validated, but an empty `command` list would pass silently and likely cause a backend error later. If the intent is to allow container-default entrypoints, a comment would clarify.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/base.py` around lines 92 - 121, In ContainerSpec.__post_init__, validate the command field: if command is empty (e.g., falsy or len(command) == 0) raise a ValueError describing that the container command cannot be empty, or alternatively add a clear docstring/comment on ContainerSpec explaining that an empty command is allowed only to use the image's default entrypoint; update tests/usage accordingly; reference the ContainerSpec dataclass and its __post_init__ method and the command attribute when making this change.
docs/kubernetes-migration.md (1)
279-281: Minor: image tag inconsistency with the rest of the PR.
The test-pull example uses `pytorch:24.01-py3` while the runbook (kubernetes-sft-runbook.md) and other manifests reference `pytorch:25.03-py3`. Consider aligning to avoid confusion for readers following both docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-migration.md` around lines 279 - 281, The kubectl test-pull example uses image tag nvcr.io/nvidia/pytorch:24.01-py3 which is inconsistent with the rest of the PR; update the image in the kubectl run command (the line containing --image=nvcr.io/nvidia/pytorch:24.01-py3) to use the same tag used elsewhere (pytorch:25.03-py3) and verify the change aligns with references in kubernetes-sft-runbook.md and other manifests.
docs/kubernetes-sft-runbook.md (1)
170-176: Points 4 and 5 are redundant.
Both describe using podAntiAffinity to force pods onto separate nodes. Consider merging them into a single bullet that also mentions the optional nodeAffinity for heterogeneous GPU counts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-sft-runbook.md` around lines 170 - 176, Merge bullets 4 and 5 into a single item: replace the duplicate podAntiAffinity entries with one bullet that explains applying podAntiAffinity to force pods onto separate nodes and add a note that nodeAffinity can be used optionally for clusters with heterogeneous GPU counts; reference the existing implementation name KubernetesBackend._build_pod_anti_affinity() to ensure the wording matches the code and remove the redundant line so the list becomes: Headless Service, Indexed Job, distributed env vars, podAntiAffinity (+ optional nodeAffinity).
scripts/k8s-tests/check_nccl_logs.py (2)
133-137: Return type hint is underspecified.
`-> tuple` doesn't convey the structure. Per coding guidelines, type hints should be used for simple types.
♻️ Suggested fix

 def validate_result(
     result: NCCLCheckResult,
     expected_nodes: Optional[int] = None,
     expected_gpus_per_node: Optional[int] = None,
-) -> tuple:
+) -> tuple[bool, list[str]]:

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/check_nccl_logs.py` around lines 133 - 137, The return type hint for validate_result is underspecified as "-> tuple"; change it to an explicit typing annotation that matches the actual return structure (e.g. Tuple[bool, List[str]] if the function returns a success flag and a list of error messages). Update the function signature in validate_result to use typing.Tuple and typing.List (add the necessary imports from typing) and adjust any related docstring/type comments to reflect the precise return type (NCCLCheckResult remains the input type).
58-58: Unparameterized `set` type hint.
♻️ Suggested fix

-    ranks_seen: set = field(default_factory=set)
+    ranks_seen: set[int] = field(default_factory=set)

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/check_nccl_logs.py` at line 58, The field ranks_seen uses an unparameterized set type; update its annotation to a parameterized typing.Set (e.g., Set[int] or Set[str] depending on stored rank type) and keep the default_factory as an empty set, e.g. change ranks_seen: set = field(default_factory=set) to ranks_seen: Set[int] = field(default_factory=set) and add the corresponding import from typing (from typing import Set) if missing so the type hint is explicit and parameterized.
scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)
91-116: Consider adding minimal security hardening to the pod spec.
Static analysis (Trivy KSV-0014/KSV-0118, Checkov CKV_K8S_20/CKV_K8S_23) flags the missing security context. While this is a test manifest and GPU training jobs genuinely need filesystem writes, adding `allowPrivilegeEscalation: false` is a low-cost improvement that doesn't break functionality.
♻️ Suggested addition

       - name: trainer
         image: nemo-skills/nemo-rl-full:latest
         imagePullPolicy: Never
+        securityContext:
+          allowPrivilegeEscalation: false
         env:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` around lines 91 - 116, Add a minimal securityContext to the pod/containers to harden the manifest: set allowPrivilegeEscalation: false at the container level (container name "trainer") and optionally add runAsNonRoot/allowPrivilegeEscalation combination in the pod spec to keep GPU writable behavior; update the spec under the "trainer" container (and/or top-level pod spec) to include securityContext with allowPrivilegeEscalation: false so the job still runs the existing command ("bash", "/scripts/run_sft.sh") and mounts but prevents privilege escalation.
nemo_skills/pipeline/utils/cluster.py (1)
62-77: Duplicate Slurm-format description in docstring.
Lines 66–68 describe the simple duration format and lines 70–72 repeat the original Slurm-format description that's already on lines 66–68 (the "Slurm format:" bullet). The two blocks overlap, making the docstring read redundantly.
♻️ Suggested cleanup
""" - Parse a timeout string into a timedelta object. - - Supports both: - - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds", - "days-hours", "days-hours:minutes", "days-hours:minutes:seconds" - - Simple duration format used in K8s configs: "6h", "30m", "45s", "2d" - - Time format for SLURM: "minutes", "minutes:seconds", "hours:minutes:seconds", - "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" + Parse a timeout string into a timedelta object. + + Supports: + - Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds", + "days-hours", "days-hours:minutes", "days-hours:minutes:seconds" + - Simple duration format used in K8s configs: "6h", "30m", "45s", "2d" + https://slurm.schedmd.com/sbatch.html#OPT_time """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/cluster.py` around lines 62 - 77, Docstring contains a duplicated description of the Slurm-format timeout; update the docstring in nemo_skills/pipeline/utils/cluster.py to remove the redundant block so it only lists the two supported formats once (keep one "Slurm format: ..." bullet and one "simple duration format" bullet), ensuring references to _parse_simple_duration_timeout and the subsequent code that normalizes value remain accurate and unchanged.
scripts/k8s-tests/sft_smoke_test.py (1)
209-211: Second backend instance created for log retrieval.
`Pipeline.run()` already creates a backend internally. Creating another one here duplicates initialization (K8s API client, auth, etc.). Consider having `Pipeline.run()` return or expose its backend, or storing it as an attribute.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/sft_smoke_test.py` around lines 209 - 211, The test is creating a second backend via get_backend(cluster_config) which duplicates initialization; instead remove the extra backend creation and use the backend produced by the Pipeline instance (either by accessing a Pipeline attribute like pipeline.backend or by changing Pipeline.run() to return its backend and capturing that value). Locate usages of Pipeline.run() and replace the standalone get_backend call and the backend variable with the pipeline-provided backend (referencing Pipeline.run() and get_backend to find the code to change).
scripts/k8s-tests/image-build/import-image.yaml (1)
42-47: Privileged + hostPID are required for this use case.
The static analysis tools flag `hostPID: true` and `privileged: true`. These are necessary for the `nsenter` fallback (line 77) and direct containerd socket access. Since this is a development/test utility (not a production workload), the elevated privileges are acceptable. Consider adding an inline comment noting why these are needed, to prevent future reviewers from removing them.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/import-image.yaml` around lines 42 - 47, Add an inline YAML comment next to hostPID: true and under the import container's securityContext (privileged: true) explaining these elevated settings are intentional for this development/test utility because the fixture needs host PID namespace for the nsenter fallback and privileged access for direct containerd socket operations; place the comment near the container entry (container name "import") so future reviewers understand they are required and not accidental.
cluster_configs/kubernetes/storage.yaml (1)
1-1: Copyright year inconsistency.
This file uses `2024` while other new files in this PR use `2026`. Minor nit — align if the file was authored as part of this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cluster_configs/kubernetes/storage.yaml` at line 1, Update the copyright year in the header comment that reads "# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved." to match the rest of the PR (change 2024 to 2026) so the file header is consistent with other new files.
cluster_configs/kubernetes/rbac.yaml (1)
72-75: `pods/log` likely only needs `get` — `list` and `watch` are unnecessary.
The `kubectl logs` equivalent only requires the `get` verb on `pods/log`. The `list` and `watch` verbs are not used for log streaming and grant slightly broader access than needed.
Suggested tightening

   # Pod logs - required for log streaming
   - apiGroups: [""]
     resources: ["pods/log"]
-    verbs: ["get", "list", "watch"]
+    verbs: ["get"]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cluster_configs/kubernetes/rbac.yaml` around lines 72 - 75, The RBAC rule granting access to the "pods/log" resource is overly permissive; update the rule that currently lists resources: ["pods/log"] and verbs: ["get", "list", "watch"] so it only includes verbs: ["get"] for "pods/log", removing "list" and "watch" to tighten permissions.
scripts/k8s-tests/manifests/multi-node-sft-test.yaml (2)
69-81: Headless Service selector uses only the `job-name` label.
The Service selector (line 80) matches `job-name: sft-multi-node`, which is correct for this isolated test. Just note that the smoke test scripts use a dual-label selector (`app: nemo-skills` + `job-name`) which is a slightly more specific pattern. Consistency across manifests would help if these are used as reference examples.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 69 - 81, The Service "sft-multi-node-workers" currently selects only on label "job-name: sft-multi-node"; update its spec.selector to include the additional label "app: nemo-skills" to match the dual-label pattern used by the smoke test scripts, and ensure the corresponding Pods/Jobs that back this Service also include "app: nemo-skills" in their metadata.labels (e.g., update the PodTemplate/Job spec that creates the sft-multi-node pods) so the headless Service can resolve the same set of endpoints consistently.
118-132: DNS wait loop lacks a post-loop failure check.
Unlike `smoke_test_multi_node.sh` (lines 155–158), the DNS loop here falls through silently if resolution never succeeds after 60 retries. `torchrun` will then fail with a less obvious connection error. Adding an explicit check improves debuggability.
Suggested addition after the DNS loop

 for i in $(seq 1 60); do
   getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local && break
   echo "  retry $i..."
   sleep 2
 done
+if ! getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local > /dev/null 2>&1; then
+  echo "FAIL: Master DNS not resolvable after 120s"
+  exit 1
+fi
 torchrun --nproc_per_node=2 --nnodes=2 \

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml` around lines 118 - 132, The DNS wait loop that uses getent hosts to wait for sft-multi-node-0 may fall through silently; after the for loop finishes, add an explicit check that the host was resolved (e.g., re-run getent hosts sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local or test a flag) and if it failed, print a clear error including NODE_RANK and MASTER_ADDR and exit non-zero so torchrun is not started; modify the script block around the NODE_RANK export / getent loop (the section that sets NODE_RANK from JOB_COMPLETION_INDEX and then calls torchrun) to perform this post-loop failure check and exit with a descriptive message when resolution did not succeed.
scripts/k8s-tests/smoke_test_multi_node.sh (2)
252-254: `kubectl wait` failure conflates timeout with job failure.

If the job times out (exceeds `activeDeadlineSeconds`) or the `kubectl wait` itself times out, both paths set `JOB_STATUS="failed"`. Unlike the single-node script (lines 207–213 there), there's no follow-up check here to distinguish timeout from actual pod failure. Consider adding the same `kubectl get job ... -o jsonpath='{.status.failed}'` check for consistent diagnostic output.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 252 - 254, The current kubectl wait invocation sets JOB_STATUS="failed" for any non-success path, conflating a wait timeout with actual job pod failures; update the post-wait logic around the "kubectl wait --for=condition=complete ... job/$JOB_NAME" invocation to, on failure, run a follow-up check using "kubectl get job $JOB_NAME -n $NAMESPACE -o jsonpath='{.status.failed}'" (and/or inspect ".status.conditions" / ".status.active") to decide whether to set JOB_STATUS="timeout" (or preserve a distinct timeout indicator) versus "failed" based on the job's failed count or condition reasons; locate the logic controlling JOB_STATUS and add this conditional check referencing JOB_NAME, JOB_STATUS, NAMESPACE and the kubectl get job JSONPath to distinguish timeout from actual pod failure.
123-133: Misleading `WORLD_SIZE` env var — contains node count, not total ranks.

The env var `WORLD_SIZE` is set to `${NUM_NODES}` (line 129), but `WORLD_SIZE` conventionally means total ranks (nodes × GPUs). While torchrun will override `WORLD_SIZE` to the correct total before the Python script runs, the value is printed in the pre-torchrun echo at line 143 (`WORLD_SIZE: \${WORLD_SIZE} (nodes)`), which could confuse debugging. Consider renaming to `NUM_NODES` or `NNODES` to avoid shadowing the conventional meaning.

Suggested rename:

```diff
   - name: WORLD_SIZE
-    value: "${NUM_NODES}"
+    value: "${WORLD_SIZE}"
```

Or rename to avoid ambiguity:

```diff
-  - name: WORLD_SIZE
-    value: "${NUM_NODES}"
+  - name: NUM_NODES
+    value: "${NUM_NODES}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_multi_node.sh` around lines 123 - 133, The env var WORLD_SIZE is misleading because it contains the node count (NUM_NODES) not total ranks; rename it to NUM_NODES (or NNODES) in the Kubernetes env block and update all script references (the env entry name WORLD_SIZE, the pre-torchrun echo that prints WORLD_SIZE: ${WORLD_SIZE} (nodes), and any other uses) so the container env conveys node-count clearly while letting torchrun still override WORLD_SIZE at runtime; ensure you change the env key in the env: list and corresponding variable expansions throughout the script to NUM_NODES (or NNODES) and verify no logic relies on the old WORLD_SIZE name.scripts/k8s-tests/smoke_test_single_node.sh (1)
133-139: Exact floating-point comparison is fragile — use tolerance like the multi-node script.

Line 138 uses `assert tensor.item() == expected`, while the multi-node script (`smoke_test_multi_node.sh` line 211) uses `abs(tensor.item() - expected) < 1e-5`. Although the sum of small integers should be exact in float32, using a tolerance is safer and more consistent across the test suite.

Suggested fix:

```diff
-assert tensor.item() == expected, f"All-reduce failed: got {tensor.item()}, expected {expected}"
+assert abs(tensor.item() - expected) < 1e-5, f"All-reduce failed: got {tensor.item()}, expected {expected}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 133 - 139, Replace the fragile exact float equality in the all-reduce check: instead of asserting tensor.item() == expected, use a tolerance-based comparison (e.g., abs(tensor.item() - expected) < 1e-5) to validate the result of dist.all_reduce on tensor and keep the assertion message (referencing tensor, expected, and dist.all_reduce) for clarity and consistency with the multi-node test.scripts/k8s-tests/image-build/build-nemo-rl.yaml (1)
84-84: Editable install (`-e .`) in a container image is unnecessary — prefer regular install.

`pip install -e .` creates a symlink-based install pointing to `/opt/NeMo-Skills/`. In a container image this provides no benefit over a regular install and creates a fragile dependency on the source directory remaining intact. Use a standard install for image builds.

Suggested fix:

```diff
-RUN pip install --no-cache-dir -e .
+RUN pip install --no-cache-dir .
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 84, Replace the editable install invocation in the image build step: the RUN pip install --no-cache-dir -e . line should be changed to a standard non-editable install (e.g., RUN pip install --no-cache-dir .) so the container does not rely on symlinked source files; update the single RUN step that contains "pip install -e ." to use the regular install flag and verify any build-time dependencies still install correctly.docs/kubernetes-status.md (2)
56-59: Hardcoded source line numbers will go stale quickly.

References like `declarative.py:641-647` and `integration.py:57-92` will become inaccurate after any refactor. Consider referencing by function/class name or using stable anchors (e.g., TODO tags with searchable identifiers) instead.

Also applies to: 71-81
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-status.md` around lines 56 - 59, Replace fragile hardcoded source line references (e.g., "declarative.py:641-647" and "integration.py:57-92") with stable identifiers: refer to the actual function or class names that implement the behavior (e.g., the function that enforces job dependencies in declarative.py and the integration test/class in integration.py) or add searchable anchors/TODO tags (unique IDs) in the source that the docs can point to; update the doc entries for the CLI examples (ns nemo_rl sft, ns start_server --create_tunnel) and the table rows mentioned so they reference those function/class names or anchor IDs instead of line numbers, and ensure the same change is applied to the other affected block (lines referenced as 71-81).
169-178: Remove the placeholder ISSUE-001.

This placeholder with all-TBD fields adds no actionable information. The template above it (lines 139–165) is sufficient. Keeping a blank placeholder risks becoming permanently stale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-status.md` around lines 169 - 178, Remove the placeholder issue block titled "### [ISSUE-001] Placeholder for First Issue" (the entire block with Severity/Category/Discovered/Status all set to TBD); delete that header and its subsequent description lines so only the real template (the section above) remains and no empty/stale ISSUE-001 entry is left in the document.nemo_skills/pipeline/nemo_rl/sft.py (1)
199-219: Add type hints to the `_run_sft_kubernetes` signature.

The function has many parameters without type hints. For consistency with the coding guidelines and readability, consider adding basic type hints.

Suggested signature:

```diff
 def _run_sft_kubernetes(
-    cluster_config,
-    train_cmd,
-    expname,
-    output_dir,
-    log_dir,
-    num_gpus,
-    num_nodes,
-    dependent_jobs,
-    partition,
-    final_hf_path,
-    conversion_step,
-    average_steps,
-    remove_checkpoints_after_average,
-    backend,
-    max_position_embeddings,
-    installation_command,
-    dry_run,
-    run_after,
-    skip_hf_home_check=None,
+    cluster_config: dict,
+    train_cmd: str,
+    expname: str,
+    output_dir: str,
+    log_dir: str,
+    num_gpus: int,
+    num_nodes: int,
+    dependent_jobs: int,
+    partition: str,
+    final_hf_path: str,
+    conversion_step: str,
+    average_steps: str,
+    remove_checkpoints_after_average: bool,
+    backend: str,
+    max_position_embeddings: int,
+    installation_command: str,
+    dry_run: bool,
+    run_after: list,
+    skip_hf_home_check: bool = None,
 ):
```

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/nemo_rl/sft.py` around lines 199 - 219, The _run_sft_kubernetes function signature lacks type hints; update it to annotate simple parameter types and the return type for clarity and consistency (e.g., annotate cluster_config as dict, train_cmd as str or list[str], expname/output_dir/log_dir/run_after/installation_command/final_hf_path as Optional[str], num_gpus/num_nodes/average_steps/conversion_step as int, dependent_jobs as Optional[list[str]] or list[str], partition/backend as Optional[str], max_position_embeddings as Optional[int], remove_checkpoints_after_average/dry_run/skip_hf_home_check as bool, and return type as None or an appropriate type). Apply these annotations directly on the _run_sft_kubernetes signature and update any imports (typing.Optional, typing.List) if needed.scripts/k8s-tests/image-build/kaniko-build.sh (3)
37-37: `REPO_ROOT` is defined but never used.

ShellCheck SC2034. Remove or export if intended for external use.

Fix:

```diff
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 37, REPO_ROOT is declared but unused (ShellCheck SC2034); either delete the REPO_ROOT assignment (remove the line using REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)") or export it if callers expect it externally (change to export REPO_ROOT=...), and then run a quick search for REPO_ROOT and SCRIPT_DIR references to confirm no other code relies on it before committing.
106-109: `node_selector` variable value is unused — only its non-empty status matters.

Line 108 builds a JSON string for `node_selector`, but the heredoc at line 135 uses `${node_selector:+nodeName: ${NODE}}`, which only checks if `node_selector` is non-empty and then directly uses `$NODE`. The JSON value is never consumed. Simplify by using a boolean flag or just checking `$NODE` directly.

Suggested fix:

```diff
-  # Node selector
-  local node_selector=""
-  if [ -n "$NODE" ]; then
-    node_selector="\"nodeName\": \"$NODE\","
-  fi
 ...
-      ${node_selector:+nodeName: ${NODE}}
+      ${NODE:+nodeName: ${NODE}}
```

Also applies to: 133-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 106 - 109, The node_selector variable is built as a JSON fragment but never used; the heredoc uses the parameter expansion ${node_selector:+nodeName: ${NODE}} so only the non-empty status matters. Simplify by removing node_selector and change conditionals to check NODE directly (or introduce a simple boolean flag like has_node), and update the heredoc insertion to use ${NODE:+nodeName: ${NODE}} (or ${has_node:+nodeName: ${NODE}}) and remove any code that constructs node_selector to avoid dead/unused variables; adjust all similar occurrences (including the other spot referenced) to the same pattern.
144-144: Kaniko executor image uses unpinned `:latest` tag.

`gcr.io/kaniko-project/executor:latest` can break builds when Kaniko releases a new version with breaking changes. Pin to a specific digest or version tag for reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` at line 144, Replace the unpinned Kaniko executor image reference "gcr.io/kaniko-project/executor:latest" with a fixed, reproducible identifier by specifying either a stable version tag (e.g., "gcr.io/kaniko-project/executor:vX.Y.Z") or an immutable image digest ("gcr.io/kaniko-project/executor@sha256:...") in the same place where the image is declared so builds don't break when Kaniko publishes new releases.docs/kubernetes-nemo-rl-sft-runbook.md (1)
75-84: Add language specifiers to fenced code blocks.

Three code blocks (lines 75, 116, 137) are missing language tags. Use ```text for the log output and the ASCII diagram to satisfy MD040 and improve rendering.

Also applies to: 116-127, 137-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-nemo-rl-sft-runbook.md` around lines 75 - 84, The three fenced code blocks containing plain log/ASCII output (the block starting with "▶ Setting up data... ▶ Setting up compute cluster... ▶ Setting up model...", the subsequent multi-line run/log block showing steps like "Setting up model... Qwen2.5-0.5B loads via DTensorPolicyWorkerV2" and the ASCII diagram block) must include a language specifier for Markdown linting; change their fences from ``` to ```text so they render as plain text and satisfy MD040 (e.g., replace the three code fences that currently contain run logs/ASCII art with ```text).scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s (1)
93-96: Silent error suppression for `prefetch_venvs.py` and `generate_fingerprint.py`.

Both lines use `|| echo "... not critical, continuing"` to swallow failures. If `prefetch_venvs.py` fails, the resulting image will have missing Ray venvs, causing silent runtime failures later. Consider at minimum logging a warning to stderr, or letting the build fail so the issue is caught early.

If these are truly optional, document why they can be skipped (e.g., "only needed for offline mode") rather than blanket-suppressing errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s` around lines 93 - 96, The RUN lines silently swallow failures for prefetch_venvs.py and generate_fingerprint.py which can hide serious build/runtime issues; either remove the "|| echo ..." so the build fails on error, or replace the suppression with an explicit stderr warning and a clear comment explaining optionality (e.g., why prefetch_venvs.py can be skipped such as "only needed for offline mode" and what missing Ray venvs imply). Update the RUN invocations referencing UV_LINK_MODE and nemo_rl/utils/prefetch_venvs.py and tools/generate_fingerprint.py to either fail-fast or emit a visible stderr warning and add a short inline comment documenting the rationale if you intentionally allow continuation.docs/kubernetes-guide.md (1)
46-74: Quick Start uses unpinned `:latest` image tags.

The quick start example uses `vllm/vllm-openai:latest` and `nvcr.io/nvidia/nemo:24.07`. While `nemo:24.07` is pinned, the vLLM image is not. Consider using a pinned version in the quick start to avoid breaking changes for users following the guide.
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-guide.md` around lines 46 - 74, Replace the unpinned vLLM image tag vllm/vllm-openai:latest in the cluster config with a specific, tested version (e.g., vllm/vllm-openai:<stable-version>) to avoid breaking changes; update the example cluster config (the YAML block creating the cluster config and the server-type/server image references) and add a short note indicating the chosen pinned version and why it was pinned, keeping nvcr.io/nvidia/nemo:24.07 unchanged if you want to keep it pinned.scripts/k8s-tests/pipeline_smoke_test.py (1)
50-55: Remove unused `noqa` directive.

Ruff reports that the `noqa: E402` on line 50 is unnecessary since the rule isn't enabled.

```diff
-from nemo_skills.pipeline.utils.declarative import (  # noqa: E402
+from nemo_skills.pipeline.utils.declarative import (
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 50 - 55, Remove the unnecessary noqa directive from the import block: eliminate " # noqa: E402" from the line importing Command, CommandGroup, HardwareConfig, and Pipeline (the import statement that references nemo_skills.pipeline.utils.declarative) so the import reads normally without the trailing noqa comment.nemo_skills/pipeline/backends/slurm.py (1)
128-141: Redundant conditional for `requested_gpus`.

`ResourceSpec.gpus` is typed as `int` with default `0` — it can never be `None`. The conditional `if main_container.resources.gpus is not None else None` always evaluates to the first branch.

Simplification:

```diff
-requested_gpus = main_container.resources.gpus if main_container.resources.gpus is not None else None
+requested_gpus = main_container.resources.gpus
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 141, requested_gpus is computed with a redundant None-check because ResourceSpec.gpus is always an int (default 0); replace the conditional in the submission block so requested_gpus is assigned directly from main_container.resources.gpus and pass that to add_task (referencing main_container, resources.gpus, requested_gpus, and add_task) — remove the unnecessary "if ... is not None else None" logic.nemo_skills/pipeline/backends/factory.py (1)
112-141: Broad `except Exception` catches `ValueError` for unknown executors, masking clear errors with fallback noise.

When `_create_backend` raises `ValueError("Unknown executor 'kuberntes'")` due to a typo, this catch block logs a misleading "Failed to initialize" warning and potentially attempts a fallback before re-raising. The user sees confusing fallback chatter before the actual error.

Consider narrowing the catch to `RuntimeError` (initialization/health failures) and letting `ValueError` (invalid config) propagate immediately.

Suggested fix:

```diff
 try:
     backend = BackendFactory._create_backend(executor, primary_config)
     # Health check
     if backend.health_check():
         LOG.info(f"Successfully initialized {executor} backend")
         return backend
     else:
         raise RuntimeError(f"{executor} backend health check failed")
-except Exception as e:
+except RuntimeError as e:
     LOG.warning(f"Failed to initialize {executor} backend: {e}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/factory.py` around lines 112 - 141, The try/except around BackendFactory._create_backend is too broad and hides ValueError from invalid executor names; change the except to catch only RuntimeError so initialization/health failures are handled and fallback logic (using fallback_executor and backend.health_check()) runs, but let ValueError (and other configuration errors) propagate immediately without attempting fallback—i.e., remove or avoid catching Exception from BackendFactory._create_backend and only handle RuntimeError for the fallback path.nemo_skills/pipeline/backends/local.py (1)
232-267: Log follow mechanism relies on the CPython GIL for thread safety.

The `job.logs[container]` list is written by `_collect_logs` threads (via `append`) and read by `get_logs` (via `len` and index access). This is safe under CPython's GIL but not formally guaranteed by the Python spec. Acceptable for a local dev/test backend, but worth noting if this is ever ported to other Python implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/local.py` around lines 232 - 267, The log follow logic reads and iterates job.logs (in get_logs) while _collect_logs appends to it from another thread, which relies on the CPython GIL; make access explicit and thread-safe by adding a per-job lock (e.g., job.logs_lock) and acquiring it whenever reading or writing job.logs: update get_logs (function get_logs, references to job.logs and JobStatus.RUNNING) to acquire job.logs_lock around calls to job.logs.get(), len(), and index access, and update the log-producing code (_collect_logs or wherever append is done) to acquire the same job.logs_lock before appending.scripts/k8s-tests/image-build/build-and-load.sh (1)
128-129: Mutable `:latest` tag for Kaniko executor image.

`gcr.io/kaniko-project/executor:latest` is a mutable tag — successive runs may use different Kaniko versions, making builds non-reproducible and potentially introducing breaking changes.

Consider pinning to a specific digest or release tag (e.g., `v1.23.2`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 128 - 129, The Kaniko container image uses the mutable tag 'gcr.io/kaniko-project/executor:latest' (the "kaniko" container definition) which can change between runs; replace that image reference with a fixed release tag or immutable digest (for example 'gcr.io/kaniko-project/executor:v1.23.2' or the image@sha256:... digest) to ensure reproducible builds and pin the Kaniko version used by the build-and-load.sh pipeline.nemo_skills/pipeline/backends/kubernetes.py (1)
700-716: Node selector heuristic relies on pool name containing "gpu"/"cpu".

The substring matching `"gpu" in pool_name.lower()` / `"cpu" in pool_name.lower()` is fragile — a pool named `"bigcpu-with-gpu-label"` would match both branches. If users don't follow this naming convention, GPU jobs could land on CPU pools or vice versa.

Consider documenting this convention clearly in the cluster config example, or allowing explicit pool-to-resource mapping in config.
nemo_skills/pipeline/utils/declarative.py (2)
682-714: Dependency resolution logic is duplicated between `_run_nemo_run` and `_run_kubernetes`.

The dependency parsing and resolution block (lines 682–714) is structurally identical to lines 503–557 in `_run_nemo_run`, with only minor behavioral differences (e.g., K8s skips external string deps). This duplication increases maintenance burden — any future fix to dependency handling must be applied in both places.

Consider extracting a shared `_resolve_dependencies(job_spec, job_name_to_handle, ...)` helper that returns the resolved dependency objects, with the backend-specific behavior (external deps handling) parameterized by executor type.
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 682 - 714, The dependency parsing block duplicated in _run_kubernetes and _run_nemo_run should be extracted into a single helper (e.g. _resolve_dependencies(job_spec, job_name_to_handle, original_job_name, run_after, allow_external=True)) that returns the dependency_handles list and logs; update both _run_kubernetes and _run_nemo_run to call this helper, passing allow_external=False (or executor="kubernetes") for _run_kubernetes so string/external deps are skipped with a warning and allow_external=True (or executor="nemo") for _run_nemo_run to preserve existing behavior; ensure the helper validates dict deps (raises the same ValueError when name missing or job not processed) and preserves the original logging messages.
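The shared helper could look roughly like this. All names and the dependency shapes (`{"name": ...}` dicts vs. external string ids) are assumptions for illustration, not the module's actual API:

```python
def resolve_dependencies(deps, job_name_to_handle, allow_external: bool = True):
    """Illustrative shared resolver for both backends.

    deps: list of handle-dict entries ({"name": ...}) or external string ids.
    allow_external=False models the K8s path, which skips external string deps.
    Returns (resolved, skipped).
    """
    resolved, skipped = [], []
    for dep in deps or []:
        if isinstance(dep, dict):
            name = dep.get("name")
            if not name or name not in job_name_to_handle:
                raise ValueError(f"Dependency {dep!r} references a job that was not processed")
            resolved.append(job_name_to_handle[name])
        elif allow_external:
            resolved.append(dep)   # external dependency id, passed through
        else:
            skipped.append(dep)    # K8s path: external string deps unsupported

    return resolved, skipped

handles = {"prep": "handle-prep"}
print(resolve_dependencies([{"name": "prep"}, "slurm-123"], handles, allow_external=False))
# (['handle-prep'], ['slurm-123'])
```

Each caller would then only log or warn about `skipped` according to its own semantics.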
790-807: Multi-node semantics: all containers run on all nodes with the maximum `num_nodes`.

Taking the max `num_nodes` across groups (line 806) means that if one group requests 2 nodes and another requests 1, all containers (including those from the 1-node group) will run on both nodes in the Indexed Job. This is fine for distributed training containers but may be surprising for auxiliary services (sandbox, client) that don't need multi-node replication.

This is a reasonable K8s simplification given that Indexed Jobs use a single pod template, but worth documenting if mixed `num_nodes` across groups becomes a supported pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 790 - 807, The code takes the maximum num_nodes across groups (variable num_nodes computed from group.hardware.num_nodes) which causes all containers to run on every node in the Kubernetes IndexedJob pod template; add an explicit comment/docstring near the loop over groups and/or emit a runtime warning/log when groups request different num_nodes values (compare each group.hardware.num_nodes against the current max) so callers are informed that auxiliary groups will be replicated to the max, and document this IndexedJob semantic in the declarative mapping (referencing groups, num_nodes, and group.hardware.num_nodes).
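A warning on mismatched node counts is cheap to add. The sketch below reduces groups to `(name, num_nodes)` pairs for illustration; the real objects carry full hardware configs:

```python
import logging

LOG = logging.getLogger(__name__)

def effective_num_nodes(groups) -> int:
    """Take the max num_nodes across groups, warning when they disagree.

    `groups` is a list of (name, num_nodes) pairs -- an illustrative stand-in
    for the declarative group objects.
    """
    counts = dict(groups)
    num_nodes = max(counts.values())
    mismatched = [name for name, n in counts.items() if n != num_nodes]
    if mismatched:
        LOG.warning(
            "Groups %s requested fewer nodes; their containers will still be "
            "replicated to %d nodes by the Indexed Job pod template",
            mismatched, num_nodes,
        )
    return num_nodes

print(effective_num_nodes([("trainer", 2), ("sandbox", 1)]))  # 2
```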
Actionable comments posted: 12
🧹 Nitpick comments (22)
scripts/k8s-tests/image-build/kaniko-build.sh (1)
36-37: `REPO_ROOT` (and by extension `SCRIPT_DIR`) is unused — dead code.

`REPO_ROOT` is never referenced after its definition, and `SCRIPT_DIR` is only used to define `REPO_ROOT`. ShellCheck SC2034 confirms this.

✂️ Proposed fix

```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/kaniko-build.sh` around lines 36 - 37, Remove the dead variables SCRIPT_DIR and REPO_ROOT from kaniko-build.sh: these are defined but never used (SCRIPT_DIR only serves to compute REPO_ROOT). Delete the two lines that set SCRIPT_DIR and REPO_ROOT (the symbols SCRIPT_DIR and REPO_ROOT) so there is no unused-variable noise and ShellCheck SC2034 is resolved; if the script actually needs the repository path later, instead compute and use it where required rather than defining them unused at the top.nemo_skills/pipeline/utils/cluster.py (1)
63-73: Remove the duplicate Slurm format description from the docstring.

Lines 65–68 already enumerate the Slurm formats under the new "Supports both:" section; lines 70–72 repeat the same formats verbatim. Keep the SLURM link and drop the redundant text.

✂️ Proposed cleanup

```diff
 """
-Parse a timeout string into a timedelta object.
-
-Supports both:
-- Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
-  "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
-- Simple duration format used in K8s configs: "6h", "30m", "45s", "2d"
-
-Time format for SLURM: "minutes", "minutes:seconds", "hours:minutes:seconds",
-"days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"
-https://slurm.schedmd.com/sbatch.html#OPT_time
+Parse a timeout string into a timedelta.
+
+Supports:
+- Slurm format: "minutes", "minutes:seconds", "hours:minutes:seconds",
+  "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
+  https://slurm.schedmd.com/sbatch.html#OPT_time
+- Simple duration: "6h", "30m", "45s", "2d"
 """
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/cluster.py` around lines 63 - 73, The docstring for the timeout parsing utility (in the function/method that describes SLURM and K8s formats—look for the parse_timeout or similar function in cluster.py) repeats the SLURM format list twice; remove the redundant second repetition (the lines that repeat "Time format for SLURM: ..." and the list that follows) while keeping the initial "Supports both:" section and retaining the SLURM reference link (https://slurm.schedmd.com/sbatch.html#OPT_time) so the docstring only lists the formats once and remains accurate and concise.scripts/k8s-tests/validate_sft_e2e_multinode.sh (1)
196-207: `kubectl wait --for=condition=complete` blocks for the full timeout when the job fails.

If a pod crashes immediately, the job transitions to the `Failed` condition (not `Complete`), and `kubectl wait` continues polling until `${TIMEOUT}s` (default 900 s) before exiting non-zero. Use `--for=condition=failed` OR watch both conditions to get an early exit on failure (supported since Kubernetes 1.23):

♻️ Proposed improvement

```diff
-if ! kubectl wait --for=condition=complete --timeout="${TIMEOUT}s" "job/$JOB_NAME" -n "$NAMESPACE"; then
+if ! kubectl wait \
+  --for=condition=complete \
+  --for=condition=failed \
+  --timeout="${TIMEOUT}s" \
+  "job/$JOB_NAME" -n "$NAMESPACE"; then
```

Note: Multiple `--for` conditions act as OR and require Kubernetes ≥ 1.23. After the wait, check which condition was reached if you need to distinguish failure from timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh` around lines 196 - 207, The script's kubectl wait call blocks until the full ${TIMEOUT} if the Job fails because it only waits for condition=complete; update the wait invocation that references JOB_NAME and TIMEOUT to include both conditions (e.g., --for=condition=complete --for=condition=failed) so the command returns as soon as the Job either completes or fails (Kubernetes ≥1.23 required), and after the wait check the Job status (inspect the job object or use kubectl get job/$JOB_NAME -o jsonpath='{.status}') to distinguish failure vs timeout and proceed to print pod logs and exit appropriately.nemo_skills/pipeline/backends/base.py (1)
116-121: Consider validating that `command` is non-empty.

`ContainerSpec.__post_init__` validates `name` and `image` but doesn't check `command`. An empty command list (`[]`) would produce an invalid container spec that would fail at submission time with a less clear error from the Kubernetes API.

♻️ Suggested addition

```diff
 def __post_init__(self):
     if not self.name:
         raise ValueError("Container name cannot be empty")
     if not self.image:
         raise ValueError("Container image cannot be empty")
+    if not self.command:
+        raise ValueError("Container command cannot be empty")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/base.py` around lines 116 - 121, ContainerSpec.__post_init__ currently checks name and image but not command; update __post_init__ to validate that the command attribute is present and non-empty (e.g., not None and len(command) > 0 or command != []), and raise a ValueError with a clear message like "Container command cannot be empty" when validation fails; locate the ContainerSpec class and modify its __post_init__ method to include this check alongside the existing name and image checks.nemo_skills/pipeline/utils/declarative.py (2)
428-456: Timestamp-based uniqueness has a small collision window.

`int(time.time()) % 100000` wraps every ~27.7 hours, so two pipeline runs exactly 27.7 h apart could collide. This is documented and acceptable for typical usage, but if multiple pipelines run concurrently within the same second, they'll also get the same suffix. Consider incorporating `os.getpid()` or a random component if concurrent pipeline launches are expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 428 - 456, The _make_unique_job_name function currently uses time-based suffix int(time.time()) % 100000 which can collide for concurrent launches or runs ~27.7h apart; update the suffix to include an additional uniqueness component (e.g., append the process id from os.getpid() or a short random hex from secrets.token_hex) combined with the timestamp so suffix becomes e.g. "-{timestamp}{pid_or_rand}". Ensure you adjust suffix_len calculation and the max_base_len truncation logic accordingly so the final name still respects max_length and strip trailing hyphens from the truncated base_name as before.
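One way to close the window is to fold the pid and a short random token into the suffix. This is a hedged sketch with an illustrative name, not the module's actual helper; the 63-character cap mirrors the Kubernetes object-name limit:

```python
import os
import secrets
import time

def make_unique_job_name(base: str, max_length: int = 63) -> str:
    """Append '-<timestamp mod 1e5><pid mod 1000><2 hex chars>' and trim base to fit."""
    suffix = f"-{int(time.time()) % 100000}{os.getpid() % 1000}{secrets.token_hex(1)}"
    # Truncate the base so base + suffix fits, stripping any trailing hyphen.
    name = base[: max_length - len(suffix)].rstrip("-") + suffix
    return name

print(make_unique_job_name("sft-multi-node"))  # e.g. sft-multi-node-84213427f1
```

Same-second launches now differ in pid and random token, at the cost of a slightly longer suffix.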
826-832: `num_tasks` mapped to `cpus` conflates processes-per-node with CPU cores.

`HardwareConfig.num_tasks` represents "Number of tasks (processes) per node" (Slurm-style MPI tasks), but `ResourceSpec.cpus` represents CPU core requests for Kubernetes scheduling. These are different concepts — a job with `num_tasks=1` may still need multiple CPU cores, and a job with `num_tasks=4` (for data loading) doesn't necessarily need exactly 4 CPU cores.

Consider using a separate `cpus` field on `HardwareConfig` or computing a reasonable default for Kubernetes (e.g., `max(num_tasks, 4)` for GPU workloads).
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 826 - 832, The code incorrectly maps HardwareConfig.num_tasks (Slurm-style "tasks per node") to ResourceSpec.cpus, conflating processes with CPU core requests; update the mapping in the block that constructs resources (where group.hardware or HardwareConfig() is used and ResourceSpec(...) is created) to use a proper CPU source: add and prefer a new HardwareConfig.cpus field if present, otherwise compute a safer default (e.g., max(hardware.cpus if set else 0, hardware.num_tasks, 4) for GPU workloads or another sensible heuristic), and set ResourceSpec.cpus to that computed value instead of hardware.num_tasks so CPU core requests reflect cores, not MPI tasks.scripts/k8s-tests/pipeline_smoke_test.py (2)
197-216: Dry-run path accesses private `_convert_groups_to_job_spec` method.

This is acceptable in a smoke test/validation script, but note that this couples the test to an internal API. If the method signature or name changes, this script will break silently (no import-time error). A brief comment noting this dependency would help future maintainers.
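One way to keep the coupling but make it fail loudly is a small wrapper (hypothetical, not part of the PR — shown only to illustrate the defensive pattern):

```python
def convert_via_private_api(pipeline):
    # NOTE: _convert_groups_to_job_spec is a private NeMo-Skills API, used
    # here for validation only; a rename or signature change would otherwise
    # surface as an opaque AttributeError at runtime.
    convert = getattr(pipeline, "_convert_groups_to_job_spec", None)
    if convert is None:
        raise RuntimeError(
            "pipeline._convert_groups_to_job_spec no longer exists; "
            "update the smoke test to match the current internal API"
        )
    return convert()
```

The explicit `getattr` check turns a silent breakage into an immediate, descriptive failure.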
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 197 - 216, The dry-run path calls the private pipeline._convert_groups_to_job_spec method (after pipeline.run(dry_run=True)), which couples this smoke test to an internal API; add a short inline comment immediately above the call to _convert_groups_to_job_spec explaining that this is intentionally using a private/internal method and warning maintainers that changes to that method name or signature will break the test, and optionally mention why it's safe here (validation-only) so future developers understand the dependency.
236-252: `get_backend` and `JobStatus` imported inside function body — verify lazy import necessity.

The import at line 236 is inside `main()` to avoid import-time failures when the Kubernetes backend isn't installed. This is reasonable for a script that may run in dry-run mode without the backend. However, `wait_for_completion` at line 242 uses a hardcoded timeout of 900s that isn't configurable from the CLI.

♻️ Consider making the wait timeout configurable
```diff
 parser.add_argument("--image", default=os.environ.get("PYTORCH_IMAGE", "nvcr.io/nvidia/pytorch:25.03-py3"))
+parser.add_argument("--timeout", type=int, default=900, help="Wait timeout in seconds (default: 900)")
 args = parser.parse_args()
```

Then use `args.timeout` at line 242.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/pipeline_smoke_test.py` around lines 236 - 252, Replace the hardcoded wait timeout (900) passed to backend.wait_for_completion with a CLI-configurable value: add a command-line argument (e.g., parser.add_argument('--timeout', type=int, default=900)) where args is built, then call backend.wait_for_completion(handle, timeout=args.timeout); keep the lazy imports (get_backend, JobStatus) inside the function as they are to avoid import-time failures.

scripts/k8s-tests/validate_sft_e2e.sh (1)
303-311: Failure path doesn't save logs to the file before exiting.

When the job fails, the script prints the last 50 lines of pod logs to stdout (line 308) but doesn't save them to `$LOG_FILE` like the success path does (lines 314-317). This makes post-mortem debugging harder since the log file won't exist on failure.

♻️ Suggested fix
```diff
 if ! kubectl wait --for=condition=complete --timeout="${TIMEOUT}s" "job/$JOB_NAME" -n "$NAMESPACE"; then
     echo ""
     echo "FAILED — Job did not complete. Logs:"
     POD=$(kubectl get pods -l "job-name=$JOB_NAME" -n "$NAMESPACE" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
     if [ -n "$POD" ]; then
-        kubectl logs "$POD" -n "$NAMESPACE" --tail=50
+        LOG_FILE="${SCRIPT_DIR}/logs/sft-e2e-$(date +%Y%m%d-%H%M%S).log"
+        mkdir -p "$(dirname "$LOG_FILE")"
+        kubectl logs "$POD" -n "$NAMESPACE" > "$LOG_FILE" 2>&1 || true
+        echo "Full logs saved to: $LOG_FILE"
+        tail -50 "$LOG_FILE"
     fi
     exit 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/validate_sft_e2e.sh` around lines 303 - 311, The failure branch that handles a non-completing job should capture and append the pod logs into the same $LOG_FILE used by the success path; update the block that runs kubectl logs (the branch that runs kubectl wait --for=condition=complete ... for job/$JOB_NAME and then fetches POD via kubectl get pods) so that it writes the diagnostic output into $LOG_FILE (create the file if necessary) — e.g., redirect the kubectl logs output and any echo messages into $LOG_FILE (and still print to stdout if desired) so post-mortem artifacts are preserved for JOB_NAME and NAMESPACE. Ensure you use the same $LOG_FILE variable and preserve the current --tail=50 behavior when saving logs.

scripts/k8s-tests/smoke_test_single_node.sh (1)
202-213: `kubectl wait --for=condition=complete` won't distinguish job failure from timeout.

`kubectl wait --for=condition=complete` returns non-zero both when the job fails and when the wait times out. The differentiation at lines 207-213 only checks `.status.failed` but doesn't handle the case where the job is still running (timed-out wait) vs. actually failed. Consider also checking `condition=failed` explicitly for a clearer status report.

♻️ Suggested improvement
```diff
 # Wait for completion
-kubectl wait --for=condition=complete --timeout=${TIMEOUT_SECONDS}s \
-    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null && JOB_STATUS="succeeded" || JOB_STATUS="failed"
-
-# Check if it actually failed vs timed out
-if [ "$JOB_STATUS" = "failed" ]; then
-    # Check if job failed or is still running
-    FAILED=$(kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath='{.status.failed}' 2>/dev/null || echo "0")
-    if [ "${FAILED:-0}" = "0" ]; then
-        echo "Job timed out waiting for completion"
-    fi
-fi
+if kubectl wait --for=condition=complete --timeout=${TIMEOUT_SECONDS}s \
+    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null; then
+    JOB_STATUS="succeeded"
+elif kubectl wait --for=condition=failed --timeout=5s \
+    -n "$NAMESPACE" "job/$JOB_NAME" 2>/dev/null; then
+    JOB_STATUS="failed"
+else
+    JOB_STATUS="timed-out"
+    echo "Job timed out waiting for completion"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/smoke_test_single_node.sh` around lines 202 - 213, The current wait treats any non-zero exit as "failed" but only checks .status.failed; update the post-wait logic to distinguish an actual job failure vs a timeout/still-running job by querying the job status with kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath and examining .status.failed and .status.active (or .status.conditions[*].type for "Failed"/"Complete"). Concretely: after the kubectl wait block, if JOB_STATUS="failed" run a kubectl get job and if .status.failed > 0 set a clear failure message, else if .status.active > 0 report "Job timed out/still running", otherwise inspect .status.conditions for a Failed condition to report failure; use the existing JOB_NAME, JOB_STATUS and FAILED variables and the same namespace variable to implement these checks.

scripts/k8s-tests/check_nccl_logs.py (1)
133-137: Parameterize the `tuple` return type hint.

`tuple` without parameters doesn't communicate the return shape. Since you're already importing from `typing`, this is a quick win for readability.

♻️ Suggested fix
```diff
 def validate_result(
     result: NCCLCheckResult,
     expected_nodes: Optional[int] = None,
     expected_gpus_per_node: Optional[int] = None,
-) -> tuple:
+) -> Tuple[bool, List[str]]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/check_nccl_logs.py` around lines 133 - 137, The return type for validate_result is currently unparameterized as tuple; update it to a specific Tuple[...] annotation (e.g., Tuple[bool, Optional[str]] if the function returns a success boolean and an optional message) by importing Tuple from typing if necessary and replacing the signature def validate_result(...) -> tuple: with def validate_result(...) -> Tuple[bool, Optional[str]]: (or the exact Tuple[...] that matches the actual return values in validate_result).

scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml (1)
43-73: Minor: `NCCL_DEBUG=INFO` set redundantly in both the script and container env.

Line 46 (`export NCCL_DEBUG=INFO` in the script) and line 98 (`{name: NCCL_DEBUG, value: "INFO"}` in the container env) both set the same variable. The container env is sufficient and the script export is redundant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml` around lines 43 - 73, The run_sft.sh script redundantly exports NCCL_DEBUG=INFO while the container environment already sets NCCL_DEBUG; remove the export line from the run_sft.sh content (the line "export NCCL_DEBUG=INFO") to avoid duplicate environment configuration, keeping the container env entry ({name: NCCL_DEBUG, value: "INFO"}) as the single source of truth.

docs/kubernetes-sft-runbook.md (1)
44-49: The `sleep 15` approach for log capture is fragile.

The driver-check command uses `sleep 15` before reading logs, which is a race condition — the pod might not have completed in time. Consider using `kubectl wait` instead:

Suggested improvement
```diff
-sleep 15 && kubectl logs driver-check && kubectl delete pod driver-check --ignore-not-found
+kubectl wait --for=condition=Ready pod/driver-check --timeout=60s && \
+    kubectl logs driver-check && kubectl delete pod driver-check --ignore-not-found
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-sft-runbook.md` around lines 44 - 49, Replace the fragile "sleep 15" pattern with an explicit kubectl wait for the driver-check pod; after running kubectl run driver-check, call kubectl wait --for=condition=ContainersReady pod/driver-check --timeout=30s (or --for=condition=Ready if you prefer) and then run kubectl logs driver-check and kubectl delete pod driver-check --ignore-not-found to ensure logs are captured reliably; update references to kubectl run driver-check, kubectl logs driver-check, and kubectl delete pod driver-check accordingly.

scripts/k8s-tests/image-build/build-nemo-rl.yaml (1)
53-53: Consider pinning image tags for reproducibility.

Both `alpine/git:latest` (line 53) and `gcr.io/kaniko-project/executor:latest` (line 101) use mutable `latest` tags. For build reproducibility, consider pinning to specific versions (e.g., `alpine/git:v2.45.2`, `gcr.io/kaniko-project/executor:v1.23.2`).

Also applies to: 101-101
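As a lightweight enforcement idea (a sketch, not part of this PR), a pre-commit or CI check could flag mutable `:latest` references in manifests before they merge:

```python
import re

# Simple heuristic over manifest text, not a full YAML parse: match
# "image: <ref>:latest" lines so CI can require pinned tags or digests.
LATEST_RE = re.compile(r"image:\s*(\S+:latest)\b")


def find_latest_images(manifest_text: str) -> list:
    """Return all image references that use the mutable :latest tag."""
    return LATEST_RE.findall(manifest_text)
```

Running this over the repo's `*.yaml` files and failing on a non-empty result would catch regressions like the two flagged above.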
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml` at line 53, Replace mutable "latest" image tags with fixed versioned tags for reproducibility: update the image fields referencing "alpine/git:latest" and "gcr.io/kaniko-project/executor:latest" to specific, pinned versions (e.g., "alpine/git:v2.45.2" and "gcr.io/kaniko-project/executor:v1.23.2") wherever those image strings appear in the manifest so builds use immutable, reproducible images.

scripts/k8s-tests/image-build/build-and-load.sh (2)
128-129: Kaniko image not version-pinned.
`gcr.io/kaniko-project/executor:latest` can change without notice, potentially breaking builds. Consider pinning to a specific digest or version tag for reproducibility.

Suggested fix
```diff
-        image: gcr.io/kaniko-project/executor:latest
+        image: gcr.io/kaniko-project/executor:v1.23.2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 128 - 129, The kaniko executor image is using the mutable tag "gcr.io/kaniko-project/executor:latest"; update the image reference in the build job (the "kaniko" container definition) to a specific version tag or immutable digest (for example a released version tag or an `@sha256`:... digest) to ensure reproducible builds and prevent breaking changes from "latest". Locate the "kaniko" container entry in build-and-load.sh and replace the image value with a pinned tag or digest.
234-255: Loader pods run privileged with full host root mount — document the security implications.

The loader pod uses `privileged: true` and mounts `/` from the host as read-only. While necessary for `ctr import`, this is a significant security surface. Consider adding a brief inline comment documenting why this is needed and the security trade-off.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/image-build/build-and-load.sh` around lines 234 - 255, The loader pod created by the kubectl run command (named "load-$NODE") is set to privileged: true and mounts the host root via volume "host-root" (mountPath "/host"), which is a high-risk security surface; add a concise inline comment above this pod definition explaining that privileged mode and the host-root mount are required to allow running ctr import against the host containerd socket (volume "containerd-sock") but they increase attack surface and should only be used in trusted test clusters, and note any mitigation (readOnly host-root, namespace -n "$NAMESPACE", and why those are insufficient); reference the kubectl run block, the "loader" container, securityContext.privileged, and the "host-root"/"containerd-sock" volumes in the comment.

docs/kubernetes-nemo-rl-sft-runbook.md (1)
75-84: Add language identifiers to fenced code blocks.

Three fenced code blocks (lines 75, 116, 137) lack a language identifier, which triggers markdownlint MD040. Since these show expected output, use `` ```text ``.

Suggested fix
Line 75:
````diff
-```
+```text
````

Line 116:
````diff
-```
+```text
````

Line 137:
````diff
-```
+```text
````

Also applies to: 116-127, 137-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes-nemo-rl-sft-runbook.md` around lines 75 - 84, The three fenced code blocks showing run output (the blocks containing "▶ Setting up data... … === REAL NEMO-RL SFT COMPLETED ===" and the other two similar expected-output blocks) are missing a language identifier and trigger markdownlint MD040; update each opening triple-backtick to use ```text so the blocks are explicitly plain text (e.g., replace ``` with ```text for the fenced blocks that contain the expected output sequences).

scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml (1)
48-49: Redundant self-assignment of `MASTER_ADDR`.

`MASTER_ADDR=${MASTER_ADDR}` on line 49 just reassigns the environment variable to itself. The comment says "use the headless service DNS which resolves to node IPs", but the value already comes from the pod env (line 153–154). This line can be removed or replaced with a comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml` around lines 48 - 49, The env entry MASTER_ADDR=${MASTER_ADDR} is a redundant self-assignment; remove this line from the container/env block (or replace it with a clarifying comment) so the pod uses the existing MASTER_ADDR provided earlier, and ensure any intended headless-service DNS is explicitly assigned if needed (search for MASTER_ADDR and the env block in the manifest to locate the exact spot).

nemo_skills/pipeline/backends/slurm.py (1)
128-156: Stored `task` reference from closed context manager is unused — acceptable for now.

The `task` object (line 145) is created inside the `with get_exp(...)` block and stored in `self._jobs`. After the block exits, this reference may be stale. However, since the status methods (`get_status`, `wait_for_completion`, `cancel_job`) currently return placeholders and don't use the stored task, this is safe for the transitional implementation. When implementing the TODOs, avoid relying on this stored task reference — use `sacct` or re-open the experiment instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/slurm.py` around lines 128 - 156, The code stores a `task` (from add_task) into `self._jobs` while inside the `with get_exp(...)` context which may become stale after the context closes; update future implementations of `get_status`, `wait_for_completion`, and `cancel_job` to not rely on this stored `task` reference (symbols: task, get_exp, self._jobs) — instead query Slurm directly (e.g., via `sacct`, `squeue`, or re-open the experiment with `get_exp` when needed) or store only stable identifiers (experiment name / Slurm job id) in `self._jobs` and use those to retrieve live status/actions.

nemo_skills/pipeline/backends/kubernetes.py (1)
268-268: Consider `strict=True` on `zip()` for defensive safety.

`zip(spec.containers, pod_spec.containers)` should always have equal lengths since `pod_spec.containers` is built from `spec.containers` on lines 209–211. However, adding `strict=True` would catch any future divergence.

```diff
-    for container_spec, container in zip(spec.containers, pod_spec.containers):
+    for container_spec, container in zip(spec.containers, pod_spec.containers, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/kubernetes.py` at line 268, The zip over containers should be made strict to catch length mismatches: change the loop using zip(spec.containers, pod_spec.containers) to call zip(spec.containers, pod_spec.containers, strict=True) so any future divergence between spec.containers and pod_spec.containers raises immediately; update the loop where pod_spec and spec containers are iterated and ensure the runtime target supports Python 3.10+ (or add a compatibility guard if not).

scripts/k8s-tests/run_sft_k8s_real.py (1)
30-102: Training script is embedded as a large multi-line string — fragile and hard to maintain.

The entire training script (lines 51–101) is constructed as a concatenated string. This is difficult to read, test independently, and modify. While acceptable for a test/validation script, consider extracting it to a separate file if this script grows further.
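A sketch of the suggested extraction (the file name `sft_train_pipeline.py` is an assumption from the agent prompt below, not an existing file in the PR):

```python
from pathlib import Path


def load_train_script(script_dir: Path, name: str = "sft_train_pipeline.py") -> str:
    # Read the training script from a sibling file instead of building a
    # ~50-line string literal inline; the file can then be linted and
    # unit-tested on its own.
    script_path = script_dir / name
    if not script_path.exists():
        raise FileNotFoundError(f"Expected training script at {script_path}")
    return script_path.read_text()
```

The returned text can still be written to `/tmp/train_pipeline.py` inside the container command, so the torchrun invocation is unchanged.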
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/k8s-tests/run_sft_k8s_real.py` around lines 30 - 102, The train_cmd in main() embeds a large multi-line Python training script into a shell heredoc (writing to /tmp/train_pipeline.py), which is fragile and hard to maintain; extract that embedded script into a standalone Python file (e.g., sft_train_pipeline.py) and update train_cmd to run that file instead (keep the same environment exports and torchrun invocation), or alternatively load the script text from a separate resource at runtime and inject it into /tmp/train_pipeline.py; update references to the heredoc and /tmp/train_pipeline.py in train_cmd so main() uses the external script (train_cmd variable) while preserving the torchrun flags (--nproc_per_node, --nnodes, --node_rank, --master_addr, --master_port) and any environment variable usages (NODE_RANK, MASTER_ADDR).

nemo_skills/pipeline/backends/__init__.py (1)
29-66: `register_backend` decorator is the public extension point but is not re-exported.

Users who want to register a custom backend must reach into `nemo_skills.pipeline.backends.factory` directly, bypassing the unified public surface. Since `BackendFactory` and `get_backend` are already re-exported, `register_backend` belongs alongside them.

♻️ Proposed change
```diff
-from nemo_skills.pipeline.backends.factory import BackendFactory, get_backend
+from nemo_skills.pipeline.backends.factory import BackendFactory, get_backend, register_backend

 __all__ = [
     ...
     # Factory
     "BackendFactory",
     "get_backend",
+    "register_backend",
     ...
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/backends/__init__.py` around lines 29 - 66, The public API currently re-exports BackendFactory and get_backend but omits the register_backend decorator; update nemo_skills.pipeline.backends.__init__ by importing register_backend from nemo_skills.pipeline.backends.factory and adding "register_backend" to the __all__ list so users can register custom backends via the unified public surface alongside BackendFactory and get_backend.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cluster_configs/kubernetes/rbac.yaml`:
- Around line 62-65: The RBAC entry for batch/jobs in rbac.yaml grants the
"patch" verb (apiGroups: ["batch"], resources: ["jobs"], verbs includes "patch")
but the simplified example in docs/kubernetes-guide.md only lists "create,
delete, get, list, watch"; update the docs example to include "patch" or add a
brief note that the full RBAC (cluster_configs/kubernetes/rbac.yaml) includes
additional verbs such as "patch" so the guide and the manifest remain
consistent.
In `@cluster_configs/kubernetes/README.md`:
- Line 7: The README statement "Kubernetes cluster 1.24+ (for native job
dependencies)" is misleading because the pipeline uses declarative.py in
sequential mode rather than Kubernetes native job dependency features; update
the README to either remove the parenthetical or change it to state the actual
minimum Kubernetes version required (or that sequential submission is used) and
reference that declarative.py uses sequential mode so readers aren’t led to
expect native K8s job dependencies.
In `@nemo_skills/pipeline/backends/factory.py`:
- Around line 131-141: The fallback path currently ignores a false return from
fallback_backend.health_check() so add explicit handling: after calling
BackendFactory._create_backend(...) check the boolean result and if False call
LOG.error with a descriptive message including fallback_executor and any
available state, then raise a specific exception (or re-raise the primary
exception wrapped with context) so the outer handler reports that the fallback
was attempted and failed; ensure you reference BackendFactory._create_backend,
fallback_backend.health_check(), LOG.error and the RuntimeError re-raise so the
new error path surfaces in logs and exceptions.
- Around line 103-108: The check after fetching cluster_config["executor"]
currently raises a misleading message when the key exists but the value is
falsy; update the guard around executor (the local variable) to validate the
value (e.g., ensure it's not None/empty and is a string) and raise a clearer
ValueError such as "cluster_config['executor'] must be a non-empty string" (or
similar) so callers know the value is invalid rather than the key missing; also
ensure this validation happens before calling executor.lower() to avoid
AttributeError.
- Around line 113-141: The current broad except around
BackendFactory._create_backend swallows ValueError for unknown executors and
re-raises RuntimeError; change the error handling so ValueError from
BackendFactory._create_backend (or any explicit "unknown executor" error) is
allowed to propagate unchanged: either remove ValueError from the general except
or add an early except ValueError: raise to re-raise it immediately before
attempting fallback logic and before converting other exceptions into
RuntimeError; keep the existing fallback flow for non-ValueError exceptions (use
BackendFactory._create_backend, backend.health_check(), fallback_executor, and
the final RuntimeError only for other failures).
In `@nemo_skills/pipeline/backends/local.py`:
- Around line 138-145: The for-loop in the method using enumerate is introducing
an unused loop variable i; change the loop to either drop enumerate (for (proc,
container_spec) in zip(processes, spec.containers):) or rename i to _ so the
unused variable warning is resolved; update the loop that creates threads
calling self._collect_logs and appends to job._log_threads accordingly.
- Around line 191-197: The catch-all except in _collect_logs (and similarly in
health_check) is too broad and may hide programming errors; restrict the handler
to the specific I/O/subprocess exceptions expected instead. Update _collect_logs
to only catch IOError/OSError (and BrokenPipeError if relevant) when iterating
proc.stdout and log the error with container_name and exception details; update
health_check to catch only subprocess.SubprocessError,
subprocess.CalledProcessError, and OSError where subprocess calls are made.
Ensure you reference the same variables (job, container_name, proc, LOG) and
remove the bare `except Exception` blocks so unexpected errors propagate.
In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Around line 99-100: The Dockerfile uses the wrong source path casing: replace
the COPY instruction that references "skills/" with "Skills/" so it matches the
init container build context (change the COPY skills/ /opt/NeMo-Skills/ line to
use COPY Skills/ /opt/NeMo-Skills/); update the COPY line in
Dockerfile.nemo-rl-k8s so the build context prepared by build-nemo-rl.yaml
(which clones to /context/Skills/) is copied correctly.
In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 186-202: The copy pod can leak if kubectl wait or kubectl cp
fails; inside the build_image function add a trap that always deletes the copy
pod (referencing the pod name pattern "copy-${name}") on EXIT or ERR so kubectl
delete pod -n "$NAMESPACE" --ignore-not-found is executed even when kubectl wait
or kubectl cp fail; ensure the trap cleans up using the same pod identifier and
namespace variables and is set immediately after creating the pod (before
kubectl wait) so cleanup runs on any early exit.
In `@scripts/k8s-tests/manifests/multi-node-sft-test.yaml`:
- Around line 125-132: The DNS wait loop that runs getent hosts
sft-multi-node-0.sft-multi-node-workers.default.svc.cluster.local may exit after
60 retries without indicating failure, then proceeds to call torchrun; modify
the script so after the for loop you check whether the hostname resolved (e.g.,
test last getent result or attempt a final getent) and if it failed, print a
clear error including the hostname and exit non‑zero instead of invoking
torchrun --nproc_per_node=2 --nnodes=2 (which uses NODE_RANK, MASTER_ADDR and
/scripts/train.py); if the host resolved, continue to run torchrun as before.
- Around line 100-151: The pod spec for the sft-multi-node job is missing a
securityContext which causes scanners to flag running as root/allowing privilege
escalation; add a minimal pod-level securityContext (e.g., runAsNonRoot: true,
runAsUser: 1000) and a container-level securityContext inside the trainer
container (e.g., allowPrivilegeEscalation: false, readOnlyRootFilesystem: true,
capabilities: drop: [ "ALL" ]) to the spec referenced by the pod template (look
for spec -> containers -> name: trainer and the surrounding pod spec) so the
manifest enforces non-root execution and disables privilege escalation while
keeping the test runnable.
In `@scripts/k8s-tests/sft_smoke_test.py`:
- Around line 38-45: Remove the ineffective "noqa: E402" directives from the
import statements — specifically the import of nemo_run and the from-import of
nemo_skills.pipeline.utils.declarative (the names Command, CommandGroup,
HardwareConfig, Pipeline) — since RUF100 indicates E402 isn't enabled; simply
delete the ", # noqa: E402" portions so the imports remain unchanged but
without the unused noqa markers.
---
Duplicate comments:
In `@docs/kubernetes-guide.md`:
- Around line 680-683: Add language specifiers to the three fenced code blocks
that contain plain log/ASCII output: the block showing "Job name
'Qwen2.5_Math_7B' is not Kubernetes-compliant..." should begin with ```text, the
block containing "Pipeline has job dependencies but sequential=False..." should
begin with ```text, and the ASCII diagram that starts with
"┌─────────────────────────" should also begin with ```text; update each opening
fence accordingly so the blocks are treated as plain text/ASCII in the rendered
docs.
In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Around line 975-1020: The get_logs function currently only reads logs from
pods.items[0], dropping logs from other pods in multi-node jobs; update get_logs
(in nemo_skills.pipeline.backends.kubernetes.KubernetesBackend) to either accept
an optional pod_index (int) or pod_name parameter and use it to select the
correct pod, or iterate over all pods.items to yield logs from each pod (for
follow=True you can stream each pod separately or document that follow streams a
single pod by index). Replace references to pods.items[0] with logic that: 1)
validates the requested index/pod name against pods.items and 2) loops over
pods.items (or picks the chosen pod) before calling read_namespaced_pod_log,
ensuring both the follow and non-follow branches handle multiple pods correctly
and include the pod identity in yielded lines for clarity.
In `@nemo_skills/pipeline/nemo_rl/sft.py`:
- Around line 557-580: The Kubernetes branch in the SFT runner currently drops
Slurm-specific params without feedback; update the conditional that routes to
_run_sft_kubernetes to detect if any Slurm-only arguments (qos, time_min,
exclusive, sbatch_kwargs, reuse_code, reuse_code_exp, _reuse_exp,
_task_dependencies) were provided and either log a warning via
LOG.warn/LOG.warning explaining these params are ignored when executor ==
"kubernetes" or raise a ValueError (follow project convention) instead of
silently ignoring them; reference the dispatcher code that checks
cluster_config.get("executor") and the _run_sft_kubernetes call to add this
validation and messaging before returning.
- Around line 614-639: The Slurm path add_task calls that perform CPU-only
conversion/averaging (e.g., the add_task that runs get_checkpoint_convert_cmd
with task_name f"{expname}-convert-final-ckpt" and the other add_task calls that
invoke get_checkpoint_convert_cmd or the averaging command) are incorrectly
passing num_gpus=num_gpus; change those to num_gpus=0 so the Slurm tasks mirror
the K8s path. Locate the add_task invocations that run checkpoint
conversion/averaging (identify by get_checkpoint_convert_cmd,
get_average_checkpoints_cmd or task_names containing "convert" or "average") and
set num_gpus=0, leaving other parameters unchanged.
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 693-696: The current loop over job_dependencies silently logs and
skips string dependencies (the dep variable) which can break execution order on
Kubernetes; replace the LOG.warning branch in the loop that inspects
job_dependencies (where dep is checked with isinstance(dep, str)) so it raises a
clear exception (e.g., ValueError or a custom exception) instead of skipping,
including the dependency name and guidance that external string dependencies are
unsupported on Kubernetes; ensure the change surfaces in the same function
handling job_dependencies so callers immediately fail-fast when a string
dependency is provided.
In `@scripts/k8s-tests/image-build/build-and-load.sh`:
- Around line 19-26: Add a trap-based cleanup similar to the validation scripts:
implement a cleanup function (e.g., named cleanup) in build-and-load.sh that
deletes any intermediate pods created by this script (targeting pod names/labels
such as build-prep, build-nemo-rl-local, cache-copy and load-*) and any
temporary resources, then register it with trap 'cleanup' on EXIT INT TERM;
update places where pods are created (the build/run sections) to ensure they
either set identifying labels or predictable names so the cleanup can reliably
find and delete them.
In `@scripts/k8s-tests/image-build/Dockerfile.nemo-rl-k8s`:
- Around line 38-39: Replace the insecure repo addition and deprecated key
import: stop using the echo ... /etc/apt/sources.list.d/nvidia-devtools.list
command that writes an http:// URL and stop using apt-key adv --fetch-keys ...
7fa2af80.pub; instead install the cuda-keyring package and add the NVIDIA
devtools repository entry with an https:// URL that includes a signed-by=
reference to the keyring file (so the sources list entry references the
installed cuda-keyring key), removing any apt-key usage and legacy key fetches.
In `@scripts/k8s-tests/image-build/import-image.yaml`:
- Around line 88-89: The DaemonSet import script currently ends with "sleep 10",
which causes the pod to exit after 10s and Kubernetes to restart it indefinitely
(violating the pass criteria "no restarts"); update the import container to use
"sleep infinity" instead of "sleep 10" so the pod stays alive after the import,
and ensure you add a cleanup step to delete the DaemonSet after verification (or
convert this flow to a per-node Job pattern) so the import does not run
repeatedly.
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 101-110: The script incorrectly treats non-zero RAY_EXIT from the
blocking Ray worker as fatal; update the worker start/exit handling around the
ray start --address="${MASTER_ADDR}:6379" --num-gpus=${NUM_GPUS} --block
invocation so the non-zero exit is suppressed per runbook (e.g., add the
shell-safe suppression like appending || true to the ray start command or remove
the RAY_EXIT check/exit path), and ensure the echo "Worker done (Ray head shut
down)." still runs; reference RAY_EXIT, MASTER_ADDR, NUM_GPUS and the ray start
--block command when making the change.
In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 51-52: The pipeline currently silences pip install errors by
redirecting stderr to /dev/null in the train_cmd string; update the train_cmd
(the string assembled in train_cmd) to remove the "2>/dev/null" redirection so
pip install errors are preserved in container logs (or alternatively redirect
stderr to a log file or tee it) and ensure the command still fails on non-zero
exit so diagnostic output remains available for debugging.
- Around line 167-182: The loop over result currently only collects logs and
fails on JobStatus.FAILED, letting NON-TERMINAL/indeterminate statuses (UNKNOWN,
CANCELLED) pass as success; update the logic in the for loop that calls
backend.wait_for_completion(handle, ...) so that any status not equal to
JobStatus.SUCCEEDED is treated as a non-success: always collect logs via
backend.get_logs(handle) for SUCCEEDED/FAILED/UNKNOWN/CANCELLED, call
backend.cleanup(handle) as now, and exit with non-zero (sys.exit(1)) for any
status != JobStatus.SUCCEEDED (i.e., include JobStatus.UNKNOWN and
JobStatus.CANCELLED alongside FAILED). Ensure you reference the same symbols:
result, backend.wait_for_completion, JobStatus, backend.get_logs,
backend.cleanup, and sys.exit.
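The intended classification can be distilled into a small helper. This is an illustrative sketch only: the `JobStatus` values mirror those named above, but the backend is stubbed and `finalize_jobs` is a hypothetical name, not the script's API.

```python
from enum import Enum


class JobStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"
    CANCELLED = "cancelled"


def finalize_jobs(results, backend):
    """Collect logs and clean up every job; fail unless ALL succeeded.

    `results` is an iterable of (handle, status) pairs. Any status other
    than SUCCEEDED -- including UNKNOWN and CANCELLED, not just FAILED --
    counts as non-success.
    """
    exit_code = 0
    for handle, status in results:
        print(backend.get_logs(handle))  # always preserve logs
        backend.cleanup(handle)
        if status is not JobStatus.SUCCEEDED:
            exit_code = 1
    return exit_code  # caller would do sys.exit(exit_code)
```

The key point is that success is an allow-list of exactly one status, so any future status value added to the enum fails safe by default.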
In `@scripts/k8s-tests/sft_smoke_test.py`:
- Around line 213-222: wait_for_completion can return non-terminal statuses
(RUNNING/PENDING/UNKNOWN) on timeout but the loop currently treats them as
success; update the logic after calling backend.wait_for_completion(handle,
timeout=1800) to treat any status not in (JobStatus.SUCCEEDED, JobStatus.FAILED)
as a failure: always print and collect logs via backend.get_logs(handle) for
non-SUCCEEDED outcomes and call sys.exit(1) when status != JobStatus.SUCCEEDED
(i.e., for FAILED or timeout/non-terminal statuses). Ensure you reference the
existing symbols backend.wait_for_completion, JobStatus, backend.get_logs and
sys.exit in the updated branch so timeouts no longer pass silently.
---
Nitpick comments:
In `@docs/kubernetes-nemo-rl-sft-runbook.md`:
- Around line 75-84: The three fenced code blocks showing run output (the blocks
containing "▶ Setting up data... … === REAL NEMO-RL SFT COMPLETED ===" and the
other two similar expected-output blocks) are missing a language identifier and
trigger markdownlint MD040; update each opening triple-backtick to use ```text
so the blocks are explicitly plain text (e.g., replace ``` with ```text for the
fenced blocks that contain the expected output sequences).
In `@docs/kubernetes-sft-runbook.md`:
- Around line 44-49: Replace the fragile "sleep 15" pattern with an explicit
kubectl wait for the driver-check pod; after running kubectl run driver-check,
call kubectl wait --for=condition=ContainersReady pod/driver-check --timeout=30s
(or --for=condition=Ready if you prefer) and then run kubectl logs driver-check
and kubectl delete pod driver-check --ignore-not-found to ensure logs are
captured reliably; update references to kubectl run driver-check, kubectl logs
driver-check, and kubectl delete pod driver-check accordingly.
In `@nemo_skills/pipeline/backends/__init__.py`:
- Around line 29-66: The public API currently re-exports BackendFactory and
get_backend but omits the register_backend decorator; update
nemo_skills.pipeline.backends.__init__ by importing register_backend from
nemo_skills.pipeline.backends.factory and adding "register_backend" to the
__all__ list so users can register custom backends via the unified public
surface alongside BackendFactory and get_backend.
In `@nemo_skills/pipeline/backends/base.py`:
- Around line 116-121: ContainerSpec.__post_init__ currently checks name and
image but not command; update __post_init__ to validate that the command
attribute is present and non-empty (e.g., not None and len(command) > 0 or
command != []), and raise a ValueError with a clear message like "Container
command cannot be empty" when validation fails; locate the ContainerSpec class
and modify its __post_init__ method to include this check alongside the existing
name and image checks.
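The suggested check can be sketched as below. Field names follow the review comment; the real `ContainerSpec` likely carries more attributes.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerSpec:
    name: str
    image: str
    command: list = field(default_factory=list)

    def __post_init__(self):
        if not self.name:
            raise ValueError("Container name cannot be empty")
        if not self.image:
            raise ValueError("Container image cannot be empty")
        if not self.command:  # truthiness covers both None and []
            raise ValueError("Container command cannot be empty")
```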
In `@nemo_skills/pipeline/backends/kubernetes.py`:
- Line 268: The zip over containers should be made strict to catch length
mismatches: change the loop using zip(spec.containers, pod_spec.containers) to
call zip(spec.containers, pod_spec.containers, strict=True) so any future
divergence between spec.containers and pod_spec.containers raises immediately;
update the loop where pod_spec and spec containers are iterated and ensure the
runtime target supports Python 3.10+ (or add a compatibility guard if not).
In `@nemo_skills/pipeline/backends/slurm.py`:
- Around line 128-156: The code stores a `task` (from add_task) into
`self._jobs` while inside the `with get_exp(...)` context which may become stale
after the context closes; update future implementations of `get_status`,
`wait_for_completion`, and `cancel_job` to not rely on this stored `task`
reference (symbols: task, get_exp, self._jobs) — instead query Slurm directly
(e.g., via `sacct`, `squeue`, or re-open the experiment with `get_exp` when
needed) or store only stable identifiers (experiment name / Slurm job id) in
`self._jobs` and use those to retrieve live status/actions.
In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 63-73: The docstring for the timeout parsing utility (in the
function/method that describes SLURM and K8s formats—look for the parse_timeout
or similar function in cluster.py) repeats the SLURM format list twice; remove
the redundant second repetition (the lines that repeat "Time format for SLURM:
..." and the list that follows) while keeping the initial "Supports both:"
section and retaining the SLURM reference link
(https://slurm.schedmd.com/sbatch.html#OPT_time) so the docstring only lists the
formats once and remains accurate and concise.
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 428-456: The _make_unique_job_name function currently uses
time-based suffix int(time.time()) % 100000 which can collide for concurrent
launches or runs ~27.7h apart; update the suffix to include an additional
uniqueness component (e.g., append the process id from os.getpid() or a short
random hex from secrets.token_hex) combined with the timestamp so suffix becomes
e.g. "-{timestamp}{pid_or_rand}". Ensure you adjust suffix_len calculation and
the max_base_len truncation logic accordingly so the final name still respects
max_length and strip trailing hyphens from the truncated base_name as before.
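A sketch of the suggested fix; the function name and the 63-character default (the Kubernetes DNS label limit) are assumptions:

```python
import os
import time


def make_unique_job_name(base_name: str, max_length: int = 63) -> str:
    """Combine timestamp with PID so concurrent launches, and runs that
    happen to share a `time.time() % 100000` value (~27.7h apart), still
    get distinct suffixes."""
    suffix = f"-{int(time.time()) % 100000}{os.getpid() % 1000:03d}"
    max_base_len = max_length - len(suffix)
    return base_name[:max_base_len].rstrip("-") + suffix
```

A short random hex from `secrets.token_hex(2)` would work equally well in place of the PID component.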
- Around line 826-832: The code incorrectly maps HardwareConfig.num_tasks
(Slurm-style "tasks per node") to ResourceSpec.cpus, conflating processes with
CPU core requests; update the mapping in the block that constructs resources
(where group.hardware or HardwareConfig() is used and ResourceSpec(...) is
created) to use a proper CPU source: add and prefer a new HardwareConfig.cpus
field if present, otherwise compute a safer default (e.g., max(hardware.cpus if
set else 0, hardware.num_tasks, 4) for GPU workloads or another sensible
heuristic), and set ResourceSpec.cpus to that computed value instead of
hardware.num_tasks so CPU core requests reflect cores, not MPI tasks.
In `@scripts/k8s-tests/check_nccl_logs.py`:
- Around line 133-137: The return type for validate_result is currently
unparameterized as tuple; update it to a specific Tuple[...] annotation (e.g.,
Tuple[bool, Optional[str]] if the function returns a success boolean and an
optional message) by importing Tuple from typing if necessary and replacing the
signature def validate_result(...) -> tuple: with def validate_result(...) ->
Tuple[bool, Optional[str]]: (or the exact Tuple[...] that matches the actual
return values in validate_result).
In `@scripts/k8s-tests/image-build/build-and-load.sh`:
- Around line 128-129: The kaniko executor image is using the mutable tag
"gcr.io/kaniko-project/executor:latest"; update the image reference in the build
job (the "kaniko" container definition) to a specific version tag or immutable
digest (for example a released version tag or an `@sha256`:... digest) to ensure
reproducible builds and prevent breaking changes from "latest". Locate the
"kaniko" container entry in build-and-load.sh and replace the image value with a
pinned tag or digest.
- Around line 234-255: The loader pod created by the kubectl run command (named
"load-$NODE") is set to privileged: true and mounts the host root via volume
"host-root" (mountPath "/host"), which is a high-risk security surface; add a
concise inline comment above this pod definition explaining that privileged mode
and the host-root mount are required to allow running ctr import against the
host containerd socket (volume "containerd-sock") but they increase attack
surface and should only be used in trusted test clusters, and note any
mitigation (readOnly host-root, namespace -n "$NAMESPACE", and why those are
insufficient); reference the kubectl run block, the "loader" container,
securityContext.privileged, and the "host-root"/"containerd-sock" volumes in the
comment.
In `@scripts/k8s-tests/image-build/build-nemo-rl.yaml`:
- Line 53: Replace mutable "latest" image tags with fixed versioned tags for
reproducibility: update the image fields referencing "alpine/git:latest" and
"gcr.io/kaniko-project/executor:latest" to specific, pinned versions (e.g.,
"alpine/git:v2.45.2" and "gcr.io/kaniko-project/executor:v1.23.2") wherever
those image strings appear in the manifest so builds use immutable, reproducible
images.
In `@scripts/k8s-tests/image-build/kaniko-build.sh`:
- Around line 36-37: Remove the dead variables SCRIPT_DIR and REPO_ROOT from
kaniko-build.sh: these are defined but never used (SCRIPT_DIR only serves to
compute REPO_ROOT). Delete the two lines that set SCRIPT_DIR and REPO_ROOT (the
symbols SCRIPT_DIR and REPO_ROOT) so there is no unused-variable noise and
ShellCheck SC2034 is resolved; if the script actually needs the repository path
later, instead compute and use it where required rather than defining them
unused at the top.
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-multinode-test.yaml`:
- Around line 48-49: The env entry MASTER_ADDR=${MASTER_ADDR} is a redundant
self-assignment; remove this line from the container/env block (or replace it
with a clarifying comment) so the pod uses the existing MASTER_ADDR provided
earlier, and ensure any intended headless-service DNS is explicitly assigned if
needed (search for MASTER_ADDR and the env block in the manifest to locate the
exact spot).
In `@scripts/k8s-tests/manifests/real-nemo-rl-sft-test.yaml`:
- Around line 43-73: The run_sft.sh script redundantly exports NCCL_DEBUG=INFO
while the container environment already sets NCCL_DEBUG; remove the export line
from the run_sft.sh content (the line "export NCCL_DEBUG=INFO") to avoid
duplicate environment configuration, keeping the container env entry ({name:
NCCL_DEBUG, value: "INFO"}) as the single source of truth.
In `@scripts/k8s-tests/pipeline_smoke_test.py`:
- Around line 197-216: The dry-run path calls the private
pipeline._convert_groups_to_job_spec method (after pipeline.run(dry_run=True)),
which couples this smoke test to an internal API; add a short inline comment
immediately above the call to _convert_groups_to_job_spec explaining that this
is intentionally using a private/internal method and warning maintainers that
changes to that method name or signature will break the test, and optionally
mention why it's safe here (validation-only) so future developers understand the
dependency.
- Around line 236-252: Replace the hardcoded wait timeout (900) passed to
backend.wait_for_completion with a CLI-configurable value: add a command-line
argument (e.g., parser.add_argument('--timeout', type=int, default=900)) where
args is built, then call backend.wait_for_completion(handle,
timeout=args.timeout); keep the lazy imports (get_backend, JobStatus) inside the
function as they are to avoid import-time failures.
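The CLI knob described above is a one-liner with argparse; the surrounding parser setup here is assumed for illustration:

```python
import argparse

# Hypothetical parser setup mirroring the suggestion above.
parser = argparse.ArgumentParser(description="K8s pipeline smoke test")
parser.add_argument("--timeout", type=int, default=900,
                    help="seconds to wait for job completion")

args = parser.parse_args(["--timeout", "600"])  # sample argv for demonstration
# later: backend.wait_for_completion(handle, timeout=args.timeout)
```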
In `@scripts/k8s-tests/run_sft_k8s_real.py`:
- Around line 30-102: The train_cmd in main() embeds a large multi-line Python
training script into a shell heredoc (writing to /tmp/train_pipeline.py), which
is fragile and hard to maintain; extract that embedded script into a standalone
Python file (e.g., sft_train_pipeline.py) and update train_cmd to run that file
instead (keep the same environment exports and torchrun invocation), or
alternatively load the script text from a separate resource at runtime and
inject it into /tmp/train_pipeline.py; update references to the heredoc and
/tmp/train_pipeline.py in train_cmd so main() uses the external script
(train_cmd variable) while preserving the torchrun flags (--nproc_per_node,
--nnodes, --node_rank, --master_addr, --master_port) and any environment
variable usages (NODE_RANK, MASTER_ADDR).
In `@scripts/k8s-tests/smoke_test_single_node.sh`:
- Around line 202-213: The current wait treats any non-zero exit as "failed" but
only checks .status.failed; update the post-wait logic to distinguish an actual
job failure vs a timeout/still-running job by querying the job status with
kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath and examining
.status.failed and .status.active (or .status.conditions[*].type for
"Failed"/"Complete"). Concretely: after the kubectl wait block, if
JOB_STATUS="failed" run a kubectl get job and if .status.failed > 0 set a clear
failure message, else if .status.active > 0 report "Job timed out/still
running", otherwise inspect .status.conditions for a Failed condition to report
failure; use the existing JOB_NAME, JOB_STATUS and FAILED variables and the same
namespace variable to implement these checks.
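The decision logic above can be distilled into a pure function (the name and messages are assumptions); the two counts would come from something like `kubectl get job "$JOB_NAME" -n "$NAMESPACE" -o jsonpath='{.status.failed} {.status.active}'`:

```shell
# Classify a Job outcome from its .status.failed / .status.active counts.
# Both fields may be unset in the Job status, hence the :-0 defaults.
classify_job_status() {
  local failed="${1:-0}" active="${2:-0}"
  if [ "$failed" -gt 0 ]; then
    echo "failed"          # pods actually failed
  elif [ "$active" -gt 0 ]; then
    echo "timed-out"       # wait expired but the Job is still running
  else
    echo "indeterminate"   # inspect .status.conditions for Failed/Complete
  fi
}
```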
In `@scripts/k8s-tests/validate_sft_e2e_multinode.sh`:
- Around line 196-207: The script's kubectl wait call blocks until the full
${TIMEOUT} if the Job fails because it only waits for condition=complete; update
the wait invocation that references JOB_NAME and TIMEOUT to include both
conditions (e.g., --for=condition=complete --for=condition=failed) so the
command returns as soon as the Job either completes or fails (Kubernetes ≥1.23
required), and after the wait check the Job status (inspect the job object or
use kubectl get job/$JOB_NAME -o jsonpath='{.status}') to distinguish failure vs
timeout and proceed to print pod logs and exit appropriately.
In `@scripts/k8s-tests/validate_sft_e2e.sh`:
- Around line 303-311: The failure branch that handles a non-completing job
should capture and append the pod logs into the same $LOG_FILE used by the
success path; update the block that runs kubectl logs (the branch that runs
kubectl wait --for=condition=complete ... for job/$JOB_NAME and then fetches POD
via kubectl get pods) so that it writes the diagnostic output into $LOG_FILE
(create the file if necessary) — e.g., redirect the kubectl logs output and any
echo messages into $LOG_FILE (and still print to stdout if desired) so
post-mortem artifacts are preserved for JOB_NAME and NAMESPACE. Ensure you use
the same $LOG_FILE variable and preserve the current --tail=50 behavior when
saving logs.
- fix backend validation/error handling and resource mapping edge cases
- harden k8s manifests/scripts (securityContext, DNS fail-fast, cleanup, pinned images)
- update runbooks/examples to 25.04 and remove dgx-29 references in docs
- full pre-commit passed; real 2-node x 2-GPU MNMG validation passed

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>

- add missing docstrings in changed pipeline backends/utils and k8s test scripts
- update nemo-rl k8s Dockerfile to run as non-root user in final image

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>

- remove hardcoded nemo-skills/nemo-rl:latest in build-and-load.sh
- make immutable tag builds/loads deterministic across nodes

Signed-off-by: Emanuel Scoullos <escoullos@nvidia.com>
Validation rerun with immutable image tag (no shortcut path)
Summary
This PR adds Kubernetes multi-node SFT support and validation assets, including runnable scripts, manifests, and
execution evidence from completed runs.
Code changes
- `nemo_skills/pipeline/backends/base.py`: extends job spec support needed by multi-node execution.
- `nemo_skills/pipeline/backends/kubernetes.py`: multi-node Indexed Job + Headless Service flow, distributed env injection, RDMA options, DNS check init container, pod anti-affinity, image pull policy handling, RBAC preflight.
- `nemo_skills/pipeline/nemo_rl/sft.py`: Kubernetes execution path routing for SFT workflows.
- `nemo_skills/pipeline/utils/declarative.py`: declarative pipeline conversion updates for backend-aware job generation.
- `nemo_skills/pipeline/utils/cluster.py`: backend/timeout config handling improvements.
- `pyproject.toml`: adds `integration` pytest marker used by backend integration tests.
- `tests/test_backends.py`, `tests/test_pipeline_utils.py`: expanded coverage for Kubernetes multi-node behavior and pipeline utilities.
Scripts and purpose
- `scripts/k8s-tests/validate_sft_e2e.sh`: single-node end-to-end SFT validation.
- `scripts/k8s-tests/validate_sft_e2e_multinode.sh`: multi-node E2E validator (Headless Service + Indexed Job + NCCL all-reduce checks).
- `scripts/k8s-tests/smoke_test_single_node.sh`: minimal single-node DDP/NCCL smoke test.
- `scripts/k8s-tests/smoke_test_multi_node.sh`: minimal multi-node DDP/NCCL smoke test.
- `scripts/k8s-tests/pipeline_smoke_test.py`: exercises the Pipeline -> KubernetesBackend path.
- `scripts/k8s-tests/sft_smoke_test.py`: GPT-2 SFT smoke test via Pipeline + KubernetesBackend.
- `scripts/k8s-tests/run_sft_k8s_real.py`: submits a real SFT pipeline job through the NeMo-Skills K8s code path.
- `scripts/k8s-tests/check_nccl_logs.py`: parses NCCL logs for transport/topology sanity.
- `scripts/k8s-tests/generate_sft_data.py`: synthetic SFT data generator.
- `scripts/k8s-tests/image-build/*`: Kaniko build + node import automation and manifests.
- `scripts/k8s-tests/manifests/*`: reusable K8s manifests for single-node, multi-node, and real NeMo-RL runs.

Docs

- `docs/kubernetes-guide.md`: RBAC/service permission updates and guidance.
- `docs/kubernetes-migration.md`: migration flow updates for multi-node readiness checks.
- `docs/kubernetes-status.md`: status updated to include validated multi-node SFT.
- `docs/kubernetes-sft-runbook.md`: infra/backend validation runbook and commands.
- `docs/kubernetes-nemo-rl-sft-runbook.md`: real NeMo-RL SFT runbook (single-node + MNMG).

Cluster configs

- `cluster_configs/example-kubernetes.yaml`: documents new K8s config knobs (rdma, dns_check, scheduling spread, rbac preflight, image pull policy).
- `cluster_configs/kubernetes/rbac.yaml`: adds services create/delete/get/list/watch required for multi-node Headless Services.
- `cluster_configs/kubernetes/README.md`: RBAC verification commands for services.

Cluster Config Integration
The Kubernetes backend changes are wired into the standard NeMo-Skills cluster-config flow.
- The CLI (e.g., `ns nemo_rl sft`) loads the cluster config via `get_cluster_config(...)` (`nemo_skills/pipeline/nemo_rl/sft.py`).
- Configs resolve from `./cluster_configs/<name>.yaml` (`nemo_skills/pipeline/utils/cluster.py`).
- The factory selects the Kubernetes backend when `executor: kubernetes` is set in that config (`nemo_skills/pipeline/backends/factory.py`).
- The backend consumes `namespace`, `containers`, `env_vars`, `image_pull_policy`, `rdma`, `dns_check`, `scheduling`, and `rbac_preflight` (`nemo_skills/pipeline/backends/kubernetes.py`).

Important scope note
The `scripts/k8s-tests/*` utilities used for validation are mostly standalone and often build inline `cluster_config` dicts or call `kubectl` directly, so they do not always load `cluster_configs/*.yaml`. The cluster config updates in this PR are for the canonical CLI/backend path and user-facing documentation.
Validation artifacts (completed runs)

Intentionally not committed, but attached:
- `scripts/k8s-tests/logs/nemo-rl-mnmg-job-status.yaml`
- `scripts/k8s-tests/logs/nemo-rl-mnmg-pod-placement.txt`
- `scripts/k8s-tests/logs/nemo-rl-mnmg-sft-head.log`
- `scripts/k8s-tests/logs/nemo-rl-mnmg-sft-worker.log`
- `scripts/k8s-tests/logs/mn-sft-e2e-20260220-105949.log` (contains `MULTI-NODE PASSED`; includes noisy NumPy warning trace output, but the run completed)
Summary by CodeRabbit
New Features
Documentation
Chores
Tests