Open
Labels
bug (Something isn't working)
Description
A Nemotron3 Nano run fails on GB200 when nsys profiling is enabled.
Setting NRL_NSYS_WORKER_PATTERNS="*policy*" works fine.
Setting NRL_NSYS_WORKER_PATTERNS="*vllm*" fails with the following error:
(RayWorkerWrapper pid=3634859, ip=10.52.97.42) !!!!!!! Segfault encountered !!!!!!!
(RayWorkerWrapper pid=3634859, ip=10.52.97.42) File "<unknown>", line 0, in cuGraphLaunch
(RayWorkerWrapper pid=3634859, ip=10.52.97.42) File "<unknown>", line 0, in cudaGraphLaunch
(RayWorkerWrapper pid=3634859, ip=10.52.97.42) File "<unknown>", line 0, in at::cuda::CUDAGraph::replay()
The Slurm-based launch script is as follows:
#!/bin/bash
set -euo pipefail
source ~/.bashrc
NEMO_RL_DIR=<NEMO_RL_DIR>
PROJECT_DIR=<PROJECT_DIR>
RESULTS_DIR=<RESULTS_DIR>
CONTAINER=<CONTAINER>
MODEL_CHECKPOINT=<MODEL_CHECKPOINT>
CHAT_TEMPLATE_FILE=<CHAT_TEMPLATE_FILE>
MOUNTS="${PROJECT_DIR}:${PROJECT_DIR}"
MOUNTS+=",${NEMO_RL_DIR}:${NEMO_RL_DIR}"
MOUNTS+=",${HF_CACHE}:/root/.cache/huggingface"
COMMAND="
NRL_FORCE_REBUILD_VENVS=${NRL_FORCE_REBUILD_VENVS:-true} \
uv run examples/run_grpo_math.py \
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml \
++policy.model_name=${MODEL_CHECKPOINT} \
++policy.tokenizer.chat_template=${CHAT_TEMPLATE_FILE} \
++policy.train_global_batch_size=64 \
++policy.megatron_cfg.converter_type=null \
++policy.megatron_cfg.context_parallel_size=1 \
++policy.megatron_cfg.pipeline_model_parallel_size=1 \
++policy.megatron_cfg.activation_checkpointing=true \
++policy.megatron_cfg.bias_activation_fusion=false \
++policy.megatron_cfg.optimizer.lr=5e-7 \
++policy.megatron_cfg.optimizer.min_lr=5e-8 \
++loss_fn.reference_policy_kl_penalty=0.001 \
++policy.max_total_sequence_length=4096 \
++grpo.num_prompts_per_step=8 \
++grpo.num_generations_per_prompt=8 \
++grpo.max_num_steps=5 \
++checkpointing.enabled=false \
++logger.log_dir=${RESULTS_DIR}/logs \
++logger.wandb_enabled=false \
++cluster.num_nodes=2 \
++cluster.gpus_per_node=4 \
++policy.sequence_packing.enabled=true"
cd "$PROJECT_DIR"
COMMAND="${COMMAND}" \
HF_HOME="${HF_CACHE}" \
HF_DATASETS_CACHE="${HF_DATASETS_CACHE}" \
HF_TOKEN="${HF_TOKEN}" \
CONTAINER="${CONTAINER}" \
MOUNTS="${MOUNTS}" \
RAY_LOG_SYNC_FREQUENCY=1 \
NRL_NSYS_PROFILE_STEP_RANGE=2:3 \
NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" \
sbatch \
--nodes=2 \
--account=<ACCOUNT> \
--partition=<PARTITION> \
--job-name=<JOB_NAME> \
--time=00:30:00 \
ray.sub
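
For anyone narrowing this down: the script above passes both patterns at once ("*policy*,*vllm*"), so the segfault depends on which workers actually get wrapped by nsys. The sketch below shows how a comma-separated pattern list could select workers, assuming the patterns are shell-style globs matched against worker names. The helper matches_any_pattern and the two worker names are hypothetical, chosen only for illustration; this is not NeMo RL's actual matching code.

```shell
#!/bin/sh
# Hypothetical helper (not part of NeMo RL): succeeds if NAME matches any of
# the comma-separated glob patterns in PATTERNS.
# The function body runs in a subshell so that the IFS and `set -f` changes
# do not leak into the caller.
matches_any_pattern() (
  name=$1
  set -f   # disable pathname expansion so *vllm* is not expanded against files
  IFS=,    # split the pattern list on commas
  for pattern in $2; do
    case "$name" in
      $pattern) exit 0 ;;   # unquoted on purpose: treated as a glob pattern
    esac
  done
  exit 1
)

# Illustrative worker names (assumptions, not taken from the logs above).
for worker in megatron_policy_worker vllm_generation_worker; do
  if matches_any_pattern "$worker" "*vllm*"; then
    echo "$worker: would be profiled"
  else
    echo "$worker: skipped"
  fi
done
```

Under that assumption, running once with only "*policy*" and once with only "*vllm*" separates the two cases described above, with the failing case being the one that profiles the vLLM workers.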