Run fails when nsys profiling is enabled for vLLM workers #1932

@scsudhakaran

Description

A Nemotron3 Nano run fails on GB200 when nsys profiling is enabled. The outcome depends on which workers are profiled:

  1. NRL_NSYS_WORKER_PATTERNS="*policy*" works fine
  2. NRL_NSYS_WORKER_PATTERNS="*vllm*" fails with the following error:
(RayWorkerWrapper pid=3634859, ip=10.52.97.42) !!!!!!! Segfault encountered !!!!!!!
(RayWorkerWrapper pid=3634859, ip=10.52.97.42)   File "<unknown>", line 0, in cuGraphLaunch
(RayWorkerWrapper pid=3634859, ip=10.52.97.42)   File "<unknown>", line 0, in cudaGraphLaunch
(RayWorkerWrapper pid=3634859, ip=10.52.97.42)   File "<unknown>", line 0, in at::cuda::CUDAGraph::replay()
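
For readers unfamiliar with the pattern variable, here is a minimal sketch of how a comma-separated list of shell-style globs such as "*policy*,*vllm*" could select workers by name. It assumes the patterns are matched as globs against a worker name; the helper `matches_any` and the worker names `dtensor_policy_worker` and `vllm_generation_worker` are hypothetical and used only for illustration, not taken from the NeMo RL source.

```shell
#!/bin/bash
# Hypothetical helper (not part of NeMo RL): succeeds if NAME matches
# any glob in the comma-separated PATTERNS string.
matches_any() {
    local name="$1" pat
    local -a pats
    # Split on commas; read -a does not glob-expand against the filesystem.
    IFS=',' read -ra pats <<< "$2"
    for pat in "${pats[@]}"; do
        # An unquoted right-hand side inside [[ ]] is treated as a glob pattern.
        if [[ "$name" == $pat ]]; then
            return 0
        fi
    done
    return 1
}

# Assumed worker names, for illustration only.
for worker in dtensor_policy_worker vllm_generation_worker; do
    if matches_any "$worker" "*vllm*"; then
        echo "$worker: would be profiled"
    else
        echo "$worker: would be skipped"
    fi
done
```

Under these assumptions, "*vllm*" alone selects only the vLLM-side worker, which is the configuration that hits the segfault in the CUDA graph replay shown above.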

The Slurm submission script is as follows:

#!/bin/bash
set -euo pipefail

source ~/.bashrc

NEMO_RL_DIR=<NEMO_RL_DIR>
PROJECT_DIR=<PROJECT_DIR>
RESULTS_DIR=<RESULTS_DIR>

CONTAINER=<CONTAINER>
MODEL_CHECKPOINT=<MODEL_CHECKPOINT>
CHAT_TEMPLATE_FILE=<CHAT_TEMPLATE_FILE>

MOUNTS="${PROJECT_DIR}:${PROJECT_DIR}"
MOUNTS+=",${NEMO_RL_DIR}:${NEMO_RL_DIR}"
MOUNTS+=",${HF_CACHE}:/root/.cache/huggingface"

COMMAND="
    NRL_FORCE_REBUILD_VENVS=${NRL_FORCE_REBUILD_VENVS:-true} \
    uv run examples/run_grpo_math.py \
    --config examples/configs/grpo_math_qwen30ba3b_megatron.yaml \
    ++policy.model_name=${MODEL_CHECKPOINT} \
    ++policy.tokenizer.chat_template=${CHAT_TEMPLATE_FILE} \
    ++policy.train_global_batch_size=64 \
    ++policy.megatron_cfg.converter_type=null \
    ++policy.megatron_cfg.context_parallel_size=1 \
    ++policy.megatron_cfg.pipeline_model_parallel_size=1 \
    ++policy.megatron_cfg.activation_checkpointing=true \
    ++policy.megatron_cfg.bias_activation_fusion=false \
    ++policy.megatron_cfg.optimizer.lr=5e-7 \
    ++policy.megatron_cfg.optimizer.min_lr=5e-8 \
    ++loss_fn.reference_policy_kl_penalty=0.001 \
    ++policy.max_total_sequence_length=4096 \
    ++grpo.num_prompts_per_step=8 \
    ++grpo.num_generations_per_prompt=8 \
    ++grpo.max_num_steps=5 \
    ++checkpointing.enabled=false \
    ++logger.log_dir=${RESULTS_DIR}/logs \
    ++logger.wandb_enabled=false \
    ++cluster.num_nodes=2 \
    ++cluster.gpus_per_node=4 \
    ++policy.sequence_packing.enabled=true"


cd "$PROJECT_DIR"
COMMAND="${COMMAND}" \
    HF_HOME="${HF_CACHE}" \
    HF_DATASETS_CACHE="${HF_DATASETS_CACHE}" \
    HF_TOKEN="${HF_TOKEN}" \
    CONTAINER="${CONTAINER}" \
    MOUNTS="${MOUNTS}" \
    RAY_LOG_SYNC_FREQUENCY=1 \
    NRL_NSYS_PROFILE_STEP_RANGE=2:3 \
    NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" \
    sbatch \
        --nodes=2 \
        --account=<ACCOUNT> \
        --partition=<PARTITION> \
        --job-name=<JOB_NAME> \
        --time=00:30:00 \
        ray.sub

Labels

bug: Something isn't working
