
Commit e883ac4

cp: feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoRA) (1648) into r0.5.0 (#1697)
Signed-off-by: ruit <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Rayen <[email protected]>
1 parent 9902db0 commit e883ac4

6 files changed: +129 -3 lines changed

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+defaults: ../../sft.yaml
+sft:
+  max_num_steps: 100
+checkpointing:
+  enabled: false
+policy:
+  model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
+  train_global_batch_size: 16
+  max_total_sequence_length: 2048
+  dtensor_cfg:
+    lora_cfg:
+      enabled: true
+      dim: 256
+      alpha: 512
+      use_triton: false
+logger:
+  wandb:
+    project: nemo-rl
+    name: sft-nanov3-30BA3B-2n8g-fsdp2-lora
+  tensorboard:
+    log_dir: tb_logs-sft-nanov3-30BA3B-2n8g-fsdp2-lora
+  mlflow:
+    run_name: sft-nanov3-30BA3B-2n8g-fsdp2-lora
+cluster:
+  gpus_per_node: 8
+  num_nodes: 2
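
For orientation, the LoRA settings above can be read against the conventional LoRA formulation, where the low-rank update is scaled by alpha divided by the rank. The sketch below assumes that convention applies to `dim` and `alpha` in this recipe (the diff does not say so). `lora_scale` and `lora_param_count` are hypothetical helper names, and the 1024x1024 layer shape is purely illustrative, not taken from the model.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Scaling applied to the low-rank update B @ A (conventional LoRA: alpha / rank)."""
    if dim <= 0:
        raise ValueError("dim (rank) must be positive")
    return alpha / dim


def lora_param_count(in_features: int, out_features: int, dim: int) -> int:
    """Trainable parameters in one adapter: A is (dim x in_features), B is (out_features x dim)."""
    return dim * (in_features + out_features)


# Values from the recipe above: dim=256, alpha=512.
scale = lora_scale(alpha=512, dim=256)

# Hypothetical layer shape, for illustration only.
adapter_params = lora_param_count(in_features=1024, out_features=1024, dim=256)
print(scale, adapter_params)
```

Under that assumption, the recipe's adapters contribute with a scale of 2.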
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+defaults: ../../sft.yaml
+sft:
+  max_num_steps: 100
+checkpointing:
+  enabled: false
+policy:
+  model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
+  train_global_batch_size: 16
+  max_total_sequence_length: 2048
+logger:
+  wandb:
+    project: nemo-rl
+    name: sft-nanov3-30BA3B-2n8g-fsdp2
+  tensorboard:
+    log_dir: tb_logs-sft-nanov3-30BA3B-2n8g-fsdp2
+  mlflow:
+    run_name: sft-nanov3-30BA3B-2n8g-fsdp2
+cluster:
+  gpus_per_node: 8
+  num_nodes: 2
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+#!/bin/bash
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
+source $SCRIPT_DIR/common.env
+
+# ===== BEGIN CONFIG =====
+NUM_NODES=2
+STEPS_PER_RUN=20 # step_time ~ 10sec
+MAX_STEPS=20
+NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )) # Round up
+NUM_MINUTES=15 # Usually 15 minutes is enough for 20 steps, but we add a buffer of 3 minutes in metrics check
+# ===== END CONFIG =====
+
+exit_if_max_steps_reached
+
+# Run the experiment
+cd $PROJECT_ROOT
+uv run examples/run_sft.py \
+    --config $CONFIG_PATH \
+    sft.max_num_steps=$MAX_STEPS \
+    logger.log_dir=$LOG_DIR \
+    logger.wandb_enabled=True \
+    logger.wandb.project=nemo-rl \
+    logger.wandb.name=$EXP_NAME \
+    logger.monitor_gpus=True \
+    logger.tensorboard_enabled=True \
+    checkpointing.checkpoint_dir=$CKPT_DIR \
+    $@ \
+    2>&1 | tee $RUN_LOG
+
+# Convert tensorboard logs to json
+uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
+
+# Only run metrics if the target step is reached
+if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+    uv run tests/check_metrics.py $JSON_METRICS \
+        'data["train/loss"]["20"] < 2.03' \
+        'mean(data["timing/train/total_step_time"], 2) < 18'
+fi
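
The two pieces of arithmetic in the script above, the rounded-up `NUM_RUNS` and the `jq` gate on the highest logged step, can be mirrored in a few lines of Python. This is only a sketch: `num_runs` and `max_logged_step` are hypothetical helper names, and the sample payload is a made-up stand-in for the JSON metrics file with placeholder values.

```python
import json


def num_runs(max_steps: int, steps_per_run: int) -> int:
    """Ceiling division, mirroring (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN."""
    return (max_steps + steps_per_run - 1) // steps_per_run


def max_logged_step(metrics: dict, key: str = "train/loss") -> int:
    """Largest step recorded under `key`.

    Step numbers are stored as string keys, so they are converted to int
    before taking the max (the jq filter does the same with map(tonumber)).
    """
    return max(int(step) for step in metrics[key])


# Hypothetical payload; the 0.0 values are placeholders, not real losses.
sample = json.loads('{"train/loss": {"1": 0.0, "9": 0.0, "20": 0.0}}')

MAX_STEPS = 20
runs = num_runs(MAX_STEPS, steps_per_run=20)
should_check = max_logged_step(sample) >= MAX_STEPS
print(runs, should_check)
```

The integer conversion matters because comparing the keys as strings would rank "9" above "20".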
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+#!/bin/bash
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
+source $SCRIPT_DIR/common.env
+
+# ===== BEGIN CONFIG =====
+NUM_NODES=2
+STEPS_PER_RUN=20 # step_time ~ 15sec
+MAX_STEPS=20
+NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )) # Round up
+NUM_MINUTES=15
+# ===== END CONFIG =====
+
+exit_if_max_steps_reached
+
+# Run the experiment
+cd $PROJECT_ROOT
+uv run examples/run_sft.py \
+    --config $CONFIG_PATH \
+    sft.max_num_steps=$MAX_STEPS \
+    logger.log_dir=$LOG_DIR \
+    logger.wandb_enabled=True \
+    logger.wandb.project=nemo-rl \
+    logger.wandb.name=$EXP_NAME \
+    logger.monitor_gpus=True \
+    logger.tensorboard_enabled=True \
+    checkpointing.checkpoint_dir=$CKPT_DIR \
+    $@ \
+    2>&1 | tee $RUN_LOG
+
+# Convert tensorboard logs to json
+uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
+
+# Only run metrics if the target step is reached
+if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+    uv run tests/check_metrics.py $JSON_METRICS \
+        'data["train/loss"]["20"] < 1.98' \
+        'mean(data["timing/train/total_step_time"], 2) < 15'
+fi

tests/test_suites/nightly.txt

Lines changed: 4 additions & 0 deletions
@@ -90,6 +90,10 @@ tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
 # gpt-oss 20b DeepEP test
 tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
 
+# Nemotron 3 Nano 30B A3B Base BF16 tests
+tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
+tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
+
 #######
 # DPO #
 #######

tests/unit/test_recipes_and_test_suites.py

Lines changed: 3 additions & 3 deletions
@@ -180,7 +180,7 @@ def test_all_recipe_yamls_accounted_for_in_test_suites(
     )
 
 
-def test_nightly_compute_stays_below_1130_hours(nightly_test_suite, tracker):
+def test_nightly_compute_stays_below_1140_hours(nightly_test_suite, tracker):
     command = f"DRYRUN=1 HF_HOME=... HF_DATASETS_CACHE=... CONTAINER= ACCOUNT= PARTITION= ./tools/launch {' '.join(nightly_test_suite)}"
 
     print(f"Running command: {command}")
@@ -212,8 +212,8 @@ def test_nightly_compute_stays_below_1130_hours(nightly_test_suite, tracker):
         f"Last line of output was not as expected: '{last_line}'"
     )
     total_gpu_hours = float(last_line.split(":")[-1].strip())
-    assert total_gpu_hours <= 1130, (
-        f"Total GPU hours exceeded 1130: {last_line}. We should revisit the test suites to reduce the total GPU hours."
+    assert total_gpu_hours <= 1140, (
+        f"Total GPU hours exceeded 1140: {last_line}. We should revisit the test suites to reduce the total GPU hours."
     )
     tracker.track("total_nightly_gpu_hours", total_gpu_hours)
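
The budget bump from 1130 to 1140 can be sanity-checked with rough arithmetic. The sketch below assumes the dry run charges each test nodes x GPUs per node x runs x minutes / 60 GPU-hours; that formula is an assumption about `./tools/launch`, not something shown in this diff, and `estimated_gpu_hours` is a hypothetical helper.

```python
def estimated_gpu_hours(num_nodes: int, gpus_per_node: int, num_runs: int, num_minutes: int) -> float:
    """Assumed cost model: every GPU on every node is billed for the full time slot of each run."""
    return num_nodes * gpus_per_node * num_runs * num_minutes / 60


# Values from the new test scripts and recipes: 2 nodes, 8 GPUs per node, 1 run, 15 minutes.
per_test = estimated_gpu_hours(num_nodes=2, gpus_per_node=8, num_runs=1, num_minutes=15)
added = 2 * per_test  # two new nightly tests
headroom = 1140 - 1130  # increase in the asserted ceiling
print(per_test, added, headroom)
```

Under that assumption the two new tests add about 8 GPU-hours, which fits within the 10-hour increase in the ceiling.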
