Commit d45ff3f

test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (#866)
Signed-off-by: Terry Kong <[email protected]>
1 parent d73c942 commit d45ff3f

File tree: 10 files changed, +213 −13 lines

docs/guides/grpo-deepscaler.md

Lines changed: 5 additions & 5 deletions
@@ -5,12 +5,12 @@ This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reas
 
 ## Train the Model
 We follow the DeepScaleR recipe and train the model in three stages. In the first stage, we train with an 8K context window. In the second stage, we train with a 16K context window. In the third stage, we train with a 24K context window.
-To train the model using NeMo RL, use the `examples/configs/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. We then train with `examples/configs/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.
+To train the model using NeMo RL, use the `examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. We then train with `examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.
 
 ```sh
-uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
-uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
-uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
+uv run examples/run_grpo_math.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml
+uv run examples/run_grpo_math.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
+uv run examples/run_grpo_math.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
 ```
 
 At the end of each stage, you need to specify the Hugging Face checkpoint to continue training with. To get this checkpoint, we convert a model checkpoint to a Hugging Face checkpoint with the following command:

@@ -19,7 +19,7 @@ At the end of each stage, you need to specify the Hugging Face checkpoint to con
 uv run examples/converters/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf
 ```
 
-When running the next command, we use the Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. The 8K and 16K steps can be run on a single 8XH100 80GB node, while the 24K step requires four nodes. If you're running on 8XA100 80GB, you will need at least 1 node for 8K training and four nodes for 16-24k training.
+When running the next command, we use the Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8XH100 80GB node. If you're running on 8XA100 80GB, you will need at least 1 node for 8K training and 2 nodes for 16-24k training.
 
 ## Training Curve
 When using the above commands, we get the following training curve:
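For readers reproducing the updated guide end to end, the sketch below chains the three stages with the checkpoint conversions between them. Only the 8K `step_240` paths come from the guide; the 16K `step_290` directory name is an assumption that mirrors that layout, so substitute the actual step directories from your own runs.

```sh
# Sketch of the full three-stage schedule described in the guide above.
# Stage 1: 8K context for 240 steps, then export a Hugging Face checkpoint.
uv run examples/run_grpo_math.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml
uv run examples/converters/convert_dcp_to_hf.py \
    --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml \
    --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights \
    --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf

# Stage 2: 16K context for 290 steps, warm-started from the 8K HF checkpoint.
uv run examples/run_grpo_math.py \
    --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml \
    policy.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf

# Stage 3: 24K context for 50 steps, warm-started from the converted 16K checkpoint
# (path below assumes the same results/.../step_N/hf layout after converting step 290).
uv run examples/run_grpo_math.py \
    --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml \
    policy.model_name=results/grpo-deepscaler-1.5b-16K/step_290/hf
```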

examples/configs/grpo-deepscaler-1.5b-16K.yaml renamed to examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ loss_fn:
 
 policy:
   max_total_sequence_length: 16384
+  logprob_batch_size: 2
 
   dtensor_cfg:
     enabled: true

examples/configs/grpo-deepscaler-1.5b-24K.yaml renamed to examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml

Lines changed: 1 addition & 3 deletions
@@ -8,6 +8,7 @@ loss_fn:
 
 policy:
   max_total_sequence_length: 24576
+  logprob_batch_size: 2
 
   dtensor_cfg:
     enabled: true

@@ -44,9 +45,6 @@ policy:
       gpu_memory_utilization: 0.8
       enforce_eager: True
       max_model_len: ${policy.max_total_sequence_length}
-      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
-      # For Gemma models, we need to use "auto" due to a vllm bug
-      load_format: dummy
 
 cluster:
   gpus_per_node: 8

examples/configs/grpo-deepscaler-1.5b-8K.yaml renamed to examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml

Lines changed: 0 additions & 3 deletions
@@ -103,9 +103,6 @@ policy:
       gpu_memory_utilization: 0.6
       max_model_len: ${policy.max_total_sequence_length}
       enforce_eager: True
-      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
-      # For Gemma models, we need to use "auto" due to a vllm bug
-      load_format: dummy
     colocated:
       # true: generation shares training GPUs
      # false: uses dedicated generation resources
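The renamed recipes still accept command-line overrides with the same dotted syntax the guide uses for `policy.model_name`, so the keys touched above can also be adjusted at launch time without editing the YAML. A minimal sketch; the `policy.logprob_batch_size=1` value is purely illustrative, not a recommendation:

```sh
# Hypothetical override of the newly added logprob_batch_size key at launch time.
uv run examples/run_grpo_math.py \
    --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml \
    policy.model_name=/path/to/8K/checkpoint/hf \
    policy.logprob_batch_size=1
```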

nemo_rl/evals/eval.py

Lines changed: 1 addition & 1 deletion
@@ -427,7 +427,7 @@ def _save_evaluation_data_to_json(evaluation_data, master_config, save_path):
         "model_name": master_config["generation"]["model_name"],
         "dataset_name": master_config["data"]["dataset_name"],
         "metric": master_config["eval"]["metric"],
-        "pass_k_value": master_config["eval"]["pass_k_value"],
+        "k_value": master_config["eval"]["k_value"],
         "num_tests_per_prompt": master_config["eval"]["num_tests_per_prompt"],
         "temperature": master_config["generation"]["temperature"],
         "top_p": master_config["generation"]["top_p"],
tests/test_suites/llm/grpo-deepscaler-1.5b-16K.sh

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/bin/bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
source $SCRIPT_DIR/common.env

# ===== BEGIN CONFIG =====
NUM_NODES=1
STEPS_PER_RUN=30
MAX_STEPS=30
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )) # Round up
NUM_MINUTES=240
# ===== END CONFIG =====

exit_if_max_steps_reached

# Use checkpoint created from the 8K checkpoint in grpo-deepscaler-1.5b-8K.sh
if [[ -z "$CACHED_MODEL_PATH" ]]; then
    echo "Need to set CACHED_MODEL_PATH to the path to the trained 8K checkpoint"
    exit 1
fi

# Run the experiment
cd $PROJECT_ROOT
uv run examples/run_grpo_math.py \
    --config $CONFIG_PATH \
    policy.model_name=$CACHED_MODEL_PATH \
    grpo.max_num_steps=$MAX_STEPS \
    logger.log_dir=$LOG_DIR \
    logger.wandb_enabled=True \
    logger.wandb.project=nemo-rl \
    logger.wandb.name=$EXP_NAME \
    logger.monitor_gpus=True \
    logger.tensorboard_enabled=True \
    checkpointing.enabled=True \
    checkpointing.checkpoint_dir=$CKPT_DIR \
    $@ \
    2>&1 | tee $RUN_LOG

# Convert tensorboard logs to json
uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
    uv run tests/check_metrics.py $JSON_METRICS \
        'mean(data["train/token_mult_prob_error"]) < 1.1' \
        "data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1"
fi

# TODO: enable in subsequent PR to do a quick accuracy check
## Convert 16k checkpoint
#uv run examples/converters/convert_dcp_to_hf.py \
#    --config=$CKPT_DIR/step_${MAX_STEPS}/config.yaml \
#    --dcp-ckpt-path=$CKPT_DIR/step_${MAX_STEPS}/policy/weights \
#    --hf-ckpt-path=$CKPT_DIR/grpo-deepscaler-16k-${MAX_STEPS}-hf
#
## Run eval
#uv run examples/run_eval.py \
#    generation.model_name=$CKPT_DIR/grpo-deepscaler-16k-${MAX_STEPS}-hf \
#    data.prompt_file=examples/prompts/cot.txt \
#    generation.vllm_cfg.max_model_len=32768 2>&1 | tee ${RUN_LOG}.aime-16k
#
#cat ${RUN_LOG}.aime-16k | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > ${RUN_LOG}-16k-metric.json
#
#uv run tests/check_metrics.py ${RUN_LOG}-16k-metric.json \
#    'data["score"] >= 0.25' \
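The step gate in these scripts assumes `json_dump_tb_logs.py` writes a JSON object keyed by metric name, with per-step values keyed by step number; that layout is inferred from the `jq` filter above rather than documented in this commit. A toy illustration under that assumption:

```sh
# Toy run of the step check: build a fake metrics file and extract the max logged step.
cat > /tmp/metrics.json <<'EOF'
{
  "train/loss": {"10": 0.9, "20": 0.7, "30": 0.5},
  "train/token_mult_prob_error": {"10": 1.02, "20": 1.01, "30": 1.03}
}
EOF
# Prints 30, the highest step recorded for train/loss, which the script compares to MAX_STEPS.
jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' /tmp/metrics.json
```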
tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/bin/bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
source $SCRIPT_DIR/common.env

# ===== BEGIN CONFIG =====
NUM_NODES=4
STEPS_PER_RUN=30
MAX_STEPS=30
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )) # Round up
NUM_MINUTES=240
# ===== END CONFIG =====

exit_if_max_steps_reached

# Use checkpoint created from the 16K checkpoint in grpo-deepscaler-1.5b-16K.sh
if [[ -z "$CACHED_MODEL_PATH" ]]; then
    echo "Need to set CACHED_MODEL_PATH to the path to the trained 16K checkpoint"
    exit 1
fi

# Run the experiment
cd $PROJECT_ROOT
uv run examples/run_grpo_math.py \
    --config $CONFIG_PATH \
    policy.model_name=$CACHED_MODEL_PATH \
    grpo.max_num_steps=$MAX_STEPS \
    logger.log_dir=$LOG_DIR \
    logger.wandb_enabled=True \
    logger.wandb.project=nemo-rl \
    logger.wandb.name=$EXP_NAME \
    logger.monitor_gpus=True \
    logger.tensorboard_enabled=True \
    checkpointing.enabled=True \
    checkpointing.checkpoint_dir=$CKPT_DIR \
    $@ \
    2>&1 | tee $RUN_LOG

# Convert tensorboard logs to json
uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
    uv run tests/check_metrics.py $JSON_METRICS \
        'mean(data["train/token_mult_prob_error"]) < 1.1' \
        "data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1"
fi

# TODO: enable in subsequent PR to do a quick accuracy check
## Convert 24k checkpoint
#uv run examples/converters/convert_dcp_to_hf.py \
#    --config=$CKPT_DIR/step_${MAX_STEPS}/config.yaml \
#    --dcp-ckpt-path=$CKPT_DIR/step_${MAX_STEPS}/policy/weights \
#    --hf-ckpt-path=$CKPT_DIR/grpo-deepscaler-24k-${MAX_STEPS}-hf
#
## Run eval
#uv run examples/run_eval.py \
#    generation.model_name=$CKPT_DIR/grpo-deepscaler-24k-${MAX_STEPS}-hf \
#    data.prompt_file=examples/prompts/cot.txt \
#    generation.vllm_cfg.max_model_len=32768 2>&1 | tee ${RUN_LOG}.aime-24k
#
#cat ${RUN_LOG}.aime-24k | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > ${RUN_LOG}-24k-metric.json
#
#uv run tests/check_metrics.py ${RUN_LOG}-24k-metric.json \
#    'data["score"] >= 0.25' \
tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
#!/bin/bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
source $SCRIPT_DIR/common.env

# ===== BEGIN CONFIG =====
NUM_NODES=1
STEPS_PER_RUN=40
MAX_STEPS=40
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )) # Round up
NUM_MINUTES=240
# ===== END CONFIG =====

exit_if_max_steps_reached

# Run the experiment
cd $PROJECT_ROOT
uv run examples/run_grpo_math.py \
    --config $CONFIG_PATH \
    grpo.max_num_steps=$MAX_STEPS \
    logger.log_dir=$LOG_DIR \
    logger.wandb_enabled=True \
    logger.wandb.project=nemo-rl \
    logger.wandb.name=$EXP_NAME \
    logger.monitor_gpus=True \
    logger.tensorboard_enabled=True \
    checkpointing.enabled=True \
    checkpointing.checkpoint_dir=$CKPT_DIR \
    $@ \
    2>&1 | tee $RUN_LOG

# Convert tensorboard logs to json
uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
    uv run tests/check_metrics.py $JSON_METRICS \
        'mean(data["train/token_mult_prob_error"]) < 1.1' \
        "data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1"
fi

# TODO: enable in subsequent PR to do a quick accuracy check
## Convert 8k checkpoint
#uv run examples/converters/convert_dcp_to_hf.py \
#    --config=$CKPT_DIR/step_${MAX_STEPS}/config.yaml \
#    --dcp-ckpt-path=$CKPT_DIR/step_${MAX_STEPS}/policy/weights \
#    --hf-ckpt-path=$CKPT_DIR/grpo-deepscaler-8k-${MAX_STEPS}-hf
#
## Run eval
#uv run examples/run_eval.py \
#    generation.model_name=$CKPT_DIR/grpo-deepscaler-8k-${MAX_STEPS}-hf \
#    data.prompt_file=examples/prompts/cot.txt \
#    generation.vllm_cfg.max_model_len=32768 2>&1 | tee ${RUN_LOG}.aime-8k
#
#cat ${RUN_LOG}.aime-8k | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > ${RUN_LOG}-8k-metric.json
#
#uv run tests/check_metrics.py ${RUN_LOG}-8k-metric.json \
#    'data["score"] >= 0.25' \
#
##uv run examples/run_eval.py \
##    generation.model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
##    data.prompt_file=examples/prompts/cot.txt \
##    generation.vllm_cfg.max_model_len=32768 2>&1 | tee ${RUN_LOG}.aime-baseline
#
##cat ${RUN_LOG}.aime-baseline | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > ${RUN_LOG}-baseline-metric.json
#
##uv run tests/check_metrics.py ${RUN_LOG}-baseline-metric.json \
##    'data["score"] == 0.2' \

tests/test_suites/nightly.txt

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@ tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4sp.v3.sh
 # Functional 32b run
 tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8sp-actckpt.v3.sh
 
+# Deepscaler (short tests)
+tests/test_suites/llm/grpo-deepscaler-1.5b-16K.sh
+tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh
+tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh
+
 #######
 # SFT #
 #######

tests/unit/test_recipes_and_test_suites.py

Lines changed: 3 additions & 1 deletion
@@ -283,6 +283,8 @@ def test_all_recipes_can_merge_configs_with_base_config(
 ):
     from omegaconf import OmegaConf
 
+    from nemo_rl.utils.config import load_config
+
     base_yaml = os.path.join(project_root, algo_base_yaml)
     base_config = OmegaConf.load(base_yaml)
     # Would result in an error if we couldn't merge our config with the recipe's config

@@ -293,7 +295,7 @@
             # test_all_recipes_start_with_algo_hyphen()
             continue
         recipe_yaml_path = os.path.join(recipes_dir, recipe_yaml)
-        recipe_config = OmegaConf.load(recipe_yaml_path)
+        recipe_config = load_config(recipe_yaml_path)
         OmegaConf.set_struct(recipe_config, True)
         # This will raise a error if the config can't be merged
         print(f"Merging {recipe_yaml} with {base_yaml}")
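To exercise the updated merge test locally, something like the following should work, assuming pytest is available in the `uv`-managed project environment:

```sh
# Run only the recipe/base-config merge test touched by this commit.
uv run pytest tests/unit/test_recipes_and_test_suites.py \
    -k test_all_recipes_can_merge_configs_with_base_config -v
```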
