Commit d149a62

test: enable 8k/16k/24k deepscaler nightly tests (#934)
Signed-off-by: Terry Kong <[email protected]>
1 parent b721703
File tree: 17 files changed, +177 −94 lines changed

docs/guides/eval.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -79,7 +79,7 @@ When you complete the evaluation, you will receive a summary similar to the foll
 ```
 ============================================================
 model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime2024'
-max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1 seed=42
 
 metric=pass@1 num_tests_per_prompt=1
 
````

docs/guides/grpo-deepscaler.md

Lines changed: 7 additions & 1 deletion

````diff
@@ -35,11 +35,17 @@ Throughout training, the checkpoints of the model will be saved to the `results`
 uv run examples/run_eval.py \
     generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf \
     data.prompt_file=examples/prompts/cot.txt \
-    generation.vllm_cfg.max_model_len=32768
+    generation.vllm_cfg.max_model_len=32768 \
+    generation.vllm_cfg.enforce_eager=True \
+    generation.temperature=1.0
 ```
 
 Use `generation.model_name` to specify the path to the Hugging Face checkpoint. In addition, we use AIME24 as the validation dataset and calculate pass@1 on it throughout training.
 
+> [!NOTE]
+> AIME24 only has 30 examples, so the accuracy can be very noisy.
+> To reduce the variance, consider running `run_eval.py` with `eval.num_tests_per_prompt=16`.
+
 ## Evaluation Results
 Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model's performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:
````
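The note about noisy AIME24 accuracy can be made concrete: averaging `k` sampled attempts per problem shrinks the sampling-noise component of mean pass@1 by roughly `sqrt(k)`. A minimal Monte Carlo sketch, using hypothetical per-problem pass rates rather than real benchmark data:

```python
import random

random.seed(0)
# Hypothetical per-problem pass rates for a 30-problem benchmark like AIME24.
rates = [random.random() for _ in range(30)]

def run(num_tests):
    # Mean pass@1 estimate with `num_tests` sampled attempts per problem.
    return sum(
        sum(random.random() < r for _ in range(num_tests)) / num_tests
        for r in rates
    ) / len(rates)

def spread(num_tests, trials=200):
    # Standard deviation of the score across repeated evaluations.
    scores = [run(num_tests) for _ in range(trials)]
    mu = sum(scores) / trials
    return (sum((s - mu) ** 2 for s in scores) / trials) ** 0.5

# Sampling noise shrinks roughly 4x when averaging 16 tests per prompt.
print(spread(1), spread(16))
```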

nemo_rl/data/processors.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -64,7 +64,9 @@ def math_data_processor(
         add_generation_prompt=True,
         add_special_tokens=False,
     )
-    user_message["token_ids"] = tokenizer(message, return_tensors="pt")["input_ids"][0]
+    user_message["token_ids"] = tokenizer(
+        message, return_tensors="pt", add_special_tokens=False
+    )["input_ids"][0]
     user_message["content"] = message
     message_log.append(user_message)
```

nemo_rl/evals/eval.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -485,14 +485,15 @@ def _print_results(
     dataset_name = os.path.basename(master_config["data"]["dataset_name"])
     model_name = os.path.basename(generation_config["model_name"])
     max_new_tokens = generation_config["vllm_cfg"]["max_model_len"]
+    seed = master_config["eval"]["seed"]
     temperature = generation_config["temperature"]
     top_p = generation_config["top_p"]
     top_k = generation_config["top_k"]
     average_score = score / dataset_size
 
     print("\n" + "=" * 60)
     print(f"{model_name=} {dataset_name=}")
-    print(f"{max_new_tokens=} {temperature=} {top_p=} {top_k=}\n")
+    print(f"{max_new_tokens=} {temperature=} {top_p=} {top_k=} {seed=}\n")
     print(f"metric={metric[:-1]}{k_value} {num_tests_per_prompt=}\n")
     print(f"score={average_score:.4f} ({score}/{dataset_size})")
     print("=" * 60 + "\n")
```
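The printing code uses Python's self-documenting f-string expressions (Python 3.8+): `f"{seed=}"` expands to the variable name, an equals sign, and the `repr` of its value, which is why adding `{seed=}` to the format string is the entire change. A quick illustration:

```python
# The "=" specifier inside an f-string renders "name=repr(value)".
seed = 42
temperature = 0.0
summary = f"{temperature=} {seed=}"
print(summary)  # temperature=0.0 seed=42
```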

tests/functional/eval.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -27,4 +27,4 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 cat $RUN_LOG | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'data["score"] == 0.1' \
+    'data["score"] == 0.1'
```

tests/functional/eval_async.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -29,4 +29,4 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 cat $RUN_LOG | grep "score=" | sed 's/.*score=\([^ ]*\).*/{"score": \1}/' > $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'data["score"] == 0.1' \
+    'data["score"] == 0.1'
```

tests/functional/grpo.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -38,5 +38,5 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'max(data["train/token_mult_prob_error"]) < 1.05' \
+    'max(data["train/token_mult_prob_error"]) < 1.05'
 
```

tests/functional/grpo_megatron.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -41,5 +41,5 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'max(data["train/token_mult_prob_error"]) < 1.05' \
+    'max(data["train/token_mult_prob_error"]) < 1.05'
 
```

tests/functional/grpo_multiturn.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -41,5 +41,5 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'max(data["train/token_mult_prob_error"]) < 1.1' \
+    'max(data["train/token_mult_prob_error"]) < 1.1'
 
```

tests/functional/grpo_non_colocated.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -39,5 +39,5 @@ uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJE
 uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
 
 uv run tests/check_metrics.py $JSON_METRICS \
-    'max(data["train/token_mult_prob_error"]) < 1.05' \
+    'max(data["train/token_mult_prob_error"]) < 1.05'
 
```
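All six script changes drop a dangling backslash after the final argument. In POSIX shells, a line-ending `\` splices the next line into the same command, so a stray trailing backslash can silently swallow whatever line follows. A small demonstration, driving `sh -c` from Python on a hypothetical snippet:

```python
import subprocess

# With a trailing backslash, the second line becomes arguments to echo
# instead of running as its own command.
spliced = "echo a \\\necho b"
out = subprocess.run(["sh", "-c", spliced], capture_output=True, text=True)
print(out.stdout)  # a echo b

# Without it, both commands run separately.
clean = "echo a\necho b"
out2 = subprocess.run(["sh", "-c", clean], capture_output=True, text=True)
print(out2.stdout)  # a, then b, on separate lines
```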
