[LLM] Rewrite GSM8K reward function to follow standard GRPO conventions by vmoens · Pull Request #3542 · pytorch/rl

vmoens · 2026-03-05T11:01:14Z

Stack from ghstack (oldest at bottom):

Replace the complex 5-component additive reward (0-100 scale) with the
standard 0/0.1/1.0 convention used by veRL and DeepSeek-R1.

Key changes:

- Use regex instead of xml.etree for tag extraction (robust to malformed output)
- Add answer normalization (commas, dollar signs, trailing zeros, whitespace)
- Remove reward_contained (substring match anti-pattern)
- Simplify to: 1.0 correct, 0.1 format ok but wrong, 0.0 no answer
- Add configurable format_reward and correct_reward parameters
- Add comprehensive unit tests for the reward parser
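
For readers who want a concrete picture of the scheme listed above, here is a minimal sketch of a 0/0.1/1.0 reward. It is not the code from this PR. The `<answer>` tag name, the function names `normalize_answer` and `gsm8k_reward`, and the choice to score the last tagged answer are all assumptions made for illustration. Only the `format_reward` and `correct_reward` parameter names come from the description.

```python
import re

# Assumed tag name: the PR description only mentions "tag extraction".
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def normalize_answer(text: str) -> str:
    """Drop whitespace, commas, dollar signs, and trailing decimal zeros."""
    text = "".join(text.split()).replace(",", "").replace("$", "")
    # Only trim zeros when the string is a plain decimal number like "3.50".
    if re.fullmatch(r"-?\d+\.\d+", text):
        text = text.rstrip("0").rstrip(".")
    return text


def gsm8k_reward(
    response: str,
    target: str,
    format_reward: float = 0.1,
    correct_reward: float = 1.0,
) -> float:
    """Hypothetical reward: correct, format-only, or no answer at all."""
    # A regex still matches when other tags are unclosed, which would
    # make a strict XML parser raise on the whole response.
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return 0.0
    predicted = normalize_answer(matches[-1])  # assumption: last answer wins
    if not predicted:
        return 0.0  # empty tag is treated as "no answer" in this sketch
    if predicted == normalize_answer(target):
        return correct_reward
    return format_reward
```

With this sketch, a response ending in `<answer>$1,200.00</answer>` scores `correct_reward` against the target `1200`, a well-formed but wrong answer scores `format_reward`, and a response containing the right number without any tag scores 0.0, since substring matching is no longer rewarded.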

Made-with: Cursor

[ghstack-poisoned]

pytorch-bot · 2026-03-05T11:01:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3542

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Cancelled Job, 6 Unrelated Failures

As of commit 457915d with merge base 491ed0f:

NEW FAILURES - The following jobs have failed:

Continuous Benchmark (PR) / GPU Pytest benchmark (gh)
Process completed with exit code 1.
Generate documentation / build-docs (3.12, 12.8) / linux-job (gh)
RuntimeError: Command docker exec -t 96dee7af2958f1a09ae28c5a9e7acc6771ff625face362d80e4d90f057f4729b /exec failed with exit code 2
PR Label / add-label (gh)
Process completed with exit code 1.

CANCELLED JOB - The following job was cancelled. Please retry:

Continuous Benchmark (PR) / CPU Pytest benchmark (gh)
Process completed with exit code 1.

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Unit-tests on Linux / tests-cpu (3.10) / linux-job (gh) (trunk failure)
test/test_rb.py::TestStorages::test__rand_given_ndim_recompile
Unit-tests on Linux / tests-cpu (3.11) / linux-job (gh) (trunk failure)
test/test_rb.py::TestStorages::test__rand_given_ndim_recompile
Unit-tests on Linux / tests-cpu (3.12) / linux-job (gh) (trunk failure)
test/test_rb.py::TestStorages::test__rand_given_ndim_recompile
Unit-tests on Linux / tests-cpu (3.13) / linux-job (gh) (trunk failure)
test/test_rb.py::TestStorages::test__rand_given_ndim_recompile
Unit-tests on Linux / tests-olddeps (3.10, 11.8) / linux-job (gh) (trunk failure)
test/test_loggers.py::TestCSVLogger::test_log_video[mp4-steps1]
Unit-tests on Linux / tests-optdeps (3.12, 13.0) / linux-job (gh) (trunk failure)
test/test_rb.py::TestStorages::test__rand_given_ndim_recompile

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-05T11:01:24Z

⚠️ PR Title Label Error

Unknown or invalid prefix [LLM].

Current title: [LLM] Rewrite GSM8K reward function to follow standard GRPO conventions

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

| Prefix | Label Applied | Example |
|---|---|---|
| `[BugFix]` | BugFix | `[BugFix] Fix memory leak in collector` |
| `[Feature]` | Feature | `[Feature] Add new optimizer` |
| `[Doc]` or `[Docs]` | Documentation | `[Doc] Update installation guide` |
| `[Refactor]` | Refactoring | `[Refactor] Clean up module imports` |
| `[CI]` | CI | `[CI] Fix workflow permissions` |
| `[Test]` or `[Tests]` | Tests | `[Tests] Add unit tests for buffer` |
| `[Environment]` or `[Environments]` | Environments | `[Environments] Add Gymnasium support` |
| `[Data]` | Data | `[Data] Fix replay buffer sampling` |
| `[Performance]` or `[Perf]` | Performance | `[Performance] Optimize tensor ops` |
| `[BC-Breaking]` | bc breaking | `[BC-Breaking] Remove deprecated API` |
| `[Deprecation]` | Deprecation | `[Deprecation] Mark old function` |
| `[Quality]` | Quality | `[Quality] Fix typos and add codespell` |

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).

Update

457915d

[ghstack-poisoned]

meta-cla bot added the CLA Signed label Mar 5, 2026

github-actions bot added the llm/ and sota-implementations/ labels Mar 5, 2026

This was referenced Mar 5, 2026:

[LLM] Simplify IFEval reward aggregator #3543 (Open)
[LLM] Add MATH (competition mathematics) environment #3544 (Open)
[LLM] Add Countdown numbers-game environment #3545 (Open)
[LLM] Wire MATH and Countdown into GRPO and Expert Iteration scripts #3546 (Open)

vmoens wants to merge 1 commit into gh/vmoens/234/base from gh/vmoens/234/head
