
GRPO training on Qwen3-30b-a3b coder instruct model: wrong reward with rm-type math #1117

@chuckxwu


When I use slime for GRPO training on the Qwen3-30b-a3b coder model, with rm-type set to math as below, a correct response is scored with reward 0:

ROLLOUT_ARGS=(
    --prompt-data $data_path
    --input-key prompt
    --label-key label
    --rollout-shuffle
    --rm-type math
    --num-rollout 20
    --rollout-batch-size 32
    --n-samples-per-prompt 8
    --rollout-max-response-len 32768
    --rollout-temperature 0.8

    --global-batch-size 128
    --balance-data
)

(RolloutManager pid=1353, ip=22.3.71.164) [2025-12-15 14:15:52] sglang_rollout.py:367 - First rollout sample: ['[{\'content\': \'Solve the following math problem step by step. The last line of your response should be of the form Answer: \\\\boxed{$Answer} where $Answer is the answer to the problem.\\n\\n方程 x^3-9x^2+8x+2=0 有 3 个实根 p,q,r,则 \\\\df{1}{p^2}+\\\\df{1}{q^2}+\\\\df{1}{r^2}=__________.\\n\\nRemember to put your answer on its own line after "Answer:".\', \'role\': \'user\'}]I need to find the value of $\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2}$ where $p$, $q$, and $r$ are the roots of $x^3 - 9x^2 + 8x + 2 = 0$.\n\nFirst, I\'ll use Vieta\'s formulas for the polynomial $x^3 - 9x^2 + 8x + 2 = 0$.\n\nFor a cubic polynomial $x^3 + ax^2 + bx + c = 0$ with roots $p$, $q$, $r$:\n- $p + q + r = -a$\n- $pq + pr + qr = b$ \n- $pqr = -c$\n\nFor $x^3 - 9x^2 + 8x + 2 = 0$, we have $a = -9$, $b = 8$, $c = 2$.\n\nTherefore:\n- $p + q + r = 9$\n- $pq + pr + qr = 8$\n- $pqr = -2$\n\nNow I need to find $\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2}$.\n\nI can rewrite this as:\n$$\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2} = \\frac{q^2r^2 + p^2r^2 + p^2q^2}{p^2q^2r^2}$$\n\nThe denominator is $(pqr)^2 = (-2)^2 = 4$.\n\nFor the numerator, I need to find $p^2q^2 + p^2r^2 + q^2r^2$.\n\nI can use the identity:\n$(pq + pr + qr)^2 = p^2q^2 + p^2r^2 + q^2r^2 + 2pqr(p + q + r)$\n\nTherefore:\n$p^2q^2 + p^2r^2 + q^2r^2 = (pq + pr + qr)^2 - 2pqr(p + q + r)$\n\nSubstituting the known values:\n$p^2q^2 + p^2r^2 + q^2r^2 = 8^2 - 2(-2)(9) = 64 + 36 = 100$\n\nTherefore:\n$$\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2} = \\frac{100}{4} = 25$$\n\nAnswer: $\\boxed{25}$<|im_end|>'], label: 25, reward: 0

The final answer \boxed{25} matches the label 25, so the reward should be 1 instead of 0.
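To make the expectation concrete, here is a minimal sketch of the kind of check the sample above should pass. It is not the framework's actual math reward implementation: the helper names `last_boxed` and `expected_reward` are hypothetical, and plain string equality stands in for whatever answer normalization the real grader applies. It only shows that extracting the last `\boxed{...}` from the logged response and comparing it with the label gives 1, whether the label is loaded as an integer or as a string.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent/unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)  # matching close brace found
        out.append(ch)
        i += 1
    return None  # braces never closed


def expected_reward(response: str, label) -> int:
    """Hypothetical check: 1 if the last boxed answer equals the label as a string, else 0."""
    answer = last_boxed(response)
    if answer is None:
        return 0
    return int(answer.strip() == str(label).strip())


# Tail of the logged response from the first rollout sample.
tail = "Answer: $\\boxed{25}$<|im_end|>"
print(expected_reward(tail, 25))    # label as int
print(expected_reward(tail, "25"))  # label as str
```

If a check like this returns 1 on the logged sample while training reports 0, the mismatch is likely somewhere in how the response or label reaches the grader (for example the label type, the trailing `<|im_end|>` token, or the answer-extraction pattern), but that is an assumption to verify rather than a confirmed cause.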
