Description
I am running GRPO training with slime on the Qwen3-30b-a3b coder model, with --rm-type set to math as below:
ROLLOUT_ARGS=(
--prompt-data $data_path
--input-key prompt
--label-key label
--rollout-shuffle
--rm-type math
--num-rollout 20
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 32768
--rollout-temperature 0.8
--global-batch-size 128
--balance-data
)
The first rollout sample is logged as follows:
(RolloutManager pid=1353, ip=22.3.71.164) [2025-12-15 14:15:52] sglang_rollout.py:367 - First rollout sample: ['[{\'content\': \'Solve the following math problem step by step. The last line of your response should be of the form Answer: \\\\boxed{$Answer} where $Answer is the answer to the problem.\\n\\n方程 x^3-9x^2+8x+2=0 有 3 个实根 p,q,r,则 \\\\df{1}{p^2}+\\\\df{1}{q^2}+\\\\df{1}{r^2}=__________.\\n\\nRemember to put your answer on its own line after "Answer:".\', \'role\': \'user\'}]I need to find the value of $\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2}$ where $p$, $q$, and $r$ are the roots of $x^3 - 9x^2 + 8x + 2 = 0$.\n\nFirst, I\'ll use Vieta\'s formulas for the polynomial $x^3 - 9x^2 + 8x + 2 = 0$.\n\nFor a cubic polynomial $x^3 + ax^2 + bx + c = 0$ with roots $p$, $q$, $r$:\n- $p + q + r = -a$\n- $pq + pr + qr = b$ \n- $pqr = -c$\n\nFor $x^3 - 9x^2 + 8x + 2 = 0$, we have $a = -9$, $b = 8$, $c = 2$.\n\nTherefore:\n- $p + q + r = 9$\n- $pq + pr + qr = 8$\n- $pqr = -2$\n\nNow I need to find $\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2}$.\n\nI can rewrite this as:\n$$\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2} = \\frac{q^2r^2 + p^2r^2 + p^2q^2}{p^2q^2r^2}$$\n\nThe denominator is $(pqr)^2 = (-2)^2 = 4$.\n\nFor the numerator, I need to find $p^2q^2 + p^2r^2 + q^2r^2$.\n\nI can use the identity:\n$(pq + pr + qr)^2 = p^2q^2 + p^2r^2 + q^2r^2 + 2pqr(p + q + r)$\n\nTherefore:\n$p^2q^2 + p^2r^2 + q^2r^2 = (pq + pr + qr)^2 - 2pqr(p + q + r)$\n\nSubstituting the known values:\n$p^2q^2 + p^2r^2 + q^2r^2 = 8^2 - 2(-2)(9) = 64 + 36 = 100$\n\nTherefore:\n$$\\frac{1}{p^2} + \\frac{1}{q^2} + \\frac{1}{r^2} = \\frac{100}{4} = 25$$\n\nAnswer: $\\boxed{25}$<|im_end|>'], label: 25, reward: 0
The boxed answer is 25 and the label is 25, so the reward should be 1 instead of 0.
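For reference, here is a minimal sketch of the check I would expect to pass for this sample. It is not the framework's actual math reward code; `extract_last_boxed` and `expected_reward` are hypothetical helper names used only for illustration. It takes the contents of the last `\boxed{...}` in the response and compares them with the label as strings. Converting the label with `str()` is an assumption on my side, made because the log prints `label: 25` without quotes, so the label may be an integer.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def expected_reward(response: str, label) -> int:
    """Hypothetical exact-match reward: 1 if the boxed answer equals the label."""
    answer = extract_last_boxed(response)
    # Assumption: the label may be an int, so compare on its string form.
    return int(answer is not None and answer == str(label).strip())


# Tail of the logged response from the sample above.
tail = "= \\frac{100}{4} = 25$$\n\nAnswer: $\\boxed{25}$<|im_end|>"
print(expected_reward(tail, 25))
```

With this simple exact-match check the sample scores 1. If the real reward function returns 0, the difference is presumably in how the answer is extracted or how the label is compared.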