Simplify LanguageReward logic to focus on language detection only
Since ThinkingReward already enforces format (single vs multiple blocks),
LanguageReward now focuses purely on language detection with simplified logic:
Detection strategy:
- If exactly one thinking block: detect language of block content only
- Otherwise (no blocks or multiple blocks): detect language of whole response
- Returns match_reward (1.0) if language matches, no_match_reward (0.0) otherwise
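A minimal sketch of this detection strategy, assuming a pluggable `detect(text)` helper that returns an ISO language code (e.g. from a library such as `langdetect`); the function signature and parameter names here are illustrative, not the actual implementation:

```python
import re

def language_reward(response: str, target_lang: str, detect,
                    match_reward: float = 1.0, no_match_reward: float = 0.0) -> float:
    """Illustrative sketch: match_reward if the detected language matches the target."""
    blocks = re.findall(r"<思考>(.*?)</思考>", response, re.DOTALL)
    # Exactly one thinking block: detect the language of the block content only;
    # otherwise (no blocks, or multiple blocks) detect over the whole response.
    text = blocks[0] if len(blocks) == 1 else response
    return match_reward if detect(text) == target_lang else no_match_reward
```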
Changes:
- Removed partial_reward and fallback_reward parameters (now just match/no-match)
- Renamed full_reward to match_reward for clarity
- Updated all 29 tests to match new behavior (all passing)
- Updated README with clearer explanation of reward separation
- Updated debug script with new expected rewards
This separation of concerns allows each reward to specialize:
- ThinkingReward: format enforcement
- LanguageReward: language detection
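For contrast, the format-enforcement side could be sketched as follows (a hypothetical reconstruction of ThinkingReward's described behavior; the reward values and names are assumptions, not the repository's actual code):

```python
import re

def thinking_reward(response: str, full_reward: float = 1.0,
                    partial_reward: float = 0.5, no_reward: float = 0.0) -> float:
    """Sketch of format enforcement: exactly one well-formed block scores full reward."""
    n = len(re.findall(r"<思考>.*?</思考>", response, re.DOTALL))
    if n == 1:
        return full_reward      # exactly one thinking block
    if n > 1:
        return partial_reward   # multiple blocks: partial credit
    return no_reward            # no thinking blocks at all
```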
Changed file: sandbox/grpo_language/README.md (8 additions, 3 deletions)
```diff
@@ -44,11 +44,16 @@ You can use any of the config files from `apps/grpo/` (e.g., `qwen3_1_7b.yaml`,
 1. The model receives a math problem and is instructed to use `<思考>` tags for reasoning
 2. During training, the model generates responses with thinking blocks
 3. Three rewards are computed:
-   - Math correctness (did it get the right answer?)
-   - Thinking usage (did it use `<思考>` tags properly?)
-   - Language usage (did it think in Japanese?)
+   - **MathReward**: Did it get the right answer?
+   - **ThinkingReward**: Did it use `<思考>` tags properly? (single block = full reward, multiple blocks = partial reward)
+   - **LanguageReward**: Did it use the target language? Detection strategy:
+     - If exactly one thinking block: detect language of block content only
+     - Otherwise (no blocks or multiple blocks): detect language of whole response
+     - Returns match_reward (1.0) if detected language matches target, no_match_reward (0.0) otherwise
 4. The model is trained to maximize all three rewards
 
+**Note**: ThinkingReward enforces format (single vs multiple blocks), while LanguageReward focuses purely on language detection. This separation of concerns allows each reward to specialize in one aspect of the desired behavior.
```