
Commit a2c0237

Add debug logging and troubleshooting guide for LanguageReward

- Add debug prints to RewardActor showing:
  * Reward value
  * Number of thinking blocks
  * Detected language
  * Response sample
- Create debug_reward.py script for testing the reward function
  * Tests 8 common scenarios
  * Shows detected language and confidence
- Add TROUBLESHOOTING.md with solutions for:
  * Model thinking in English instead of Japanese
  * Empty or missing thinking blocks
  * Short content that langid can't detect
  * Stronger system prompt alternatives
  * Expected training progression

This helps diagnose why LanguageReward might be constantly zero.

1 parent b15f171 commit a2c0237

File tree

3 files changed: 283 additions, 0 deletions

TROUBLESHOOTING.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
# Troubleshooting LanguageReward Training

## Issue: Language Reward is Always Zero

If you're seeing the LanguageReward constantly at 0.0 during training, here's how to debug it:

### 1. Check What the Model is Generating

The updated `main.py` includes debug logging. When you run training, look for lines like:

```
[LanguageReward Debug] Reward=0.00 | Blocks=1 | Lang=en | Sample: <think>Let me solve this step by step...</think>...
```

This tells you:
- **Reward**: The actual reward value
- **Blocks**: Number of thinking blocks found
- **Lang**: Language detected by langid
- **Sample**: First 80 characters of the response

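To pull just these lines out of a long run, you can filter the captured trainer output (a trivial sketch; `training.log` is a placeholder for wherever you redirect stdout):

```bash
# Show the 20 most recent LanguageReward debug lines.
# "training.log" is a placeholder; redirect your trainer's stdout there first.
grep "LanguageReward Debug" training.log | tail -n 20
```
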
### 2. Common Causes and Solutions

#### Cause 1: Model is Thinking in English

**Symptom**: `Lang=en` in debug output

**Why**: The model defaults to English because:
- The dataset (GSM8K) is in English
- Most models are English-dominant
- The instruction might not be strong enough

**Solutions**:

A) **Strengthen the system prompt** (edit `main.py` lines 217-220):
```python
system_prompt = """
あなたは数学の問題を解くAIです。<think>タグの中で日本語で考えてください。これは必須です。
Put all your scratchpad work between <think> and </think> tags. You MUST think in Japanese (日本語) inside the <think> tags.
Your final answer should be between <answer> and </answer> tags, otherwise it will not be scored.

Example:
<think>この問題を解きましょう。2 + 2 = 4です。</think>
<answer>4</answer>
"""
```

B) **Start with a higher language reward weight**:
In `main.py` line 327, you could add multiple LanguageReward instances:
```python
reward_functions=[
    MathReward(),
    ThinkingReward(),
    LanguageReward(target_language="ja"),
    LanguageReward(target_language="ja"),  # Double weight for language
]
```

C) **Use few-shot examples in the prompt**:
Add Japanese reasoning examples to each problem in the dataset transform, as in the sketch below.

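A minimal sketch of such a transform; the function name and the `question` field are assumptions to adapt to the actual GSM8K transform in `main.py`:

```python
# Hypothetical sketch: prepend one Japanese worked example to each problem.
# "add_japanese_few_shot" and the "question" field are assumptions; adapt
# them to the actual dataset transform used in main.py.
JA_FEW_SHOT = (
    "Question: What is 5 + 3?\n"
    "<think>5と3を足します。5 + 3 = 8です。</think>\n"
    "<answer>8</answer>\n\n"
)

def add_japanese_few_shot(sample: dict) -> dict:
    sample["question"] = JA_FEW_SHOT + sample["question"]
    return sample
```
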
#### Cause 2: Model Not Using Thinking Blocks

**Symptom**: `Blocks=0` in debug output

**Why**: The model hasn't learned to use `<think>` tags yet

**Solution**: This should improve as ThinkingReward trains the model. Be patient for the first few hundred steps. The fallback reward (0.2) should help when there are no blocks but the response contains Japanese text.

#### Cause 3: Empty or Very Short Thinking Blocks

**Symptom**: `Lang=en` with very short content, Reward=0.00

**Why**: langid needs sufficient text to reliably detect language. Very short text (< 10 chars) often defaults to English.

**Solution**:
- Wait for the model to generate longer reasoning (this improves with training)
- The ThinkingReward encourages substantial content in thinking blocks

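You can see the short-text behavior directly with langid (a quick sanity check, separate from the training code; exact outputs depend on langid's built-in model):

```python
import langid

# Digits and symbols carry almost no language signal, so langid tends to
# fall back to English on them; longer Japanese text is detected reliably.
print(langid.classify("4"))           # often ('en', ...) on tiny inputs
print(langid.classify("2 + 2 = 4"))   # symbols/digits: unreliable detection
print(langid.classify("この問題を解きましょう。二足す二は四です。"))  # ('ja', ...)
```
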
#### Cause 4: Mixed Language Content

**Symptom**: Reward sometimes 1.0, sometimes 0.0, seemingly at random

**Why**: When English and Japanese are mixed, langid detects whichever is dominant.

**Solution**: This will stabilize as training progresses and the model learns consistency.

### 3. Expected Training Progression

**Steps 0-200**: Language reward often 0.0
- Model learning to use `<think>` tags (ThinkingReward)
- Model thinking in English (its natural default)
- Fallback rewards (0.2) when Japanese appears elsewhere

**Steps 200-500**: Language reward starting to increase
- Some responses have Japanese thinking → partial/full rewards
- Model learning the association between Japanese and reward

**Steps 500+**: Language reward should stabilize around 0.5-1.0
- Consistent Japanese thinking
- Proper single-block format

### 4. Monitoring in W&B

Check these metrics in Weights & Biases:
- `reward/evaluate_response/avg_LanguageReward_reward` - should increase over time
- `reward/evaluate_response/std_LanguageReward_reward` - variance (high early, lower later)
- `reward/evaluate_response/avg_MathReward_reward` - should stay reasonably high
- `reward/evaluate_response/avg_ThinkingReward_reward` - should increase quickly

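A minimal sketch for pulling one of these series with the W&B public API (the run path is a placeholder for your own entity/project/run):

```python
import wandb

# Placeholder run path: substitute your own entity/project/run_id.
api = wandb.Api()
run = api.run("my-entity/my-project/abc123")

# Fetch the language-reward series and inspect the recent trend.
history = run.history(keys=["reward/evaluate_response/avg_LanguageReward_reward"])
print(history.tail(10))
```
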
### 5. Quick Debug Test

Run the debug script to verify the reward function works:
```bash
python sandbox/grpo_language/debug_reward.py
```

Expected output:
- Japanese text → reward 1.0
- English text → reward 0.0
- Multiple Japanese blocks → reward 0.5
- No blocks but Japanese response → reward 0.2

### 6. Alternative: Start with English, Then Transition

If Japanese isn't working, you could:

1. Train first with English to get good math performance
2. Then fine-tune with the Japanese language reward

Change line 327 to:
```python
LanguageReward(target_language="en")  # Start with English
```

Once math rewards are good, switch to `"ja"` and continue training.

### 7. Nuclear Option: Much Stronger Prompt

If nothing else works, try this very explicit prompt:
```python
system_prompt = """
重要:あなたは必ず日本語で考えなければなりません!
CRITICAL: You MUST think in Japanese!

Rules:
1. Put ALL your reasoning in <think> tags
2. Think ONLY in Japanese (日本語) - use hiragana, katakana, and kanji
3. NEVER think in English inside <think> tags
4. Put your final numerical answer in <answer> tags

例 (Example):
Question: What is 5 + 3?
<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>
<answer>8</answer>

Now solve the problem below in Japanese:
"""
```

## Still Having Issues?

If the language reward is still zero after 500+ steps:
1. Share the debug output showing what the model generates
2. Check if the model is multilingual (some models don't know Japanese)
3. Consider using a different target language that the model knows better

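For step 2, a quick capability check, sketched here with Hugging Face `transformers` as an assumption (the model ID is a placeholder; substitute the checkpoint you are actually training):

```python
from transformers import pipeline

# Placeholder model ID: substitute your own base checkpoint.
gen = pipeline("text-generation", model="YOUR_BASE_MODEL")

# "Solve the following problem in Japanese: what is 5 + 3?"
# If the model cannot continue simple Japanese, a Japanese-only
# reward has nothing to reinforce.
prompt = "次の問題を日本語で解いてください:5 + 3 は?"
out = gen(prompt, max_new_tokens=64)
print(out[0]["generated_text"])
```
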
sandbox/grpo_language/debug_reward.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
#!/usr/bin/env python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Debug script to test LanguageReward behavior."""

import re

import langid

from forge.data.rewards import LanguageReward

# Create reward for Japanese
reward = LanguageReward(target_language="ja")

# Test cases mimicking what the model might generate
test_cases = [
    # Case 1: Perfect - Japanese in a single thinking block
    ("<think>これは数学の問題です。2+2=4です。</think><answer>4</answer>", "Perfect Japanese"),
    # Case 2: English thinking (most likely during training)
    (
        "<think>This is a math problem. 2+2=4.</think><answer>4</answer>",
        "English thinking",
    ),
    # Case 3: No thinking blocks at all
    ("The answer is 4.<answer>4</answer>", "No thinking blocks"),
    # Case 4: Empty thinking blocks
    ("<think></think><answer>4</answer>", "Empty thinking block"),
    # Case 5: Multiple thinking blocks in Japanese
    (
        "<think>最初の考え。</think><think>次の考え。</think><answer>4</answer>",
        "Multiple Japanese blocks",
    ),
    # Case 6: Just the answer, no thinking
    ("<answer>4</answer>", "Just answer tag"),
    # Case 7: Thinking with mostly numbers/symbols
    ("<think>2 + 2 = 4</think><answer>4</answer>", "Mostly numbers"),
    # Case 8: Mixed English and Japanese
    ("<think>Let me think... これは簡単です。</think><answer>4</answer>", "Mixed languages"),
]

print("=" * 80)
print("LanguageReward Debug Output (target_language='ja')")
print("=" * 80)

for response, description in test_cases:
    score = reward(prompt="", response=response, target=None)

    # Extract the thinking content, if any, the same way the reward does,
    # so we can show what langid sees.
    think_match = re.findall(
        r"<\s*think\s*>(.*?)<\s*/\s*think\s*>", response, re.IGNORECASE | re.DOTALL
    )

    if think_match:
        content = " ".join(think_match)
        detected_lang, confidence = langid.classify(content)
        print(f"\n{description}:")
        print(f"  Response: {response[:60]}...")
        print(f"  Reward: {score}")
        print(f"  Thinking blocks found: {len(think_match)}")
        print(f"  Detected language: {detected_lang} (confidence: {confidence:.3f})")
    else:
        # No blocks: check the fallback path on the tag-stripped response text
        response_text = re.sub(
            r"<\s*/?\s*think\s*>", "", response, flags=re.IGNORECASE
        ).strip()
        if response_text:
            detected_lang, confidence = langid.classify(response_text)
            print(f"\n{description}:")
            print(f"  Response: {response[:60]}...")
            print(f"  Reward: {score}")
            print("  Thinking blocks found: 0")
            print(
                f"  Fallback detection on response text: {detected_lang} (confidence: {confidence:.3f})"
            )
        else:
            print(f"\n{description}:")
            print(f"  Response: {response[:60]}...")
            print(f"  Reward: {score}")
            print("  No content to analyze")

print("\n" + "=" * 80)
print("Expected rewards:")
print("  full_reward (1.0): Single Japanese thinking block")
print("  partial_reward (0.5): Multiple Japanese thinking blocks")
print("  fallback_reward (0.2): No blocks but Japanese response text")
print("  no_match_reward (0.0): Wrong language")
print("=" * 80)

sandbox/grpo_language/main.py

Lines changed: 26 additions & 0 deletions
@@ -154,6 +154,32 @@ async def evaluate_response(self, prompt: str, response: str, target: str) -> fl
             reward_fn_name = getattr(
                 reward_fn, "__name__", reward_fn.__class__.__name__
             )
+
+            # Debug logging for LanguageReward to see what's happening
+            if reward_fn_name == "LanguageReward":
+                import re
+
+                import langid
+
+                think_matches = re.findall(
+                    r"<\s*think\s*>(.*?)<\s*/\s*think\s*>",
+                    response,
+                    re.IGNORECASE | re.DOTALL,
+                )
+                if think_matches:
+                    content = " ".join(think_matches)
+                    detected_lang, confidence = langid.classify(content)
+                    print(
+                        f"[LanguageReward Debug] Reward={reward:.2f} | "
+                        f"Blocks={len(think_matches)} | Lang={detected_lang} | "
+                        f"Sample: {response[:80]}..."
+                    )
+                else:
+                    print(
+                        f"[LanguageReward Debug] Reward={reward:.2f} | "
+                        f"Blocks=0 | Sample: {response[:80]}..."
+                    )
+
             # per function reward
             record_metric(
                 f"reward/evaluate_response/sum_{reward_fn_name}_reward",
