# Troubleshooting LanguageReward Training

## Issue: Language Reward is Always Zero

If you're seeing the LanguageReward constantly at 0.0 during training, here's how to debug:

### 1. Check What the Model is Generating

The updated `main.py` includes debug logging. When you run training, look for lines like:

```
[LanguageReward Debug] Reward=0.00 | Blocks=1 | Lang=en | Sample: <think>Let me solve this step by step...</think>...
```

This tells you:
- **Reward**: The actual reward value
- **Blocks**: Number of thinking blocks found
- **Lang**: Language detected by langid
- **Sample**: First 80 chars of the response
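As a rough sketch of how the `Blocks` and `Sample` fields could be derived (the actual `main.py` logic may differ, and `langid.classify` supplies the `Lang` field there; it is omitted here so the snippet runs without that dependency):

```python
import re

# Hypothetical reconstruction of the debug fields, not the real main.py code.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def debug_fields(response: str) -> dict:
    """Compute the Blocks and Sample fields of the debug log line."""
    blocks = THINK_RE.findall(response)
    return {
        "blocks": len(blocks),    # "Blocks": number of <think>...</think> blocks
        "sample": response[:80],  # "Sample": first 80 chars of the response
    }

fields = debug_fields("<think>Let me solve this step by step...</think><answer>4</answer>")
print(fields["blocks"])  # 1
```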
### 2. Common Causes and Solutions

#### Cause 1: Model is Thinking in English

**Symptom**: `Lang=en` in debug output

**Why**: The model defaults to English because:
- The dataset (GSM8K) is in English
- Most models are English-dominant
- The instruction might not be strong enough

**Solutions**:

A) **Strengthen the system prompt** (edit `main.py` lines 217-220):
```python
system_prompt = """
あなたは数学の問題を解くAIです。<think>タグの中で日本語で考えてください。これは必須です。
Put all your scratchpad work between <think> and </think> tags. You MUST think in Japanese (日本語) inside the <think> tags.
Your final answer should be between <answer> and </answer> tags, otherwise it will not be scored.

Example:
<think>この問題を解きましょう。2 + 2 = 4です。</think>
<answer>4</answer>
"""
```

B) **Start with a higher language reward weight**:
In `main.py` line 327, you could add multiple LanguageReward instances:
```python
reward_functions=[
    MathReward(),
    ThinkingReward(),
    LanguageReward(target_language="ja"),
    LanguageReward(target_language="ja"),  # Duplicate instance doubles the language weight
]
```

C) **Use few-shot examples in the prompt**:
Add Japanese reasoning examples to each problem in the dataset transform.
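A minimal sketch of option C, assuming the transform receives dicts with a `question` field (the field name and the few-shot text are illustrative, not taken from the actual GSM8K transform):

```python
# Hypothetical dataset transform: prepend a Japanese worked example to each
# question so the model sees Japanese reasoning before it answers.
FEW_SHOT = (
    "Question: What is 5 + 3?\n"
    "<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>\n"
    "<answer>8</answer>\n\n"
)

def add_few_shot(sample: dict) -> dict:
    """Return a copy of the sample with a Japanese few-shot example prepended."""
    sample = dict(sample)  # avoid mutating the original record
    sample["question"] = FEW_SHOT + "Question: " + sample["question"]
    return sample
```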

#### Cause 2: Model Not Using Thinking Blocks

**Symptom**: `Blocks=0` in debug output

**Why**: The model hasn't learned to use `<think>` tags yet.

**Solution**: This should improve as ThinkingReward trains the model; be patient for the first few hundred steps. The fallback reward (0.2) should help when there are no blocks but the response contains Japanese.
#### Cause 3: Empty or Very Short Thinking Blocks

**Symptom**: `Lang=en` with very short content, Reward=0.00

**Why**: langid needs sufficient text to detect a language reliably. Very short text (under ~10 characters) often defaults to English.

**Solution**:
- Wait for the model to generate longer reasoning (this improves with training)
- The ThinkingReward encourages substantial content in thinking blocks
#### Cause 4: Mixed Language Content

**Symptom**: Reward fluctuates between 1.0 and 0.0

**Why**: When English and Japanese are mixed, langid reports whichever language is dominant.

**Solution**: This should stabilize as training progresses and the model learns to be consistent.
### 3. Expected Training Progression

**Steps 0-200**: Language reward often 0.0
- Model learning to use `<think>` tags (ThinkingReward)
- Model thinking in English (natural default)
- Fallback rewards (0.2) when Japanese appears elsewhere

**Steps 200-500**: Language reward starting to increase
- Some responses have Japanese thinking → partial/full rewards
- Model learning the association between Japanese and reward

**Steps 500+**: Language reward should stabilize around 0.5-1.0
- Consistent Japanese thinking
- Proper single-block format
### 4. Monitoring in W&B

Check these metrics in Weights & Biases:
- `reward/evaluate_response/avg_LanguageReward_reward` - should increase over time
- `reward/evaluate_response/std_LanguageReward_reward` - variance (high early, lower later)
- `reward/evaluate_response/avg_MathReward_reward` - should stay reasonably high
- `reward/evaluate_response/avg_ThinkingReward_reward` - should increase quickly
### 5. Quick Debug Test

Run the debug script to verify the reward function works:
```bash
python sandbox/grpo_language/debug_reward.py
```

Expected output:
- Japanese text → reward 1.0
- English text → reward 0.0
- Multiple Japanese blocks → reward 0.5
- No blocks but Japanese response → reward 0.2
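If you want to sanity-check those scoring rules without running the full training stack, here is a rough reimplementation. It substitutes a crude Japanese-character test for langid, so treat it as an approximation of LanguageReward's logic, not the actual reward function:

```python
import re

# Approximate reimplementation of the reward table above (assumption: the
# real LanguageReward uses langid; a kana/kanji character check stands in
# for it here so the sketch has no third-party dependency).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji

def looks_japanese(text: str) -> bool:
    return bool(JA_CHARS.search(text))

def language_reward(response: str) -> float:
    blocks = THINK_RE.findall(response)
    if not blocks:
        # Fallback: no thinking blocks, but Japanese appears in the response
        return 0.2 if looks_japanese(response) else 0.0
    if all(looks_japanese(b) for b in blocks):
        # Full reward only for the proper single-block format
        return 1.0 if len(blocks) == 1 else 0.5
    return 0.0

print(language_reward("<think>2と2を足すと4です。</think><answer>4</answer>"))  # 1.0
```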
### 6. Alternative: Start with English, Then Transition

If Japanese isn't working, you could:

1. Train first with English to get good math performance
2. Then fine-tune with the Japanese language reward

Change line 327 to:
```python
LanguageReward(target_language="en")  # Start with English
```

Once math rewards are good, switch to `"ja"` and continue training.
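One way to wire up such a two-stage schedule (the step threshold and the helper function are hypothetical, not part of `main.py`):

```python
# Hypothetical curriculum helper: reward English reasoning early, then
# switch the target to Japanese once math performance has had time to form.
def language_target(step: int, switch_at: int = 2000) -> str:
    """Return the reward's target language for the current training step."""
    return "en" if step < switch_at else "ja"

# e.g. LanguageReward(target_language=language_target(step))
```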
### 7. Nuclear Option: Much Stronger Prompt

If nothing else works, try this very explicit prompt:
```python
system_prompt = """
重要:あなたは必ず日本語で考えなければなりません!
CRITICAL: You MUST think in Japanese language!

Rules:
1. Put ALL your reasoning in <think> tags
2. Think ONLY in Japanese (日本語) - use hiragana, katakana, and kanji
3. NEVER think in English inside <think> tags
4. Put your final numerical answer in <answer> tags

例 (Example):
Question: What is 5 + 3?
<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>
<answer>8</answer>

Now solve the problem below in Japanese:
"""
```
## Still Having Issues?

If the language reward is still zero after 500+ steps:
1. Share the debug output showing what the model generates
2. Check whether the model is multilingual (some models don't know Japanese)
3. Consider using a different target language that the model knows better