21 commits
033fa60
Add LanguageReward for training models to think in target language
casteryh Oct 31, 2025
b12ed15
Update system prompt to instruct model to think in Japanese
casteryh Oct 31, 2025
b15f171
Add fallback reward for correct language without thinking blocks
casteryh Oct 31, 2025
a2c0237
Add debug logging and troubleshooting guide for LanguageReward
casteryh Oct 31, 2025
afca75c
Add debug printing to LanguageReward and strengthen system prompt
casteryh Oct 31, 2025
2625b28
Refactor to use configurable Japanese tags <思考> instead of English <t…
casteryh Oct 31, 2025
1a4d5fb
Remove old debug code from main.py
casteryh Oct 31, 2025
4e87a4d
Weaken system prompt to rely more on RL rewards
casteryh Oct 31, 2025
abb653e
Remove sandbox config and reference apps/grpo configs instead
casteryh Oct 31, 2025
7b4829c
Simplify LanguageReward logic to focus on language detection only
casteryh Oct 31, 2025
0ed798c
Add langid to dev dependencies for CI
casteryh Oct 31, 2025
5a3193e
Remove debug script
casteryh Oct 31, 2025
93a65b2
Clarify why English training won't work in TROUBLESHOOTING
casteryh Nov 1, 2025
f72be7f
Add unit test for ThinkingReward custom tag
casteryh Nov 1, 2025
6186f9f
Bump LanguageReward match_reward to 2.0
casteryh Nov 1, 2025
c640d37
Set KL divergence coefficient to zero in loss function
casteryh Nov 1, 2025
7fde86d
Change KL divergence coefficient to 1e-3
casteryh Nov 1, 2025
ffb6c43
Change KL divergence coefficient to 1e-4
casteryh Nov 1, 2025
7ffa20e
Enable multi-epoch training in sandbox/grpo_language app
casteryh Nov 2, 2025
1bf3cca
Fix recursive endpoint call - use while loop instead
casteryh Nov 2, 2025
7758b48
Simplify multi-epoch fix - use return next() instead of while loop
casteryh Nov 2, 2025
1 change: 1 addition & 0 deletions pyproject.toml
@@ -47,6 +47,7 @@ dev = [
"anyio",
"pytest-asyncio",
"multiprocess",
"langid",
]
docs = [
"sphinx==7.2.6",
98 changes: 98 additions & 0 deletions sandbox/grpo_language/README.md
@@ -0,0 +1,98 @@
# GRPO with Language Reward

This sandbox app demonstrates GRPO training with a language reward that encourages the model to think in a specific target language.

## Overview

This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<思考></思考>` tags) is in the target language.

**Key Insight**: Uses Japanese tags `<思考>` (shikō = "thinking") instead of English `<think>` tags to break the model's association between thinking tags and English language. This helps encourage multilingual thinking.

## Key Features

- **Multi-objective training**: Combines three rewards:
- `MathReward`: Evaluates correctness of math answers
- `ThinkingReward`: Encourages use of `<思考>` tags
- `LanguageReward`: Rewards thinking in target language (Japanese by default)

- **Japanese thinking tags**: Uses `<思考>` instead of `<think>` to encourage non-English reasoning

- **Language detection**: Uses `langid` to detect the language of thinking blocks

- **Configurable target language**: While this app defaults to Japanese (`ja`), the `LanguageReward` can be configured for any ISO 639-1 language code

- **Configurable tags**: Both rewards support custom tag names via the `tag` parameter

## Requirements

Before running this app, install the required language detection library:

```bash
pip install langid
```
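
A quick way to sanity-check the install: `langid.classify` returns a `(language code, score)` tuple, so classifying a Japanese and an English string directly should give `ja` and `en` respectively.

```python
import langid

# Verify langid is importable and returns ISO 639-1 codes.
print(langid.classify("この問題を解きましょう。"))           # expected: ('ja', <score>)
print(langid.classify("Let me solve this step by step."))    # expected: ('en', <score>)
```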

## Usage

```bash
python -m sandbox.grpo_language.main --config apps/grpo/qwen3_1_7b.yaml
```

You can use any of the config files from `apps/grpo/` (e.g., `qwen3_1_7b.yaml`, `qwen3_8b.yaml`, `qwen3_32b.yaml`).

## How It Works

1. The model receives a math problem and is instructed to use `<思考>` tags for reasoning
2. During training, the model generates responses with thinking blocks
3. Three rewards are computed:
- **MathReward**: Did it get the right answer?
- **ThinkingReward**: Did it use `<思考>` tags properly? (single block = full reward, multiple blocks = partial reward)
- **LanguageReward**: Did it use the target language? Detection strategy:
- If exactly one thinking block: detect language of block content only
- Otherwise (no blocks or multiple blocks): detect language of whole response
- Returns match_reward (1.0) if detected language matches target, no_match_reward (0.0) otherwise
4. The model is trained to maximize all three rewards

**Note**: ThinkingReward enforces format (single vs multiple blocks), while LanguageReward focuses purely on language detection. This separation of concerns allows each reward to specialize in one aspect of the desired behavior.
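
As a rough illustration of this detection strategy, a minimal standalone sketch could look like the following (the actual `LanguageReward` lives in `forge.data.rewards` and may differ in its defaults and implementation details):

```python
import re

import langid

def language_reward_sketch(
    response: str,
    target_language: str = "ja",
    tag: str = "思考",
    match_reward: float = 1.0,
    no_match_reward: float = 0.0,
) -> float:
    # Collect all <tag>...</tag> blocks from the response.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    blocks = re.findall(pattern, response, flags=re.DOTALL)
    # Exactly one thinking block: detect the language of its contents only.
    # No blocks or multiple blocks: fall back to the whole response.
    text = blocks[0] if len(blocks) == 1 else response
    detected, _score = langid.classify(text)
    return match_reward if detected == target_language else no_match_reward
```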

## Configuration

### Target Language

The target language is configured as Japanese in `main.py`:

```python
LanguageReward(target_language="ja", tag="思考")
ThinkingReward(tag="思考")
```

To use a different language, change `target_language` to the appropriate ISO 639-1 code:
- English: `"en"`
- Chinese: `"zh"`
- Spanish: `"es"`
- French: `"fr"`
- etc.
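
For example, to target Chinese instead (the tag name itself can stay the same), the construction would look something like:

```python
LanguageReward(target_language="zh", tag="思考")
ThinkingReward(tag="思考")
```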

## Expected Behavior

Over the course of training, the model should learn to:
1. Solve math problems correctly
2. Use `<思考></思考>` tags for its reasoning
3. Write its thinking in Japanese (or the configured target language)

## Metrics

The following metrics are logged to W&B:
- `reward/evaluate_response/avg_LanguageReward_reward`: Average language reward
- `reward/evaluate_response/avg_MathReward_reward`: Average math reward
- `reward/evaluate_response/avg_ThinkingReward_reward`: Average thinking reward
- `reward/evaluate_response/avg_total_reward`: Average of all rewards

## Differences from Standard GRPO

This is a modified version of `apps/grpo/main.py` with:
1. Added import: `from forge.data.rewards import LanguageReward`
2. Modified reward functions list to include `LanguageReward(target_language="ja")`
3. Updated config to use different W&B group name

All other training logic remains the same.
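
Concretely, the modified reward list might look roughly like this (a sketch; `MathReward` and `ThinkingReward` come from the standard GRPO app, and the exact `tag` arguments may differ from what `main.py` passes):

```python
# Assumes MathReward and ThinkingReward are importable from the same
# module used by apps/grpo/main.py; adjust the import to match.
from forge.data.rewards import LanguageReward, MathReward, ThinkingReward

reward_functions = [
    MathReward(),                                      # answer correctness
    ThinkingReward(tag="思考"),                        # proper use of <思考> blocks
    LanguageReward(target_language="ja", tag="思考"),  # thinking in the target language
]
```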
147 changes: 147 additions & 0 deletions sandbox/grpo_language/TROUBLESHOOTING.md
@@ -0,0 +1,147 @@
# Troubleshooting LanguageReward Training

## Issue: Language Reward is Always Zero

If you're seeing the LanguageReward constantly at 0.0 during training, here's how to debug:

### 1. Check What the Model is Generating

The updated `main.py` includes debug logging. When you run training, look for lines like:

```
[LanguageReward Debug] Reward=0.00 | Blocks=1 | Lang=en | Sample: <think>Let me solve this step by step...</think>...
```

This tells you:
- **Reward**: The actual reward value
- **Blocks**: Number of thinking blocks found
- **Lang**: Language detected by langid
- **Sample**: First 80 chars of the response

### 2. Common Causes and Solutions

#### Cause 1: Model is Thinking in English

**Symptom**: `Lang=en` in debug output

**Why**: The model defaults to English because:
- The dataset (GSM8K) is in English
- Most models are English-dominant
- The instruction might not be strong enough

**Solutions**:

A) **Strengthen the system prompt** (edit `main.py` lines 217-220):
```python
system_prompt = """
あなたは数学の問題を解くAIです。<think>タグの中で日本語で考えてください。これは必須です。
Put all your scratchpad work between <think> and </think> tags. You MUST think in Japanese (日本語) inside the <think> tags.
Your final answer should be between <answer> and </answer> tags otherwise it will not be scored.

Example:
<think>この問題を解きましょう。2 + 2 = 4です。</think>
<answer>4</answer>
"""
```

B) **Start with higher language reward weight**:
In `main.py` line 327, you could add multiple LanguageReward instances:
```python
reward_functions=[
MathReward(),
ThinkingReward(),
LanguageReward(target_language="ja"),
LanguageReward(target_language="ja"), # Double weight for language
]
```

C) **Use few-shot examples in the prompt**:
Add Japanese reasoning examples to each problem in the dataset transform.
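
For instance, a transform could prepend a short Japanese worked example to each question (purely hypothetical field names and prompt shape; adapt to the actual dataset transform in `main.py`):

```python
# Hypothetical few-shot prefix; the real transform and field names may differ.
FEW_SHOT_JA = (
    "例:\n"
    "<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>\n"
    "<answer>8</answer>\n\n"
)

def add_japanese_few_shot(sample: dict) -> dict:
    # Prepend the worked example to the problem text.
    sample["question"] = FEW_SHOT_JA + sample["question"]
    return sample
```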

#### Cause 2: Model Not Using Thinking Blocks

**Symptom**: `Blocks=0` in debug output

**Why**: The model hasn't learned to use `<think>` tags yet

**Solution**: This should improve as ThinkingReward trains the model; be patient for the first few hundred steps. The fallback reward (0.2) should help when there are no blocks but the response contains Japanese text.

#### Cause 3: Empty or Very Short Thinking Blocks

**Symptom**: `Lang=en` with very short content, Reward=0.00

**Why**: langid needs sufficient text to reliably detect language. Very short text (< 10 chars) often defaults to English.

**Solution**:
- Wait for model to generate longer reasoning (this improves with training)
- The ThinkingReward encourages substantial content in thinking blocks

#### Cause 4: Mixed Language Content

**Symptom**: Reward sometimes 1.0, sometimes 0.0 randomly

**Why**: When English and Japanese are mixed, langid detects whichever is dominant.

**Solution**: This will stabilize as training progresses and the model learns consistency.

### 3. Expected Training Progression

**Steps 0-200**: Language reward often 0.0
- Model learning to use `<think>` tags (ThinkingReward)
- Model thinking in English (natural default)
- Fallback rewards (0.2) when Japanese appears elsewhere

**Steps 200-500**: Language reward starting to increase
- Some responses have Japanese thinking → partial/full rewards
- Model learning association between Japanese and reward

**Steps 500+**: Language reward should stabilize around 0.5-1.0
- Consistent Japanese thinking
- Proper single-block format

### 4. Monitoring in W&B

Check these metrics in Weights & Biases:
- `reward/evaluate_response/avg_LanguageReward_reward` - should increase over time
- `reward/evaluate_response/std_LanguageReward_reward` - variance (high early, lower later)
- `reward/evaluate_response/avg_MathReward_reward` - should stay reasonably high
- `reward/evaluate_response/avg_ThinkingReward_reward` - should increase quickly

### 5. Why Not Train with English?

Training with English thinking won't work well because:
- Models are already extensively trained on GSM8K and similar datasets with English thinking
- There's little room for improvement on English math reasoning
- The RL signal would be weak (model already knows how to do this)

**That's why we use Japanese** - it provides a novel combination of math reasoning + non-English thinking that the model hasn't been extensively pre-trained on, giving clear RL signal for improvement.

### 6. Nuclear Option: Much Stronger Prompt

If nothing else works, try this very explicit prompt:
```python
system_prompt = """
重要:あなたは必ず日本語で考えなければなりません!
CRITICAL: You MUST think in Japanese language!

Rules:
1. Put ALL your reasoning in <think> tags
2. Think ONLY in Japanese (日本語) - use hiragana, katakana, and kanji
3. NEVER think in English inside <think> tags
4. Put your final numerical answer in <answer> tags

例 (Example):
Question: What is 5 + 3?
<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>
<answer>8</answer>

Now solve the problem below in Japanese:
"""
```

## Still Having Issues?

If language reward is still zero after 500+ steps:
1. Share the debug output showing what the model generates
2. Check if the model is multilingual (some models don't know Japanese)
3. Consider using a different target language the model knows better