21 commits
033fa60
Add LanguageReward for training models to think in target language
casteryh Oct 31, 2025
b12ed15
Update system prompt to instruct model to think in Japanese
casteryh Oct 31, 2025
b15f171
Add fallback reward for correct language without thinking blocks
casteryh Oct 31, 2025
a2c0237
Add debug logging and troubleshooting guide for LanguageReward
casteryh Oct 31, 2025
afca75c
Add debug printing to LanguageReward and strengthen system prompt
casteryh Oct 31, 2025
2625b28
Refactor to use configurable Japanese tags <思考> instead of English <t…
casteryh Oct 31, 2025
1a4d5fb
Remove old debug code from main.py
casteryh Oct 31, 2025
4e87a4d
Weaken system prompt to rely more on RL rewards
casteryh Oct 31, 2025
abb653e
Remove sandbox config and reference apps/grpo configs instead
casteryh Oct 31, 2025
7b4829c
Simplify LanguageReward logic to focus on language detection only
casteryh Oct 31, 2025
0ed798c
Add langid to dev dependencies for CI
casteryh Oct 31, 2025
5a3193e
Remove debug script
casteryh Oct 31, 2025
93a65b2
Clarify why English training won't work in TROUBLESHOOTING
casteryh Nov 1, 2025
f72be7f
Add unit test for ThinkingReward custom tag
casteryh Nov 1, 2025
6186f9f
Bump LanguageReward match_reward to 2.0
casteryh Nov 1, 2025
c640d37
Set KL divergence coefficient to zero in loss function
casteryh Nov 1, 2025
7fde86d
Change KL divergence coefficient to 1e-3
casteryh Nov 1, 2025
ffb6c43
Change KL divergence coefficient to 1e-4
casteryh Nov 1, 2025
7ffa20e
Enable multi-epoch training in sandbox/grpo_language app
casteryh Nov 2, 2025
1bf3cca
Fix recursive endpoint call - use while loop instead
casteryh Nov 2, 2025
7758b48
Simplify multi-epoch fix - use return next() instead of while loop
casteryh Nov 2, 2025
1 change: 1 addition & 0 deletions pyproject.toml
@@ -47,6 +47,7 @@ dev = [
"anyio",
"pytest-asyncio",
"multiprocess",
"langid",
]
docs = [
"sphinx==7.2.6",
98 changes: 98 additions & 0 deletions sandbox/grpo_language/README.md
@@ -0,0 +1,98 @@
# GRPO with Language Reward

This sandbox app demonstrates GRPO training with a language reward that encourages the model to think in a specific target language.

## Overview

This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<思考></思考>` tags) is in the target language.

**Key Insight**: Uses Japanese tags `<思考>` (shikō = "thinking") instead of English `<think>` tags to break the model's association between thinking tags and English language. This helps encourage multilingual thinking.

## Key Features

- **Multi-objective training**: Combines three rewards:
- `MathReward`: Evaluates correctness of math answers
- `ThinkingReward`: Encourages use of `<思考>` tags
- `LanguageReward`: Rewards thinking in target language (Japanese by default)

- **Japanese thinking tags**: Uses `<思考>` instead of `<think>` to encourage non-English reasoning

- **Language detection**: Uses `langid` to detect the language of thinking blocks

- **Configurable target language**: While this app defaults to Japanese (`ja`), the `LanguageReward` can be configured for any ISO 639-1 language code

- **Configurable tags**: Both rewards support custom tag names via the `tag` parameter

## Requirements

Before running this app, install the required language detection library:

```bash
pip install langid
```
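
A quick way to sanity-check the install: `langid.classify` returns a `(language code, score)` tuple, so classifying a Japanese and an English string directly should give `ja` and `en` respectively.

```python
import langid

# Verify langid is importable and returns ISO 639-1 codes.
print(langid.classify("この問題を解きましょう。"))           # expected: ('ja', <score>)
print(langid.classify("Let me solve this step by step."))    # expected: ('en', <score>)
```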

## Usage

```bash
python -m sandbox.grpo_language.main --config apps/grpo/qwen3_1_7b.yaml
```

You can use any of the config files from `apps/grpo/` (e.g., `qwen3_1_7b.yaml`, `qwen3_8b.yaml`, `qwen3_32b.yaml`).

## How It Works

1. The model receives a math problem and is instructed to use `<思考>` tags for reasoning
2. During training, the model generates responses with thinking blocks
3. Three rewards are computed:
- **MathReward**: Did it get the right answer?
- **ThinkingReward**: Did it use `<思考>` tags properly? (single block = full reward, multiple blocks = partial reward)
- **LanguageReward**: Did it use the target language? Detection strategy:
- If exactly one thinking block: detect language of block content only
- Otherwise (no blocks or multiple blocks): detect language of whole response
- Returns match_reward (1.0) if detected language matches target, no_match_reward (0.0) otherwise
4. The model is trained to maximize all three rewards

**Note**: ThinkingReward enforces format (single vs multiple blocks), while LanguageReward focuses purely on language detection. This separation of concerns allows each reward to specialize in one aspect of the desired behavior.
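
As a rough illustration of this detection strategy, a minimal standalone sketch could look like the following (the actual `LanguageReward` lives in `forge.data.rewards` and may differ in its defaults and implementation details):

```python
import re

import langid

def language_reward_sketch(
    response: str,
    target_language: str = "ja",
    tag: str = "思考",
    match_reward: float = 1.0,
    no_match_reward: float = 0.0,
) -> float:
    # Collect all <tag>...</tag> blocks from the response.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    blocks = re.findall(pattern, response, flags=re.DOTALL)
    # Exactly one thinking block: detect the language of its contents only.
    # No blocks or multiple blocks: fall back to the whole response.
    text = blocks[0] if len(blocks) == 1 else response
    detected, _score = langid.classify(text)
    return match_reward if detected == target_language else no_match_reward
```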

## Configuration

### Target Language

The target language is configured as Japanese in `main.py`:

```python
LanguageReward(target_language="ja", tag="思考")
ThinkingReward(tag="思考")
```

To use a different language, change `target_language` to the appropriate ISO 639-1 code:
- English: `"en"`
- Chinese: `"zh"`
- Spanish: `"es"`
- French: `"fr"`
- etc.
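
For example, to target Chinese instead (the tag name itself can stay the same), the construction would look something like:

```python
LanguageReward(target_language="zh", tag="思考")
ThinkingReward(tag="思考")
```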

## Expected Behavior

Over the course of training, the model should learn to:
1. Solve math problems correctly
2. Use `<思考></思考>` tags for its reasoning
3. Write its thinking in Japanese (or the configured target language)

## Metrics

The following metrics are logged to W&B:
- `reward/evaluate_response/avg_LanguageReward_reward`: Average language reward
- `reward/evaluate_response/avg_MathReward_reward`: Average math reward
- `reward/evaluate_response/avg_ThinkingReward_reward`: Average thinking reward
- `reward/evaluate_response/avg_total_reward`: Average of all rewards

## Differences from Standard GRPO

This is a modified version of `apps/grpo/main.py` with:
1. Added import: `from forge.data.rewards import LanguageReward`
2. Modified reward functions list to include `LanguageReward(target_language="ja")`
3. Updated config to use different W&B group name

All other training logic remains the same.
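
Concretely, the modified reward list might look roughly like this (a sketch; `MathReward` and `ThinkingReward` come from the standard GRPO app, and the exact `tag` arguments may differ from what `main.py` passes):

```python
# Assumes MathReward and ThinkingReward are importable from the same
# module used by apps/grpo/main.py; adjust the import to match.
from forge.data.rewards import LanguageReward, MathReward, ThinkingReward

reward_functions = [
    MathReward(),                                      # answer correctness
    ThinkingReward(tag="思考"),                        # proper use of <思考> blocks
    LanguageReward(target_language="ja", tag="思考"),  # thinking in the target language
]
```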
147 changes: 147 additions & 0 deletions sandbox/grpo_language/TROUBLESHOOTING.md
@@ -0,0 +1,147 @@
# Troubleshooting LanguageReward Training

## Issue: Language Reward is Always Zero

If you're seeing the LanguageReward constantly at 0.0 during training, here's how to debug:

### 1. Check What the Model is Generating

The updated `main.py` includes debug logging. When you run training, look for lines like:

```
[LanguageReward Debug] Reward=0.00 | Blocks=1 | Lang=en | Sample: <think>Let me solve this step by step...</think>...
```

This tells you:
- **Reward**: The actual reward value
- **Blocks**: Number of thinking blocks found
- **Lang**: Language detected by langid
- **Sample**: First 80 chars of the response

### 2. Common Causes and Solutions

#### Cause 1: Model is Thinking in English

**Symptom**: `Lang=en` in debug output

**Why**: The model defaults to English because:
- The dataset (GSM8K) is in English
- Most models are English-dominant
- The instruction might not be strong enough

**Solutions**:

A) **Strengthen the system prompt** (edit `main.py` lines 217-220):
```python
system_prompt = """
あなたは数学の問題を解くAIです。<think>タグの中で日本語で考えてください。これは必須です。
Put all your scratchpad work between <think> and </think> tags. You MUST think in Japanese (日本語) inside the <think> tags.
Your final answer should be between <answer> and </answer> tags otherwise it will not be scored.

Example:
<think>この問題を解きましょう。2 + 2 = 4です。</think>
<answer>4</answer>
"""
```

B) **Start with higher language reward weight**:
In `main.py` line 327, you could add multiple LanguageReward instances:
```python
reward_functions=[
MathReward(),
ThinkingReward(),
LanguageReward(target_language="ja"),
LanguageReward(target_language="ja"), # Double weight for language
]
```

C) **Use few-shot examples in the prompt**:
Add Japanese reasoning examples to each problem in the dataset transform.
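
For instance, a transform could prepend a short Japanese worked example to each question (purely hypothetical field names and prompt shape; adapt to the actual dataset transform in `main.py`):

```python
# Hypothetical few-shot prefix; the real transform and field names may differ.
FEW_SHOT_JA = (
    "例:\n"
    "<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>\n"
    "<answer>8</answer>\n\n"
)

def add_japanese_few_shot(sample: dict) -> dict:
    # Prepend the worked example to the problem text.
    sample["question"] = FEW_SHOT_JA + sample["question"]
    return sample
```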

#### Cause 2: Model Not Using Thinking Blocks

**Symptom**: `Blocks=0` in debug output

**Why**: The model hasn't learned to use `<think>` tags yet

**Solution**: This should improve as ThinkingReward trains the model; be patient for the first few hundred steps. The fallback reward (0.2) should help when there are no blocks but the response contains Japanese text.

#### Cause 3: Empty or Very Short Thinking Blocks

**Symptom**: `Lang=en` with very short content, Reward=0.00

**Why**: langid needs sufficient text to reliably detect language. Very short text (< 10 chars) often defaults to English.

**Solution**:
- Wait for model to generate longer reasoning (this improves with training)
- The ThinkingReward encourages substantial content in thinking blocks

#### Cause 4: Mixed Language Content

**Symptom**: Reward sometimes 1.0, sometimes 0.0 randomly

**Why**: When English and Japanese are mixed, langid detects whichever is dominant.

**Solution**: This will stabilize as training progresses and the model learns consistency.

### 3. Expected Training Progression

**Steps 0-200**: Language reward often 0.0
- Model learning to use `<think>` tags (ThinkingReward)
- Model thinking in English (natural default)
- Fallback rewards (0.2) when Japanese appears elsewhere

**Steps 200-500**: Language reward starting to increase
- Some responses have Japanese thinking → partial/full rewards
- Model learning association between Japanese and reward

**Steps 500+**: Language reward should stabilize around 0.5-1.0
- Consistent Japanese thinking
- Proper single-block format

### 4. Monitoring in W&B

Check these metrics in Weights & Biases:
- `reward/evaluate_response/avg_LanguageReward_reward` - should increase over time
- `reward/evaluate_response/std_LanguageReward_reward` - variance (high early, lower later)
- `reward/evaluate_response/avg_MathReward_reward` - should stay reasonably high
- `reward/evaluate_response/avg_ThinkingReward_reward` - should increase quickly

### 5. Why Not Train with English?

Training with English thinking won't work well because:
- Models are already extensively trained on GSM8K and similar datasets with English thinking
- There's little room for improvement on English math reasoning
- The RL signal would be weak (model already knows how to do this)

**That's why we use Japanese** - it provides a novel combination of math reasoning + non-English thinking that the model hasn't been extensively pre-trained on, giving clear RL signal for improvement.

### 6. Nuclear Option: Much Stronger Prompt

If nothing else works, try this very explicit prompt:
```python
system_prompt = """
重要:あなたは必ず日本語で考えなければなりません!
CRITICAL: You MUST think in Japanese language!

Rules:
1. Put ALL your reasoning in <think> tags
2. Think ONLY in Japanese (日本語) - use hiragana, katakana, and kanji
3. NEVER think in English inside <think> tags
4. Put your final numerical answer in <answer> tags

例 (Example):
Question: What is 5 + 3?
<think>5と3を足します。5 + 3 = 8です。答えは8です。</think>
<answer>8</answer>

Now solve the problem below in Japanese:
"""
```

## Still Having Issues?

If language reward is still zero after 500+ steps:
1. Share the debug output showing what the model generates
2. Check if the model is multilingual (some models don't know Japanese)
3. Consider using a different target language the model knows better