Add LanguageReward for training models to think in target language #515
Open · casteryh wants to merge 21 commits into main from language-reward-feature
Conversation
This commit introduces a new reward function that encourages models to think in a specific target language within their `<think>` tags.

Key changes:
- Add `LanguageReward` class to `src/forge/data/rewards.py`
  * Uses `langid` for language detection
  * Configurable target language (ISO 639-1 codes)
  * Returns `full_reward` for a language match, `no_match_reward` otherwise
  * Raises a helpful error if `langid` is not installed
- Add comprehensive unit tests in `tests/unit_tests/rl/test_language_reward.py`
  * Tests for multiple languages (English, Japanese, Chinese, Spanish, etc.)
  * Tests for edge cases and error handling
  * All 28 tests passing
- Create `sandbox/grpo_language/` app for experimentation
  * Extends `apps/grpo/` with `LanguageReward`
  * Hardcoded to Japanese (ja) as the default target language
  * Includes README with usage instructions
  * Config file for the Qwen3-1.7B model

Implementation details:
- Extracts text from `<think></think>` tags for analysis
- Concatenates multiple thinking blocks for language detection
- Compatible with the existing `MathReward` and `ThinkingReward`
- Does not add `langid` to requirements.txt (optional dependency)

Usage: `python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml`

Note: Requires `pip install langid` before use.
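A minimal sketch of what such a reward could look like, based on the behavior described in this commit (class internals and exact signatures here are assumptions; `langid.classify` is the real langid API and returns a `(language, score)` tuple):

```python
import re

try:
    import langid
except ImportError as e:  # optional dependency, per the commit message
    raise ImportError("LanguageReward requires langid: pip install langid") from e


class LanguageReward:
    """Rewards responses whose <think> content is in the target language (sketch)."""

    def __init__(self, target_language: str = "ja",
                 full_reward: float = 1.0, no_match_reward: float = 0.0):
        self.target_language = target_language  # ISO 639-1 code, e.g. "ja"
        self.full_reward = full_reward
        self.no_match_reward = no_match_reward

    def __call__(self, response: str) -> float:
        # Extract and concatenate all <think>...</think> blocks.
        blocks = re.findall(r"<think>(.*?)</think>", response, re.DOTALL)
        text = " ".join(blocks).strip()
        if not text:
            return self.no_match_reward
        detected, _score = langid.classify(text)
        return self.full_reward if detected == self.target_language else self.no_match_reward
```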
- Add `fallback_reward` parameter (default 0.2)
- If no `<think>` blocks are found, check whether the response text is in the target language
- Reward structure:
  * `full_reward` (1.0): single block + correct language
  * `partial_reward` (0.5): multiple blocks + correct language
  * `fallback_reward` (0.2): no blocks + correct language in response text
  * `no_match_reward` (0.0): wrong language
- Update all tests to reflect the new behavior (29 tests passing)
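The tiered logic from this commit, as a standalone sketch (function and parameter names are illustrative, not the repo's exact code):

```python
import re
import langid


def tiered_language_reward(response: str, target: str = "ja", full: float = 1.0,
                           partial: float = 0.5, fallback: float = 0.2,
                           no_match: float = 0.0) -> float:
    """Tiered reward: full for one block, partial for many, fallback for none."""
    blocks = re.findall(r"<think>(.*?)</think>", response, re.DOTALL)
    if blocks:
        if langid.classify(" ".join(blocks))[0] != target:
            return no_match                      # wrong language inside <think>
        return full if len(blocks) == 1 else partial
    # No <think> blocks: fall back to classifying the whole response text.
    return fallback if langid.classify(response)[0] == target else no_match
```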
- Add debug prints to `RewardActor` showing:
  * Reward value
  * Number of thinking blocks
  * Detected language
  * Response sample
- Create debug_reward.py script for testing the reward function
  * Tests 8 common scenarios
  * Shows detected language and confidence
- Add TROUBLESHOOTING.md with solutions for:
  * Model thinking in English instead of Japanese
  * Empty or missing thinking blocks
  * Short content that langid can't detect
  * Stronger system prompt alternatives
  * Expected training progression

This helps diagnose why LanguageReward might be constantly zero.
Core Changes:
- Add debug mode to LanguageReward class
* debug parameter: enables debug printing
* debug_sample_rate: controls sampling (default 0.1 = 10%)
* Prints detected language, confidence, reward, and sample text
* Shows why reward was given (e.g., 'single block, correct language ✓')
- Strengthen system prompt for Japanese thinking
* Add Japanese instructions at the top (あなたは数学の問題を解く..., roughly "You solve math problems...")
* List explicit CRITICAL RULES
* Include concrete example in Japanese
* Much more forceful about requiring Japanese in <think> tags
- Enable debug in sandbox app
* Set debug=True, debug_sample_rate=0.1 for LanguageReward
* Will print every ~10th response with language detection details
Example debug output:
[LanguageReward] Found 1 thinking block(s)
Target: ja | Detected: en | Confidence: -54.41
Thinking sample: Let me solve this step by step...
→ Reward: 0.0 (wrong language) ✗
This should help diagnose why rewards are zero during training
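Enabling the debug mode in the sandbox app could look like this (only `debug` and `debug_sample_rate` are named in the commit message; the rest is a usage sketch):

```python
from forge.data.rewards import LanguageReward

reward = LanguageReward(
    target_language="ja",
    debug=True,             # print detected language, confidence, and reward
    debug_sample_rate=0.1,  # sample ~10% of responses to keep logs readable
)
```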
BREAKING CHANGE: Default tag for `LanguageReward` changed from 'think' to '思考'.

Key changes:
- Both `ThinkingReward` and `LanguageReward` now accept a `tag` parameter
- `ThinkingReward` default: 'think' (backward compatible)
- `LanguageReward` default: '思考' (Japanese, breaks English associations)
- Sandbox app uses `<思考>` tags throughout
- System prompt updated to Japanese with `<思考>` examples
- All tests updated and passing (29/29)
- Debug script updated

Rationale: Models may be heavily trained to think in English when using `<think>` tags. Using the Japanese tag `<思考>` (shikō, "thinking") breaks this association and encourages thinking in the target language.
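Instantiating both rewards with the new `tag` parameter might look like this (the module path follows `src/forge/data/rewards.py`; constructor details beyond `tag` are assumptions):

```python
from forge.data.rewards import LanguageReward, ThinkingReward

thinking = ThinkingReward(tag="think")                       # backward-compatible default
language = LanguageReward(tag="思考", target_language="ja")  # new Japanese default tag
# The model is then prompted to reason inside <思考>...</思考> blocks.
```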
The old debug code was checking for <think> tags instead of <思考> tags, which was misleading. The LanguageReward class now has its own debug mode that correctly checks for the configured tag.
Changed from strong Japanese instructions with CRITICAL RULES to a simple English prompt that just mentions the <思考> tags with a Japanese example. This allows the RL training to do the work of encouraging Japanese thinking through rewards rather than relying on heavy prompt engineering.
Changes:
- Removed sandbox/grpo_language/qwen3_1_7b.yaml (use configs from apps/grpo/)
- Updated the usage comment in main.py to reference apps/grpo/qwen3_1_7b.yaml
- Updated README.md to reference apps/grpo/ configs
- Fixed README.md to use `<思考>` tags instead of `<think>` tags
Since `ThinkingReward` already enforces format (single vs multiple blocks), `LanguageReward` now focuses purely on language detection with simplified logic.

Detection strategy:
- If exactly one thinking block: detect the language of the block content only
- Otherwise (no blocks or multiple blocks): detect the language of the whole response
- Returns `match_reward` (1.0) if the language matches, `no_match_reward` (0.0) otherwise

Changes:
- Removed `partial_reward` and `fallback_reward` parameters (now just match/no-match)
- Renamed `full_reward` to `match_reward` for clarity
- Updated all 29 tests to match the new behavior (all passing)
- Updated README with a clearer explanation of the reward separation
- Updated debug script with the new expected rewards

This separation of concerns lets each reward specialize:
- `ThinkingReward`: format enforcement
- `LanguageReward`: language detection
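The simplified detection strategy as a standalone sketch (parameter names follow the commit message; everything else is illustrative):

```python
import re
import langid


def language_reward(response: str, tag: str = "思考", target: str = "ja",
                    match_reward: float = 1.0, no_match_reward: float = 0.0) -> float:
    """One block -> classify the block; otherwise classify the whole response."""
    blocks = re.findall(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    text = blocks[0] if len(blocks) == 1 else response
    detected, _score = langid.classify(text.strip())
    return match_reward if detected == target else no_match_reward
```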
Added `langid` to the `[project.optional-dependencies]` dev section so CI tests can run without failing on import.
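The corresponding pyproject.toml entry presumably looks something like this (the other contents of the dev extra are not shown in this PR and are assumed):

```toml
[project.optional-dependencies]
dev = [
    "langid",  # needed only for LanguageReward and its tests, not the core library
]
```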
Removed sandbox/grpo_language/debug_reward.py and updated TROUBLESHOOTING.md to remove references to it.
Replaced the misleading "Start with English" section with an explanation of why training on English won't work (models are already extensively pre-trained to think in English) and why Japanese is the right choice (a novel combination with a clear RL signal).
Added `test_custom_tag` to verify that `ThinkingReward` correctly uses the custom tag passed in during initialization. The test confirms:
- Responses with the custom tag get the full reward
- Responses with the default tag get no reward when a custom tag is used

All 26 ThinkingReward tests passing.
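A sketch of what the added test could look like (the call signature of `ThinkingReward` and the exact reward values are assumptions):

```python
from forge.data.rewards import ThinkingReward


def test_custom_tag():
    reward = ThinkingReward(tag="思考")
    # Responses using the custom tag are rewarded...
    assert reward("<思考>まず計算する</思考> Answer: 4") > 0.0
    # ...while the default <think> tag earns nothing when a custom tag is set.
    assert reward("<think>compute first</think> Answer: 4") == 0.0
```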
Increased match_reward from default 1.0 to 2.0 to give more weight to language matching in the multi-objective reward.
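In context, the weighting change might be applied like this (the list-of-rewards composition is illustrative; `MathReward` and `ThinkingReward` are named elsewhere in this PR):

```python
from forge.data.rewards import LanguageReward, MathReward, ThinkingReward

# Language matching now contributes up to 2.0 to the multi-objective reward,
# versus 1.0 each from the other rewards (aggregation shown is a sketch).
rewards = [
    MathReward(),
    ThinkingReward(),
    LanguageReward(target_language="ja", tag="思考", match_reward=2.0),
]
```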
Changed beta parameter from 0.1 to 0.0 in simple_grpo_loss to remove the KL divergence penalty term from the loss.
Changed beta parameter from 0.0 to 1e-3 in simple_grpo_loss to add a small KL divergence penalty.
Changed beta parameter from 1e-3 to 1e-4 in simple_grpo_loss.
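For reference, `beta` scales a KL-divergence penalty toward the reference policy. A minimal sketch of a GRPO-style loss in this shape (the repo's actual `simple_grpo_loss` signature is an assumption; this follows the common GRPO formulation with the k3 KL estimator):

```python
import torch


def simple_grpo_loss(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                     advantages: torch.Tensor, beta: float = 1e-4) -> torch.Tensor:
    # Per-token KL(pi || ref) via the k3 estimator.
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1
    # Policy-gradient term: exp(logp - logp.detach()) equals 1 in the forward
    # pass but carries the policy gradient. beta=0.0 disables the KL penalty.
    pg = torch.exp(logprobs - logprobs.detach()) * advantages
    return -(pg - beta * kl).mean()
```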
Apply the multi-epoch training fix from the fix-multi-epoch-training branch:
- Add epoch tracking to `DatasetActor`
- Store the base dataset for iterator reuse
- Restart the iterator with `set_epoch()` when `StopIteration` occurs
- Add epoch completion logging and metrics

This allows training to continue beyond the first epoch instead of stopping when the dataset is exhausted.
Endpoints cannot be called recursively. Changed the StopIteration handler from `return await self.sample()` to a while loop that keeps retrying the next sample after restarting the iterator.
Instead of a `while True` loop, simply return `next(self._iterator)` after restarting the iterator. This is cleaner and safer: if the dataset is truly empty, it raises StopIteration instead of looping forever.
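A sketch of the resulting handler (attribute names `_iterator`, `_epoch`, and `_base_dataset` follow the commit messages; initialization and the actor framework are elided and assumed):

```python
class DatasetActor:
    # ... initialization of self._base_dataset, self._iterator, self._epoch elided ...

    async def sample(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # Dataset exhausted: advance the epoch and restart the iterator.
            self._epoch += 1
            self._base_dataset.set_epoch(self._epoch)
            self._iterator = iter(self._base_dataset)
            # If the dataset is truly empty, this raises StopIteration
            # instead of recursing or looping forever.
            return next(self._iterator)
```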
Summary
Adds `LanguageReward` for training models to think in a specific target language. It is primarily intended as a testing/development task: the combination of math reasoning with language-specific thinking provides a good evaluation task that is challenging enough to be interesting but not overly complex or extensively pre-trained.

Changes
Core Implementation:
- `LanguageReward` class in `src/forge/data/rewards.py` using `langid` for language detection
- `match_reward` (1.0) if the language matches the target, `no_match_reward` (0.0) otherwise
- Configurable tag (default `<思考>` for Japanese to break English associations)

Separation of Concerns:
- `ThinkingReward`: enforces format (single vs multiple blocks)
- `LanguageReward`: detects language only

Testing & Sandbox:
- Sandbox app in `sandbox/grpo_language/` using Japanese as the default
- Reuses configs from `apps/grpo/` (no duplicate configs)
- Added `langid` to dev dependencies for CI

Usage
```bash
pip install langid  # Required dependency
python -m sandbox.grpo_language.main --config apps/grpo/qwen3_1_7b.yaml
```
Math problems with language-specific thinking make a good development test because:
- The combination is novel: models are not extensively pre-trained on it, so the reward provides a clear RL signal
- It is challenging enough to be interesting without being overly complex
Design Notes
- `<思考>`: breaks the model's association between thinking and English (models are heavily trained with English `<think>` tags)
- `langid` is in dev dependencies for CI, but not required for the core library