Add LanguageReward for training models to think in target language #515
Open · casteryh wants to merge 21 commits into main from language-reward-feature
Conversation
This commit introduces a new reward function that encourages models to think in a specific target language within their `<think>` tags.

Key changes:
- Add `LanguageReward` class to `src/forge/data/rewards.py`
  * Uses `langid` for language detection
  * Configurable target language (ISO 639-1 codes)
  * Returns `full_reward` for a language match, `no_match_reward` otherwise
  * Raises a helpful error if `langid` is not installed
- Add comprehensive unit tests in `tests/unit_tests/rl/test_language_reward.py`
  * Tests for multiple languages (English, Japanese, Chinese, Spanish, etc.)
  * Tests for edge cases and error handling
  * All 28 tests passing
- Create `sandbox/grpo_language/` app for experimentation
  * Extends `apps/grpo/` with `LanguageReward`
  * Hardcoded to Japanese (ja) as the default target language
  * Includes README with usage instructions
  * Config file for the Qwen3-1.7B model

Implementation details:
- Extracts text from `<think></think>` tags for analysis
- Concatenates multiple thinking blocks for language detection
- Compatible with the existing `MathReward` and `ThinkingReward`
- Does not add `langid` to requirements.txt (optional dependency)

Usage: `python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml`

Note: Requires `pip install langid` before use.
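A minimal sketch of what such a reward could look like, based on the behavior described in this commit (class internals and exact signatures here are assumptions; `langid.classify` is the real langid API and returns a `(language, score)` tuple):

```python
import re

try:
    import langid
except ImportError as e:  # optional dependency, per the commit message
    raise ImportError("LanguageReward requires langid: pip install langid") from e


class LanguageReward:
    """Rewards responses whose <think> content is in the target language (sketch)."""

    def __init__(self, target_language: str = "ja",
                 full_reward: float = 1.0, no_match_reward: float = 0.0):
        self.target_language = target_language  # ISO 639-1 code, e.g. "ja"
        self.full_reward = full_reward
        self.no_match_reward = no_match_reward

    def __call__(self, response: str) -> float:
        # Extract and concatenate all <think>...</think> blocks.
        blocks = re.findall(r"<think>(.*?)</think>", response, re.DOTALL)
        text = " ".join(blocks).strip()
        if not text:
            return self.no_match_reward
        detected, _score = langid.classify(text)
        return self.full_reward if detected == self.target_language else self.no_match_reward
```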
- Add `fallback_reward` parameter (default 0.2)
- If no `<think>` blocks are found, check whether the response text is in the target language
- Reward structure:
  * `full_reward` (1.0): single block + correct language
  * `partial_reward` (0.5): multiple blocks + correct language
  * `fallback_reward` (0.2): no blocks + correct language in response text
  * `no_match_reward` (0.0): wrong language
- Update all tests to reflect the new behavior (29 tests passing)
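The tiered logic from this commit, as a standalone sketch (function and parameter names are illustrative, not the repo's exact code):

```python
import re
import langid


def tiered_language_reward(response: str, target: str = "ja", full: float = 1.0,
                           partial: float = 0.5, fallback: float = 0.2,
                           no_match: float = 0.0) -> float:
    """Tiered reward: full for one block, partial for many, fallback for none."""
    blocks = re.findall(r"<think>(.*?)</think>", response, re.DOTALL)
    if blocks:
        if langid.classify(" ".join(blocks))[0] != target:
            return no_match                      # wrong language inside <think>
        return full if len(blocks) == 1 else partial
    # No <think> blocks: fall back to classifying the whole response text.
    return fallback if langid.classify(response)[0] == target else no_match
```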
- Add debug prints to `RewardActor` showing:
  * Reward value
  * Number of thinking blocks
  * Detected language
  * Response sample
- Create debug_reward.py script for testing the reward function
  * Tests 8 common scenarios
  * Shows detected language and confidence
- Add TROUBLESHOOTING.md with solutions for:
  * Model thinking in English instead of Japanese
  * Empty or missing thinking blocks
  * Short content that langid can't detect
  * Stronger system prompt alternatives
  * Expected training progression

This helps diagnose why LanguageReward might be constantly zero.
Core Changes:
- Add debug mode to LanguageReward class
* debug parameter: enables debug printing
* debug_sample_rate: controls sampling (default 0.1 = 10%)
* Prints detected language, confidence, reward, and sample text
* Shows why reward was given (e.g., 'single block, correct language ✓')
- Strengthen system prompt for Japanese thinking
* Add Japanese instructions at the top (あなたは数学の問題を解く..., roughly "You solve math problems...")
* List explicit CRITICAL RULES
* Include concrete example in Japanese
* Much more forceful about requiring Japanese in <think> tags
- Enable debug in sandbox app
* Set debug=True, debug_sample_rate=0.1 for LanguageReward
* Will print every ~10th response with language detection details
Example debug output:
[LanguageReward] Found 1 thinking block(s)
Target: ja | Detected: en | Confidence: -54.41
Thinking sample: Let me solve this step by step...
→ Reward: 0.0 (wrong language) ✗
This should help diagnose why rewards are zero during training
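Enabling the debug mode in the sandbox app could look like this (only `debug` and `debug_sample_rate` are named in the commit message; the rest is a usage sketch):

```python
from forge.data.rewards import LanguageReward

reward = LanguageReward(
    target_language="ja",
    debug=True,             # print detected language, confidence, and reward
    debug_sample_rate=0.1,  # sample ~10% of responses to keep logs readable
)
```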
BREAKING CHANGE: Default tag for `LanguageReward` changed from 'think' to '思考'.

Key changes:
- Both `ThinkingReward` and `LanguageReward` now accept a `tag` parameter
- `ThinkingReward` default: 'think' (backward compatible)
- `LanguageReward` default: '思考' (Japanese, breaks English associations)
- Sandbox app uses `<思考>` tags throughout
- System prompt updated to Japanese with `<思考>` examples
- All tests updated and passing (29/29)
- Debug script updated

Rationale: Models may be heavily trained to think in English when using `<think>` tags. Using the Japanese tag `<思考>` (shikō, "thinking") breaks this association and encourages thinking in the target language.
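Instantiating both rewards with the new `tag` parameter might look like this (the module path follows `src/forge/data/rewards.py`; constructor details beyond `tag` are assumptions):

```python
from forge.data.rewards import LanguageReward, ThinkingReward

thinking = ThinkingReward(tag="think")                       # backward-compatible default
language = LanguageReward(tag="思考", target_language="ja")  # new Japanese default tag
# The model is then prompted to reason inside <思考>...</思考> blocks.
```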
The old debug code was checking for <think> tags instead of <思考> tags, which was misleading. The LanguageReward class now has its own debug mode that correctly checks for the configured tag.
Changed from strong Japanese instructions with CRITICAL RULES to a simple English prompt that just mentions the <思考> tags with a Japanese example. This allows the RL training to do the work of encouraging Japanese thinking through rewards rather than relying on heavy prompt engineering.
Changes:
- Removed sandbox/grpo_language/qwen3_1_7b.yaml (use configs from apps/grpo/)
- Updated the usage comment in main.py to reference apps/grpo/qwen3_1_7b.yaml
- Updated README.md to reference apps/grpo/ configs
- Fixed README.md to use `<思考>` tags instead of `<think>` tags
Since `ThinkingReward` already enforces format (single vs multiple blocks), `LanguageReward` now focuses purely on language detection with simplified logic.

Detection strategy:
- If exactly one thinking block: detect the language of the block content only
- Otherwise (no blocks or multiple blocks): detect the language of the whole response
- Returns `match_reward` (1.0) if the language matches, `no_match_reward` (0.0) otherwise

Changes:
- Removed `partial_reward` and `fallback_reward` parameters (now just match/no-match)
- Renamed `full_reward` to `match_reward` for clarity
- Updated all 29 tests to match the new behavior (all passing)
- Updated README with a clearer explanation of the reward separation
- Updated debug script with the new expected rewards

This separation of concerns lets each reward specialize:
- `ThinkingReward`: format enforcement
- `LanguageReward`: language detection
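The simplified detection strategy as a standalone sketch (parameter names follow the commit message; everything else is illustrative):

```python
import re
import langid


def language_reward(response: str, tag: str = "思考", target: str = "ja",
                    match_reward: float = 1.0, no_match_reward: float = 0.0) -> float:
    """One block -> classify the block; otherwise classify the whole response."""
    blocks = re.findall(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    text = blocks[0] if len(blocks) == 1 else response
    detected, _score = langid.classify(text.strip())
    return match_reward if detected == target else no_match_reward
```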
Added `langid` to the `[project.optional-dependencies]` dev section so CI tests can run without failing on import.
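The corresponding pyproject.toml entry presumably looks something like this (the other contents of the dev extra are not shown in this PR and are assumed):

```toml
[project.optional-dependencies]
dev = [
    "langid",  # needed only for LanguageReward and its tests, not the core library
]
```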
Removed sandbox/grpo_language/debug_reward.py and updated TROUBLESHOOTING.md to remove references to it.
Replaced the misleading "Start with English" section with an explanation of why training on English won't work (models are already extensively pre-trained to think in English) and why Japanese is the right choice (a novel combination with a clear RL signal).
Added `test_custom_tag` to verify that `ThinkingReward` correctly uses the custom tag passed in during initialization. The test confirms:
- Responses with the custom tag get the full reward
- Responses with the default tag get no reward when a custom tag is used

All 26 ThinkingReward tests passing.
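A sketch of what the added test could look like (the call signature of `ThinkingReward` and the exact reward values are assumptions):

```python
from forge.data.rewards import ThinkingReward


def test_custom_tag():
    reward = ThinkingReward(tag="思考")
    # Responses using the custom tag are rewarded...
    assert reward("<思考>まず計算する</思考> Answer: 4") > 0.0
    # ...while the default <think> tag earns nothing when a custom tag is set.
    assert reward("<think>compute first</think> Answer: 4") == 0.0
```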
Increased match_reward from default 1.0 to 2.0 to give more weight to language matching in the multi-objective reward.
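In context, the weighting change might be applied like this (the list-of-rewards composition is illustrative; `MathReward` and `ThinkingReward` are named elsewhere in this PR):

```python
from forge.data.rewards import LanguageReward, MathReward, ThinkingReward

# Language matching now contributes up to 2.0 to the multi-objective reward,
# versus 1.0 each from the other rewards (aggregation shown is a sketch).
rewards = [
    MathReward(),
    ThinkingReward(),
    LanguageReward(target_language="ja", tag="思考", match_reward=2.0),
]
```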
Changed beta parameter from 0.1 to 0.0 in simple_grpo_loss to remove the KL divergence penalty term from the loss.
Changed beta parameter from 0.0 to 1e-3 in simple_grpo_loss to add a small KL divergence penalty.
Changed beta parameter from 1e-3 to 1e-4 in simple_grpo_loss.
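For reference, `beta` scales a KL-divergence penalty toward the reference policy. A minimal sketch of a GRPO-style loss in this shape (the repo's actual `simple_grpo_loss` signature is an assumption; this follows the common GRPO formulation with the k3 KL estimator):

```python
import torch


def simple_grpo_loss(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                     advantages: torch.Tensor, beta: float = 1e-4) -> torch.Tensor:
    # Per-token KL(pi || ref) via the k3 estimator.
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1
    # Policy-gradient term: exp(logp - logp.detach()) equals 1 in the forward
    # pass but carries the policy gradient. beta=0.0 disables the KL penalty.
    pg = torch.exp(logprobs - logprobs.detach()) * advantages
    return -(pg - beta * kl).mean()
```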
Apply the multi-epoch training fix from the fix-multi-epoch-training branch:
- Add epoch tracking to `DatasetActor`
- Store the base dataset for iterator reuse
- Restart the iterator with `set_epoch()` when `StopIteration` occurs
- Add epoch completion logging and metrics

This allows training to continue beyond the first epoch instead of stopping when the dataset is exhausted.
Endpoints cannot be called recursively. Changed the StopIteration handler from `return await self.sample()` to a while loop that keeps retrying the next sample after restarting the iterator.
Instead of a `while True` loop, simply return `next(self._iterator)` after restarting the iterator. This is cleaner and safer: if the dataset is truly empty, it raises StopIteration instead of looping forever.
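A sketch of the resulting handler (attribute names `_iterator`, `_epoch`, and `_base_dataset` follow the commit messages; initialization and the actor framework are elided and assumed):

```python
class DatasetActor:
    # ... initialization of self._base_dataset, self._iterator, self._epoch elided ...

    async def sample(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # Dataset exhausted: advance the epoch and restart the iterator.
            self._epoch += 1
            self._base_dataset.set_epoch(self._epoch)
            self._iterator = iter(self._base_dataset)
            # If the dataset is truly empty, this raises StopIteration
            # instead of recursing or looping forever.
            return next(self._iterator)
```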
Summary
Adds `LanguageReward` for training models to think in a specific target language. It is primarily intended as a testing/development task: the combination of math reasoning with language-specific thinking provides a good evaluation task that is challenging enough to be interesting but not overly complex or extensively pre-trained.

Changes
Core Implementation:
- `LanguageReward` class in `src/forge/data/rewards.py` using `langid` for language detection
- `match_reward` (1.0) if the language matches the target, `no_match_reward` (0.0) otherwise
- Configurable tag (default `<思考>` for Japanese to break English associations)

Separation of Concerns:
- `ThinkingReward`: enforces format (single vs multiple blocks)
- `LanguageReward`: detects language only

Testing & Sandbox:
- Sandbox app in `sandbox/grpo_language/` using Japanese as the default
- Reuses configs from `apps/grpo/` (no duplicate configs)
- Added `langid` to dev dependencies for CI

Usage
```bash
pip install langid  # Required dependency
python -m sandbox.grpo_language.main --config apps/grpo/qwen3_1_7b.yaml
```
Math problems with language-specific thinking make a good development test because:
- The combination is novel: models are not extensively pre-trained on it, so the reward provides a clear RL signal
- It is challenging enough to be interesting without being overly complex
Design Notes
- `<思考>`: breaks the model's association between thinking and English (models are heavily trained with English `<think>` tags)
- `langid` is in dev dependencies for CI, but not required for the core library