@casteryh casteryh commented Oct 31, 2025

Summary

Adds LanguageReward for training models to think in a specific target language. It is primarily intended as a testing/development task: the combination of math reasoning with language-specific thinking provides an evaluation task that is challenging enough to be interesting but not overly complex or extensively pre-trained.

Changes

Core Implementation:

  • LanguageReward class in src/forge/data/rewards.py using langid for language detection
  • Detection strategy (see the sketch after this list):
    • Single thinking block → detect language of block content only
    • Otherwise → detect language of whole response
  • Returns match_reward (1.0) if language matches target, no_match_reward (0.0) otherwise
  • Configurable tags (default <思考> for Japanese to break English associations)
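
A minimal sketch of this strategy, assuming the constructor arguments described in this PR and the langid.classify API (which returns a (language, score) pair); this is illustrative only, not the exact forge implementation:

import re

import langid


class LanguageRewardSketch:
    """Illustrative sketch of the detection strategy, not the forge class."""

    def __init__(self, target_language="ja", tag="思考",
                 match_reward=1.0, no_match_reward=0.0):
        self.target_language = target_language
        self.match_reward = match_reward
        self.no_match_reward = no_match_reward
        # Capture everything between <tag> and </tag>, across newlines.
        self._pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)

    def __call__(self, prompt, response):
        blocks = self._pattern.findall(response)
        # Exactly one thinking block: classify only its content.
        # Zero or multiple blocks: classify the whole response.
        text = blocks[0] if len(blocks) == 1 else response
        detected, _score = langid.classify(text)
        return self.match_reward if detected == self.target_language else self.no_match_reward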

Separation of Concerns:

  • ThinkingReward: Enforces format (single vs multiple blocks)
  • LanguageReward: Detects language only
  • This allows each reward to specialize in one aspect

Testing & Sandbox:

  • 29 unit tests covering multiple languages and edge cases (all passing ✅)
  • Sandbox app in sandbox/grpo_language/ using Japanese as default
  • Uses configs from apps/grpo/ (no duplicate configs)
  • Added langid to dev dependencies for CI

Usage

# langid is an optional dependency of forge, but LanguageReward requires it
pip install langid

# Launch the sandbox app with the shared GRPO config
python -m sandbox.grpo_language.main --config apps/grpo/qwen3_1_7b.yaml

# Use the reward directly
from forge.data.rewards import LanguageReward

reward = LanguageReward(target_language="ja", tag="思考")
score = reward("", "<思考>これは日本語です。</思考>")  # Returns 1.0 ("This is Japanese.")

Why This Task?

Math problems with language-specific thinking make a good development test because:

  • Not trivially easy, but not extremely hard
  • Not extensively pre-trained compared to pure English math
  • Clear signal for RL training (language detection + correctness)
  • Easy to verify during development

Design Notes

  • Japanese tags <思考>: Breaks model's association between thinking and English (models are heavily trained with English <think> tags)
  • Weak system prompt: Minimal instructions, lets RL rewards drive behavior
  • Optional dependency: langid in dev dependencies for CI, but not required for core library

@meta-cla meta-cla bot added the CLA Signed label Oct 31, 2025

This commit introduces a new reward function that encourages models to
think in a specific target language within their <think> tags.

Key changes:
- Add LanguageReward class to src/forge/data/rewards.py
  - Uses langid for language detection
  - Configurable target language (ISO 639-1 codes)
  - Returns full_reward for language match, no_match_reward otherwise
  - Raises helpful error if langid not installed

- Add comprehensive unit tests in tests/unit_tests/rl/test_language_reward.py
  - Tests for multiple languages (English, Japanese, Chinese, Spanish, etc.)
  - Tests for edge cases and error handling
  - All 28 tests passing

- Create sandbox/grpo_language/ app for experimentation
  - Extends apps/grpo/ with LanguageReward
  - Hardcoded to Japanese (ja) as default target language
  - Includes README with usage instructions
  - Config file for Qwen3-1.7B model

Implementation details:
- Extracts text from <think></think> tags for analysis
- Concatenates multiple thinking blocks for language detection
- Compatible with existing MathReward and ThinkingReward
- Does not add langid to requirements.txt (optional dependency)

Usage:
python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.yaml

Note: Requires 'pip install langid' before use
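
The "helpful error" mentioned above is presumably a guarded import; a minimal sketch of that pattern (the exact message in the forge code may differ):

try:
    import langid
except ImportError as exc:  # langid is optional, so fail with install instructions
    raise ImportError(
        "LanguageReward requires the optional 'langid' package; "
        "install it with `pip install langid`."
    ) from exc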
- Add fallback_reward parameter (default 0.2)
- If no <think> blocks found, check if response text is in target language
- Reward structure:
  * full_reward (1.0): Single block + correct language
  * partial_reward (0.5): Multiple blocks + correct language
  * fallback_reward (0.2): No blocks + correct language in response text
  * no_match_reward (0.0): Wrong language
- Update all tests to reflect new behavior (29 tests passing)
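
At this point in the history, the tier selection described above maps to roughly the following logic (hypothetical helper name; later commits simplify this to a plain match/no-match reward):

def select_reward(blocks, detected, target,
                  full_reward=1.0, partial_reward=0.5,
                  fallback_reward=0.2, no_match_reward=0.0):
    # Wrong language always scores no_match_reward, regardless of format.
    if detected != target:
        return no_match_reward
    if len(blocks) == 1:
        return full_reward      # single block + correct language
    if len(blocks) > 1:
        return partial_reward   # multiple blocks + correct language
    return fallback_reward      # no blocks, but response text is in the target language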
@casteryh casteryh force-pushed the language-reward-feature branch from be5552d to b15f171 on October 31, 2025 21:43
- Add debug prints to RewardActor showing:
  * Reward value
  * Number of thinking blocks
  * Detected language
  * Response sample

- Create debug_reward.py script for testing reward function
  * Tests 8 common scenarios
  * Shows detected language and confidence

- Add TROUBLESHOOTING.md with solutions for:
  * Model thinking in English instead of Japanese
  * Empty or missing thinking blocks
  * Short content that langid can't detect
  * Stronger system prompt alternatives
  * Expected training progression

This helps diagnose why LanguageReward might constantly return zero
Core Changes:
- Add debug mode to LanguageReward class
  * debug parameter: enables debug printing
  * debug_sample_rate: controls sampling (default 0.1 = 10%)
  * Prints detected language, confidence, reward, and sample text
  * Shows why reward was given (e.g., 'single block, correct language ✓')

- Strengthen system prompt for Japanese thinking
  * Add Japanese instructions at the top (あなたは数学の問題を解く..., roughly 'You solve math problems...')
  * List explicit CRITICAL RULES
  * Include concrete example in Japanese
  * Much more forceful about requiring Japanese in <think> tags

- Enable debug in sandbox app
  * Set debug=True, debug_sample_rate=0.1 for LanguageReward
  * Will print every ~10th response with language detection details

Example debug output:
  [LanguageReward] Found 1 thinking block(s)
    Target: ja | Detected: en | Confidence: -54.41
    Thinking sample: Let me solve this step by step...
    → Reward: 0.0 (wrong language) ✗

This should help diagnose why rewards are zero during training
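
The sampling behaviour described in this commit might look roughly like the following (a sketch assuming the debug print is gated by random sampling; the real class may organize it differently):

import random

import langid


def maybe_debug_print(response, blocks, target, reward, sample_rate=0.1):
    """Print language-detection details for roughly sample_rate of calls."""
    if random.random() >= sample_rate:
        return
    # Mirror the detection strategy: one block -> block content, else whole response.
    text = blocks[0] if len(blocks) == 1 else response
    detected, confidence = langid.classify(text)
    print(f"[LanguageReward] Found {len(blocks)} thinking block(s)")
    print(f"  Target: {target} | Detected: {detected} | Confidence: {confidence:.2f}")
    print(f"  Thinking sample: {text[:80]}")
    print(f"  → Reward: {reward} ({'correct language ✓' if reward > 0 else 'wrong language ✗'})")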

BREAKING CHANGE: Default tag for LanguageReward changed from 'think' to '思考'

Key changes:
- Both ThinkingReward and LanguageReward now accept 'tag' parameter
- ThinkingReward default: 'think' (backward compatible)
- LanguageReward default: '思考' (Japanese, breaks English associations)
- Sandbox app uses <思考> tags throughout
- System prompt updated to Japanese with <思考> examples
- All tests updated and passing (29/29)
- Debug script updated

Rationale: Models may be heavily trained to think in English when using
<think> tags. Using Japanese tags <思考> (shikō = 'thinking') breaks
this association and encourages thinking in the target language.
The old debug code was checking for <think> tags instead of <思考> tags,
which was misleading. The LanguageReward class now has its own debug mode
that correctly checks for the configured tag.
Changed from strong Japanese instructions with CRITICAL RULES to a simple
English prompt that just mentions the <思考> tags with a Japanese example.

This allows the RL training to do the work of encouraging Japanese thinking
through rewards rather than relying on heavy prompt engineering.
Changes:
- Removed sandbox/grpo_language/qwen3_1_7b.yaml (use configs from apps/grpo/)
- Updated usage comment in main.py to reference apps/grpo/qwen3_1_7b.yaml
- Updated README.md to reference apps/grpo/ configs
- Fixed README.md to use <思考> tags instead of <think> tags
Since ThinkingReward already enforces format (single vs multiple blocks),
LanguageReward now focuses purely on language detection with simplified logic:

Detection strategy:
- If exactly one thinking block: detect language of block content only
- Otherwise (no blocks or multiple blocks): detect language of whole response
- Returns match_reward (1.0) if language matches, no_match_reward (0.0) otherwise

Changes:
- Removed partial_reward and fallback_reward parameters (now just match/no-match)
- Renamed full_reward to match_reward for clarity
- Updated all 29 tests to match new behavior (all passing)
- Updated README with clearer explanation of reward separation
- Updated debug script with new expected rewards

This separation of concerns allows each reward to specialize:
- ThinkingReward: format enforcement
- LanguageReward: language detection
Added langid to [project.optional-dependencies] dev section so CI tests can run without failing on import.
@casteryh casteryh marked this pull request as ready for review October 31, 2025 23:55
Removed sandbox/grpo_language/debug_reward.py and updated TROUBLESHOOTING.md to remove references to it.
Replaced misleading "Start with English" section with explanation of why
English training won't work (already extensively pre-trained) and why
Japanese is the right choice (novel combination, clear RL signal).
Added test_custom_tag to verify that ThinkingReward correctly uses
the custom tag passed in during initialization. The test confirms:
- Responses with custom tag get full reward
- Responses with default tag get no reward when custom tag is used

All 26 ThinkingReward tests passing.
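
A rough shape of that test, assuming the (prompt, response) calling convention shown in the Usage section; the actual assertions and reward values in the test file may differ:

from forge.data.rewards import ThinkingReward


def test_custom_tag():
    reward = ThinkingReward(tag="思考")
    # A response using the custom tag earns the full reward...
    assert reward("", "<思考>まず式を整理する。</思考> The answer is 4.") > 0.0
    # ...while the default <think> tag earns nothing once a custom tag is configured.
    assert reward("", "<think>Simplify first.</think> The answer is 4.") == 0.0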
@casteryh casteryh requested a review from JenniferWang November 1, 2025 00:13
Increased match_reward from default 1.0 to 2.0 to give more weight
to language matching in the multi-objective reward.
Changed beta parameter from 0.1 to 0.0 in simple_grpo_loss to remove
the KL divergence penalty term from the loss.
Changed beta parameter from 0.0 to 1e-3 in simple_grpo_loss to add
a small KL divergence penalty.
Changed beta parameter from 1e-3 to 1e-4 in simple_grpo_loss.
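
For context on what beta controls, here is a generic GRPO-style loss sketch; the actual simple_grpo_loss signature, masking, and normalization differ, and the KL estimator shown is just one common choice:

import torch


def grpo_loss_sketch(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                     advantages: torch.Tensor, beta: float = 1e-4) -> torch.Tensor:
    # Policy-gradient term weighted by group-normalized advantages.
    pg_loss = -(advantages * logprobs).mean()
    # Per-token KL estimate against the reference policy (the "k3" estimator).
    log_ratio = ref_logprobs - logprobs
    kl = log_ratio.exp() - log_ratio - 1.0
    # beta scales the KL penalty: 0.0 removes it entirely, while small values
    # such as 1e-3 or 1e-4 keep a light anchor to the reference model.
    return pg_loss + beta * kl.mean()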
Apply the multi-epoch training fix from fix-multi-epoch-training branch:
- Add epoch tracking to DatasetActor
- Store base dataset for iterator reuse
- Restart iterator with set_epoch() when StopIteration occurs
- Add epoch completion logging and metrics

This allows training to continue beyond the first epoch instead of stopping
when the dataset is exhausted.
Endpoints cannot be called recursively. Changed the StopIteration
handler from `return await self.sample()` to using a while loop
that continues to retry getting the next sample after restarting
the iterator.
Instead of a while True loop, simply return next(self._iterator) after
restarting the iterator. This is cleaner and safer: if the dataset
is truly empty, it raises StopIteration instead of looping forever.
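
Put together, the final StopIteration handling might look like the method sketch below; the attribute names (_iterator, _base_dataset, _epoch) and the set_epoch call are assumptions based on these commit messages, not the exact DatasetActor code:

async def sample(self):
    try:
        return next(self._iterator)
    except StopIteration:
        # Dataset exhausted: advance the epoch and rebuild the iterator
        # instead of recursively re-calling this endpoint.
        self._epoch += 1
        self._base_dataset.set_epoch(self._epoch)
        self._iterator = iter(self._base_dataset)
        # If the dataset is truly empty this raises StopIteration again,
        # rather than retrying forever.
        return next(self._iterator)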