Commit 2625b28

Refactor to use configurable Japanese tags <思考> instead of English <think>
BREAKING CHANGE: Default tag for LanguageReward changed from 'think' to '思考'

Key changes:
- Both ThinkingReward and LanguageReward now accept 'tag' parameter
- ThinkingReward default: 'think' (backward compatible)
- LanguageReward default: '思考' (Japanese, breaks English associations)
- Sandbox app uses <思考> tags throughout
- System prompt updated to Japanese with <思考> examples
- All tests updated and passing (29/29)
- Debug script updated

Rationale: Models may be heavily trained to think in English when using <think> tags. Using Japanese tags <思考> (shikō = 'thinking') breaks this association and encourages thinking in the target language.
1 parent afca75c commit 2625b28
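
A note on the breaking change: because `LanguageReward` now defaults to `tag="思考"`, a caller that relied on the old default while the model still emits `<think>` blocks would presumably no longer find any thinking block. The sketch below reproduces only the block-matching pattern from `src/forge/data/rewards.py` using the standard `re` module. `build_block_re` is a hypothetical helper name introduced for illustration and is not part of the commit.

```python
import re


def build_block_re(tag: str) -> re.Pattern:
    """Hypothetical helper mirroring the block pattern LanguageReward builds for a tag."""
    escaped = re.escape(tag)
    return re.compile(rf"<\s*{escaped}\s*>(.*?)<\s*/\s*{escaped}\s*>", re.DOTALL)


english_tagged = "<think>2 + 2 = 4</think><answer>4</answer>"
japanese_tagged = "<思考>2 + 2 = 4</思考><answer>4</answer>"

new_default = build_block_re("思考")   # default tag after this commit
old_default = build_block_re("think")  # default tag before this commit

# The new default pattern only sees <思考> blocks.
new_matches_english = new_default.findall(english_tagged)
new_matches_japanese = new_default.findall(japanese_tagged)
# Building the pattern with "think" restores the previous matching behavior.
old_matches_english = old_default.findall(english_tagged)

print(new_matches_english, new_matches_japanese, old_matches_english)
```

Under these assumptions, code that still expects `<think>` output would need to pass `tag="think"` explicitly to `LanguageReward`.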

5 files changed, with 97 additions and 72 deletions


sandbox/grpo_language/README.md

Lines changed: 19 additions & 9 deletions
@@ -4,19 +4,25 @@ This sandbox app demonstrates using GRPO training with a language reward that en
 
 ## Overview
 
-This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<think></think>` tags) is in the target language.
+This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<思考></思考>` tags) is in the target language.
+
+**Key Insight**: Uses Japanese tags `<思考>` (shikō = "thinking") instead of English `<think>` tags to break the model's association between thinking tags and English language. This helps encourage multilingual thinking.
 
 ## Key Features
 
 - **Multi-objective training**: Combines three rewards:
   - `MathReward`: Evaluates correctness of math answers
-  - `ThinkingReward`: Encourages use of thinking tags
+  - `ThinkingReward`: Encourages use of `<思考>` tags
   - `LanguageReward`: Rewards thinking in target language (Japanese by default)
 
+- **Japanese thinking tags**: Uses `<思考>` instead of `<think>` to encourage non-English reasoning
+
 - **Language detection**: Uses `langid` to detect the language of thinking blocks
 
 - **Configurable target language**: While this app defaults to Japanese (`ja`), the `LanguageReward` can be configured for any ISO 639-1 language code
 
+- **Configurable tags**: Both rewards support custom tag names via the `tag` parameter
+
 ## Requirements
 
 Before running this app, install the required language detection library:
@@ -33,25 +39,29 @@ python -m sandbox.grpo_language.main --config sandbox/grpo_language/qwen3_1_7b.y
 
 ## How It Works
 
-1. The model receives a math problem and is instructed to use `<think>` tags for reasoning
+1. The model receives a math problem and is instructed to use `<思考>` tags for reasoning
 2. During training, the model generates responses with thinking blocks
 3. Three rewards are computed:
    - Math correctness (did it get the right answer?)
-   - Thinking usage (did it use thinking tags properly?)
+   - Thinking usage (did it use `<思考>` tags properly?)
    - Language usage (did it think in Japanese?)
 4. The model is trained to maximize all three rewards
 
 ## Configuration
 
-The target language is hardcoded as Japanese in `main.py` (line 321):
+### Target Language
+
+The target language is configured as Japanese in `main.py`:
 
 ```python
-LanguageReward(target_language="ja")
+LanguageReward(target_language="ja", tag="思考")
+ThinkingReward(tag="思考")
 ```
 
-To use a different language, modify this line with the appropriate ISO 639-1 code:
-- English: `"en"`
-- Chinese: `"zh"`
+To use a different language:
+1. Change `target_language` to the appropriate ISO 639-1 code:
+   - English: `"en"`
+   - Chinese: `"zh"`
 - Spanish: `"es"`
 - French: `"fr"`
 - etc.

sandbox/grpo_language/debug_reward.py

Lines changed: 8 additions & 12 deletions
@@ -15,27 +15,27 @@
 # Test cases mimicking what the model might generate
 test_cases = [
     # Case 1: Perfect - Japanese in single thinking block
-    ("<think>これは数学の問題です。2+2=4です。</think><answer>4</answer>", "Perfect Japanese"),
+    ("<思考>これは数学の問題です。2+2=4です。</思考><answer>4</answer>", "Perfect Japanese"),
     # Case 2: English thinking (most likely during training)
     (
-        "<think>This is a math problem. 2+2=4.</think><answer>4</answer>",
+        "<思考>This is a math problem. 2+2=4.</思考><answer>4</answer>",
         "English thinking",
     ),
     # Case 3: No thinking blocks at all
     ("The answer is 4.<answer>4</answer>", "No thinking blocks"),
     # Case 4: Empty thinking blocks
-    ("<think></think><answer>4</answer>", "Empty thinking block"),
+    ("<思考></思考><answer>4</answer>", "Empty thinking block"),
     # Case 5: Multiple thinking blocks in Japanese
     (
-        "<think>最初の考え。</think><think>次の考え。</think><answer>4</answer>",
+        "<思考>最初の考え。</思考><思考>次の考え。</思考><answer>4</answer>",
         "Multiple Japanese blocks",
     ),
     # Case 6: Just the answer, no thinking
     ("<answer>4</answer>", "Just answer tag"),
     # Case 7: Thinking with mostly numbers/symbols
-    ("<think>2 + 2 = 4</think><answer>4</answer>", "Mostly numbers"),
+    ("<思考>2 + 2 = 4</思考><answer>4</answer>", "Mostly numbers"),
     # Case 8: Mixed English and Japanese
-    ("<think>Let me think... これは簡単です。</think><answer>4</answer>", "Mixed languages"),
+    ("<思考>Let me think... これは簡単です。</思考><answer>4</answer>", "Mixed languages"),
 ]
 
 print("=" * 80)
@@ -51,9 +51,7 @@
     import langid
 
     # Extract thinking content if exists
-    think_match = re.findall(
-        r"<\s*think\s*>(.*?)<\s*/\s*think\s*>", response, re.IGNORECASE | re.DOTALL
-    )
+    think_match = re.findall(r"<\s*思考\s*>(.*?)<\s*/\s*思考\s*>", response, re.DOTALL)
 
     if think_match:
         content = " ".join(think_match)
@@ -65,9 +63,7 @@
         print(f" Detected language: {detected_lang} (confidence: {confidence:.3f})")
     else:
         # Check fallback
-        response_text = re.sub(
-            r"<\s*/?\s*think\s*>", "", response, flags=re.IGNORECASE
-        ).strip()
+        response_text = re.sub(r"<\s*/?\s*思考\s*>", "", response).strip()
         if response_text:
             detected_lang, confidence = langid.classify(response_text)
             print(f"\n{description}:")
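
The extraction and fallback steps in the debug script can be condensed into one function: join the contents of all tagged blocks, or fall back to the response with stray tags stripped. This is a minimal sketch, not the script's actual structure. `thinking_text` is a hypothetical name, and the language classifier is deliberately left out so the block runs without `langid` installed; in the debug script the resulting text is what gets passed to `langid.classify`.

```python
import re


def thinking_text(response: str, tag: str = "思考") -> tuple[str, bool]:
    """Return (text to classify, True if it came from complete tagged blocks)."""
    escaped = re.escape(tag)
    blocks = re.findall(
        rf"<\s*{escaped}\s*>(.*?)<\s*/\s*{escaped}\s*>", response, re.DOTALL
    )
    if blocks:
        # One or more complete blocks: classify their joined contents.
        return " ".join(blocks), True
    # Fallback: no complete block, so strip any stray opening/closing tags.
    stripped = re.sub(rf"<\s*/?\s*{escaped}\s*>", "", response).strip()
    return stripped, False


multi = thinking_text("<思考>最初の考え。</思考><思考>次の考え。</思考><answer>4</answer>")
no_block = thinking_text("The answer is 4.<answer>4</answer>")
dangling = thinking_text("<思考>途中で終わる")
empty = thinking_text("<思考></思考><answer>4</answer>")

print(multi, no_block, dangling, empty)
```

Note that an empty block still counts as a match here (the list `[""]` is truthy), which mirrors the `if think_match:` check in the script.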

sandbox/grpo_language/main.py

Lines changed: 10 additions & 10 deletions
@@ -243,18 +243,18 @@ def gsm8k_transform(sample):
     system_prompt = """
 あなたは数学の問題を解くAIアシスタントです。以下の重要なルールに従ってください:
 
-CRITICAL RULES:
-1. Put ALL your reasoning inside <think> and </think> tags
-2. You MUST think in Japanese (日本語) inside the <think> tags - use hiragana, katakana, and kanji
-3. NEVER use English inside <think> tags
-4. Put your final numerical answer inside <answer> and </answer> tags
+重要なルール (CRITICAL RULES):
+1. すべての思考過程を <思考> と </思考> タグの中に入れてください
+2. <思考> タグの中では必ず日本語で考えてください(ひらがな、カタカナ、漢字を使用)
+3. <思考> タグの中では絶対に英語を使わないでください
+4. 最終的な数値の答えを <answer> </answer> タグの中に入れてください
 
-Example:
+例 (Example):
 Question: What is 12 + 5?
-<think>12と5を足します。12 + 5 = 17です。したがって、答えは17です。</think>
+<思考>12と5を足します。12 + 5 = 17です。したがって、答えは17です。</思考>
 <answer>17</answer>
 
-Now solve the following problem using Japanese in your <think> tags:
+以下の問題を <思考> タグの中で日本語を使って解いてください:
     """
     request: str = sample["question"]
     as_chat = [
@@ -358,9 +358,9 @@ async def main(cfg: DictConfig):
         RewardActor.options(**cfg.services.reward_actor).as_service(
             reward_functions=[
                 MathReward(),
-                ThinkingReward(),
+                ThinkingReward(tag="思考"),  # Use Japanese tag
                 LanguageReward(
-                    target_language="ja", debug=True, debug_sample_rate=0.1
+                    target_language="ja", tag="思考", debug=True, debug_sample_rate=0.1
                 ),  # Japanese language reward with debug
             ]
         ),

src/forge/data/rewards.py

Lines changed: 25 additions & 9 deletions
@@ -57,15 +57,28 @@ def _to_float(self, text: str) -> float | None:
 
 
 class ThinkingReward:
-    """Reward class for evaluating use of <think> tags in reasoning."""
+    """Reward class for evaluating use of thinking tags in reasoning.
 
-    def __init__(self, partial_reward: float = 0.2, full_reward: float = 1.0):
+    Args:
+        partial_reward: Reward for partial tag usage (incomplete/malformed)
+        full_reward: Reward for well-formed thinking blocks with content
+        tag: Tag name to use (default "think", can use "思考" for Japanese, etc.)
+    """
+
+    def __init__(
+        self, partial_reward: float = 0.2, full_reward: float = 1.0, tag: str = "think"
+    ):
         self.partial_reward = partial_reward
         self.full_reward = full_reward
+        self.tag = tag
+        # Build regex patterns for the specified tag
         self._THINK_BLOCK_RE = re.compile(
-            r"<\s*think\s*>(.*?)<\s*/\s*think\s*>", re.IGNORECASE | re.DOTALL
+            rf"<\s*{re.escape(tag)}\s*>(.*?)<\s*/\s*{re.escape(tag)}\s*>",
+            re.IGNORECASE | re.DOTALL,
+        )
+        self._THINK_TAG_ATTEMPT_RE = re.compile(
+            rf"<\s*/?\s*{re.escape(tag)}\s*>", re.IGNORECASE
         )
-        self._THINK_TAG_ATTEMPT_RE = re.compile(r"<\s*/?\s*think\s*>", re.IGNORECASE)
 
     def __call__(self, prompt: str, response: str, target: str | None = None) -> float:
         """Compute thinking reward."""
@@ -83,7 +96,7 @@ def __call__(self, prompt: str, response: str, target: str | None = None) -> flo
 
 
 class LanguageReward:
-    """Reward class for evaluating the language used in <think> tags.
+    """Reward class for evaluating the language used in thinking tags.
 
     This reward uses langid to detect the language of text within thinking blocks
     and rewards responses that use the target language.
@@ -94,6 +107,7 @@ class LanguageReward:
         partial_reward: Reward when language matches but format is wrong (multiple blocks)
         fallback_reward: Reward when no valid blocks but response text is in target language
         no_match_reward: Reward when language doesn't match
+        tag: Tag name to use (default "思考" for multilingual, can use "think", etc.)
         debug: If True, print debug samples showing model outputs and detected language
         debug_sample_rate: Fraction of calls to debug (e.g., 0.1 = 10% of calls)
 
@@ -107,6 +121,7 @@ def __init__(
         partial_reward: float = 0.5,
         fallback_reward: float = 0.2,
         no_match_reward: float = 0.0,
+        tag: str = "思考",
         debug: bool = False,
         debug_sample_rate: float = 0.1,
     ):
@@ -115,12 +130,15 @@ def __init__(
         self.partial_reward = partial_reward
         self.fallback_reward = fallback_reward
         self.no_match_reward = no_match_reward
+        self.tag = tag
         self.debug = debug
         self.debug_sample_rate = debug_sample_rate
         self._debug_counter = 0
+        # Build regex pattern for the specified tag
         self._THINK_BLOCK_RE = re.compile(
-            r"<\s*think\s*>(.*?)<\s*/\s*think\s*>", re.IGNORECASE | re.DOTALL
+            rf"<\s*{re.escape(tag)}\s*>(.*?)<\s*/\s*{re.escape(tag)}\s*>", re.DOTALL
         )
+        self._TAG_PATTERN = rf"<\s*/?\s*{re.escape(tag)}\s*>"
 
         # Lazy import langid with helpful error message
         try:
@@ -164,9 +182,7 @@ def __call__(self, prompt: str, response: str, target: str | None = None) -> flo
         # If no thinking blocks found, check if response text is in target language
         if len(matches) == 0:
             # Remove any partial tags that might exist
-            response_text = re.sub(
-                r"<\s*/?\s*think\s*>", "", response, flags=re.IGNORECASE
-            ).strip()
+            response_text = re.sub(self._TAG_PATTERN, "", response).strip()
 
             if not response_text:
                 if should_debug:
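
One detail that may be worth checking in review: after this change `ThinkingReward` still compiles its patterns with `re.IGNORECASE`, while `LanguageReward` drops the flag. Case is irrelevant for `思考`, but with `tag="think"` the two classes could disagree on upper-case tags. The sketch below uses only the standard `re` module with the block pattern copied from the diff; the variable names are illustrative and not part of the commit.

```python
import re

tag = "think"
escaped = re.escape(tag)
block = rf"<\s*{escaped}\s*>(.*?)<\s*/\s*{escaped}\s*>"

# Flags as used by ThinkingReward after this commit.
thinking_style = re.compile(block, re.IGNORECASE | re.DOTALL)
# Flags as used by LanguageReward after this commit.
language_style = re.compile(block, re.DOTALL)

response = "<THINK>2 + 2 = 4</THINK><answer>4</answer>"

thinking_hits = thinking_style.findall(response)
language_hits = language_style.findall(response)

print(thinking_hits, language_hits)
```

If upper-case tags are meant to count for both rewards, using the same flags in both constructors would be one way to align them.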
