Refactor to use configurable Japanese tags <思考> instead of English <think>

BREAKING CHANGE: Default tag for LanguageReward changed from 'think' to '思考'

Key changes:
- Both ThinkingReward and LanguageReward now accept 'tag' parameter
- ThinkingReward default: 'think' (backward compatible)
- LanguageReward default: '思考' (Japanese, breaks English associations)
- Sandbox app uses <思考> tags throughout
- System prompt updated to Japanese with <思考> examples
- All tests updated and passing (29/29)
- Debug script updated

Rationale: Models may be heavily trained to think in English when using
<think> tags. Using Japanese tags <思考> (shikō = 'thinking') breaks
this association and encourages thinking in the target language.
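
For illustration only, here is a minimal sketch of how a configurable `tag` could drive extraction of thinking blocks. The helper names `extract_tagged` and `tag_presence_reward` and the 1.0/0.0 scoring rule are hypothetical and are not taken from the actual `ThinkingReward` or `LanguageReward` code; only the idea of a `tag` parameter with a '思考' default comes from this commit.

```python
import re


def extract_tagged(response: str, tag: str = "思考") -> list[str]:
    """Return the stripped text of every <tag>...</tag> block in `response`.

    Hypothetical helper, not part of the real reward classes.
    """
    # re.escape guards against tag names that contain regex metacharacters.
    escaped = re.escape(tag)
    # DOTALL lets a thinking block span multiple lines; .*? keeps matches non-greedy.
    pattern = re.compile(rf"<{escaped}>(.*?)</{escaped}>", re.DOTALL)
    return [block.strip() for block in pattern.findall(response)]


def tag_presence_reward(response: str, tag: str = "思考") -> float:
    """Assumed scoring rule: 1.0 if at least one non-empty block exists, else 0.0."""
    return 1.0 if any(extract_tagged(response, tag)) else 0.0


sample = "<思考>2 + 2 を計算する。</思考>答えは 4 です。"
print(extract_tagged(sample))                     # blocks found with the default tag
print(tag_presence_reward(sample, tag="think"))   # same text scored against <think>
```

Passing `tag="think"` to the same functions would restore the English-tag behavior, which is the kind of backward compatibility the message describes for ThinkingReward.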
sandbox/grpo_language/README.md (19 additions & 9 deletions)
@@ -4,19 +4,25 @@ This sandbox app demonstrates using GRPO training with a language reward that en

 ## Overview

-This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<think></think>` tags) is in the target language.
+This app extends the standard GRPO training (from `apps/grpo/`) by adding a `LanguageReward` that evaluates whether the model's thinking (text within `<思考></思考>` tags) is in the target language.
+
+**Key Insight**: Uses Japanese tags `<思考>` (shikō = "thinking") instead of English `<think>` tags to break the model's association between thinking tags and English language. This helps encourage multilingual thinking.

 ## Key Features

 - **Multi-objective training**: Combines three rewards:
   - `MathReward`: Evaluates correctness of math answers
-  - `ThinkingReward`: Encourages use of thinking tags
+  - `ThinkingReward`: Encourages use of `<思考>` tags
   - `LanguageReward`: Rewards thinking in target language (Japanese by default)

+- **Japanese thinking tags**: Uses `<思考>` instead of `<think>` to encourage non-English reasoning
+
 - **Language detection**: Uses `langid` to detect the language of thinking blocks

 - **Configurable target language**: While this app defaults to Japanese (`ja`), the `LanguageReward` can be configured for any ISO 639-1 language code

+- **Configurable tags**: Both rewards support custom tag names via the `tag` parameter
+
 ## Requirements

 Before running this app, install the required language detection library: