Commit 8e7f2f9

Merge pull request #626 from RobotSail/fix-data-processing
fix edge case for qwen3 data processing
2 parents 92b4a45 + a05d1e4 commit 8e7f2f9

7 files changed

+3040
-86
lines changed

README.md

Lines changed: 74 additions & 1 deletion
@@ -14,11 +14,18 @@ The InstructLab Training library is an optimized model instruction-tuning librar
To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), or for general use, this library provides a simple pythonic training interface.

### Reasoning Content Support

The library now supports reasoning traces through the `reasoning_content` field in message samples. This lets you train reasoning-capable models that handle both regular content and structured reasoning traces, and that keep their thinking process separate from their final output.

## Usage and Guidance Sections

- [Installing](#installing-the-library)
- [Additional Nvidia packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Data format](#data-format)
- [Reasoning content support](#reasoning-content-support-1)
- [Documentation](#documentation)
- [Learning about the training arguments](#learning-about-training-arguments)
- [`TrainingArgs`](#trainingargs)
- [`DeepSpeedOptions`](#deepspeedoptions)
@@ -80,6 +87,72 @@ You can then define various training arguments. They will serve as the parameter
- [Learning about the training argument](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Data format

The library expects training data in the messages format, where each sample contains a list of messages with different roles (user, assistant, system, etc.). Each message should have at minimum:

- `role`: The role of the message sender (e.g., "user", "assistant", "system")
- `content`: The main content of the message

### Reasoning content support

The library now supports an optional `reasoning_content` field in addition to the standard `content` field. This enables training models with structured reasoning traces. The `reasoning_content` field is particularly useful for:

- Training reasoning-capable models that can separate their thinking process from their output
- Supporting models that need to generate internal reasoning traces
- Enabling step-by-step reasoning in model responses

> **Note**: This is only supported for models with chat templates that use the DeepSeek R1-style parser. Models without a custom thought processor, such as Phi-4, must still provide their reasoning traces in the `content` field.

**Example message structure with reasoning content:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": "What is 15 * 23?"
    },
    {
      "role": "assistant",
      "reasoning_content": "I need to multiply 15 by 23. Let me break this down: 15 * 23 = 15 * (20 + 3) = 15 * 20 + 15 * 3 = 300 + 45 = 345",
      "content": "15 * 23 = 345"
    }
  ]
}
```

**Standard message structure:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Hello! How are you?"
    },
    {
      "role": "assistant",
      "content": "Hello! I'm doing well, thank you for asking. How can I help you today?"
    }
  ]
}
```
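
Samples in either shape can be built and serialized with the Python standard library. The sketch below assumes one JSON object per line (JSONL) as the on-disk layout, which this README does not state explicitly, and `to_jsonl` is a hypothetical helper rather than part of the library.

```python
import json

# One sample with a reasoning trace and one without, mirroring the examples above.
samples = [
    {
        "messages": [
            {"role": "user", "content": "What is 15 * 23?"},
            {
                "role": "assistant",
                "reasoning_content": "15 * 23 = 15 * 20 + 15 * 3 = 300 + 45 = 345",
                "content": "15 * 23 = 345",
            },
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Hello! How are you?"},
            {"role": "assistant", "content": "Hello! I'm doing well."},
        ]
    },
]


def to_jsonl(records):
    """Serialize records as one JSON object per line (assumed layout)."""
    return "\n".join(json.dumps(record) for record in records)


text = to_jsonl(samples)

# Round-trip the text to confirm that no field is lost.
restored = [json.loads(line) for line in text.splitlines()]
print(restored == samples)  # True
```

Both shapes can live in the same file, since `reasoning_content` is optional per message.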
141+
142+
#### Important Notes
143+
144+
1. **Automatic reasoning content processing**: If `reasoning_content` exists in a message, it will always be processed and unmasked as long as the message role is targeted for unmasking. This ensures that reasoning traces are properly included in the training data.
145+
146+
2. **DeepSeek R1 Thinking Compatibility**: Models using the DeepSeek R1 thought processor (such as Qwen3) must supply their thinking traces in the `reasoning_content` field to be processed correctly. Failure to do so may result in improper handling of reasoning tokens and suboptimal training performance.
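
If an existing dataset stores thinking traces inline in `content`, one way to satisfy the second note is to move them into `reasoning_content` before training. The following is a minimal sketch, assuming the traces are delimited by `<think>...</think>` tags; that delimiter is an assumption about your data, and `split_reasoning` is a hypothetical helper, not a library function.

```python
import re

# Assumed delimiter for inline thinking traces; adjust it to match your data.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Move the first inline think block of an assistant message into `reasoning_content`."""
    content = message.get("content")
    if message.get("role") != "assistant" or not isinstance(content, str):
        return message
    if "reasoning_content" in message:
        return message  # already migrated, leave untouched
    match = THINK_PATTERN.search(content)
    if match is None:
        return message
    updated = dict(message)
    updated["reasoning_content"] = match.group(1).strip()
    # Everything outside the think block becomes the final answer.
    updated["content"] = (content[: match.start()] + content[match.end():]).strip()
    return updated


message = {
    "role": "assistant",
    "content": "<think>15 * 20 = 300 and 15 * 3 = 45, so 345.</think>\n15 * 23 = 345",
}
result = split_reasoning(message)
print(result["reasoning_content"])  # 15 * 20 = 300 and 15 * 3 = 45, so 345.
print(result["content"])            # 15 * 23 = 345
```

Messages from other roles, and assistant messages without a think block, pass through unchanged.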
147+
148+
## Documentation
149+
150+
For detailed information about specific features:
151+
152+
- **[Reasoning Content Support](docs/reasoning_content.md)**: Comprehensive guide to using the `reasoning_content` field for training reasoning-capable models
153+
- **[CI Documentation](docs/ci.md)**: Information about continuous integration processes
154+
- **[Logging Documentation](docs/logging.md)**: Guide to logging configuration and usage
155+
83156
## Learning about training arguments
84157

85158
The `TrainingArgs` class provides most of the customization options
@@ -378,4 +451,4 @@ Below is a list of custom environment variables users can set in the training li

## Developer Certificate of Origin

When you make a contribution to InstructLab training, you implicitly agree to the Developer Certificate of Origin terms as set in `DCO.txt` at the root of this repository.

docs/reasoning_content.md

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
# Reasoning Content Support

The InstructLab Training library supports structured reasoning traces through the `reasoning_content` field in message samples. This feature enables training models that can separate their thinking process from their final output.

## Overview

The `reasoning_content` field is an optional addition to the standard message format that allows you to include the model's internal reasoning process alongside the final response. This is particularly useful for:

- Training reasoning-capable models that show their work
- Supporting models that need to generate step-by-step reasoning
- Enabling chain-of-thought style training data
- Separating internal thinking from user-facing responses

## Message Format

### Standard Message Format

```json
{
  "role": "assistant",
  "content": "The answer is 42."
}
```

### Extended Message Format with Reasoning Content

```json
{
  "role": "assistant",
  "content": "The answer is 42.",
  "reasoning_content": "Let me think about this step by step. The question asks for the meaning of life, and according to The Hitchhiker's Guide to the Galaxy, the answer is 42."
}
```

## Data Processing Behavior

When processing messages during training:

1. **Unmasking Rules**: Both `content` and `reasoning_content` fields follow the same unmasking rules based on the message role
2. **Template Integration**: Both fields are processed by the chat template and included in the tokenized output
3. **Token Wrapping**: If a role is configured to be unmasked, both fields (when present) are wrapped with unmask tokens
4. **Independent Fields**: Either field can exist independently: a message can have only `content`, only `reasoning_content`, or both

## Usage Examples

### Training Data with Reasoning Traces

```json
{
  "messages": [
    {
      "role": "user",
      "content": "What is 15 * 23?"
    },
    {
      "role": "assistant",
      "reasoning_content": "I need to multiply 15 by 23. Let me break this down: 15 * 23 = 15 * (20 + 3) = 15 * 20 + 15 * 3 = 300 + 45 = 345",
      "content": "15 * 23 = 345"
    }
  ]
}
```

### Mixed Content Types

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Solve this math problem step by step: 2x + 5 = 13"
    },
    {
      "role": "assistant",
      "reasoning_content": "I need to solve for x. First, I'll subtract 5 from both sides: 2x = 8. Then divide by 2: x = 4.",
      "content": "To solve 2x + 5 = 13:\n1. Subtract 5 from both sides: 2x = 8\n2. Divide by 2: x = 4\n\nTherefore, x = 4."
    }
  ]
}
```

### Reasoning-Only Responses

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Think about the implications of AI safety."
    },
    {
      "role": "assistant",
      "reasoning_content": "This is a complex topic that requires careful consideration of multiple factors including alignment, capability control, and social implications..."
    }
  ]
}
```

## Implementation Details

### Token Processing

During data processing, the library:

1. Wraps `content` with the special unmask tokens `<|UNMASK_BEGIN|>` and `<|UNMASK_END|>`, and `reasoning_content` with `<|UNMASK_REASONING_BEGIN|>` and `<|UNMASK_REASONING_END|>`
2. Applies the chat template to the combined message content
3. Processes the tokenized sequence to create appropriate labels for training
4. Removes the special unmask tokens from the final training data
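
The steps above can be pictured with a toy sketch. The token strings are the ones named in this document; everything else is an assumption made for illustration: `wrap_message` and `build_labels` are hypothetical names, plain strings stand in for token IDs, the chat template step is skipped, and `-100` is used as the masked label value only because it is a common convention for ignored positions. This is not the library's actual implementation.

```python
UNMASK_BEGIN = "<|UNMASK_BEGIN|>"
UNMASK_END = "<|UNMASK_END|>"
UNMASK_REASONING_BEGIN = "<|UNMASK_REASONING_BEGIN|>"
UNMASK_REASONING_END = "<|UNMASK_REASONING_END|>"
IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def wrap_message(message, unmask_roles):
    """Step 1: wrap each text field of a message whose role is unmasked."""
    wrapped = dict(message)
    if message["role"] not in unmask_roles:
        return wrapped
    if "content" in message:
        wrapped["content"] = UNMASK_BEGIN + message["content"] + UNMASK_END
    if "reasoning_content" in message:
        wrapped["reasoning_content"] = (
            UNMASK_REASONING_BEGIN + message["reasoning_content"] + UNMASK_REASONING_END
        )
    return wrapped


def build_labels(tokens):
    """Steps 3 and 4: label tokens inside unmask regions and drop the markers."""
    begin = {UNMASK_BEGIN, UNMASK_REASONING_BEGIN}
    end = {UNMASK_END, UNMASK_REASONING_END}
    kept, labels, unmasked = [], [], False
    for token in tokens:
        if token in begin:
            unmasked = True
        elif token in end:
            unmasked = False
        else:
            kept.append(token)
            labels.append(token if unmasked else IGNORE_INDEX)
    return kept, labels


# Toy token stream standing in for a templated, tokenized conversation.
tokens = [
    "user:", "Hi",
    "assistant:",
    UNMASK_REASONING_BEGIN, "greet", UNMASK_REASONING_END,
    UNMASK_BEGIN, "Hello", UNMASK_END,
]
kept, labels = build_labels(tokens)
print(kept)    # ['user:', 'Hi', 'assistant:', 'greet', 'Hello']
print(labels)  # [-100, -100, -100, 'greet', 'Hello']
```

Only the tokens that sat between a begin and end marker carry a training label, and no marker survives in the output.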

### Validation

The library validates that:

- `content` and `reasoning_content` are strings if present
- Special unmask tokens are properly processed and removed
- The final training data contains no residual unmask tokens
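
A similar check can be run on your own processed output. The sketch below assumes you have the final sample available as decoded text; `residual_unmask_tokens` is a hypothetical helper, and the `<|user|>` and `<|assistant|>` role markers are made up for the example.

```python
UNMASK_TOKENS = (
    "<|UNMASK_BEGIN|>",
    "<|UNMASK_END|>",
    "<|UNMASK_REASONING_BEGIN|>",
    "<|UNMASK_REASONING_END|>",
)


def residual_unmask_tokens(text):
    """Return the unmask tokens that still appear in a decoded sample."""
    return [token for token in UNMASK_TOKENS if token in text]


clean = "<|user|>Hi<|assistant|>Hello"
dirty = "<|assistant|><|UNMASK_BEGIN|>Hello"
print(residual_unmask_tokens(clean))  # []
print(residual_unmask_tokens(dirty))  # ['<|UNMASK_BEGIN|>']
```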

### Error Handling

Common errors and their meanings:

- `"unmasking non-string data types is currently unsupported"`: The `content` field contains non-string data
- `"received an entry for reasoning_content which was not a string"`: The `reasoning_content` field contains non-string data
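
To catch these problems before a training run, a pre-flight check along the following lines may help. It reuses the two messages quoted above, but `validate_message` is a hypothetical helper and the use of `ValueError` is an assumption; this document does not say which exception type the library raises.

```python
def validate_message(message):
    """Raise if a text field is present but is not a string (illustrative check)."""
    if "content" in message and not isinstance(message["content"], str):
        raise ValueError("unmasking non-string data types is currently unsupported")
    if "reasoning_content" in message and not isinstance(
        message["reasoning_content"], str
    ):
        raise ValueError(
            "received an entry for reasoning_content which was not a string"
        )


# A well-formed message passes silently.
validate_message({"role": "assistant", "content": "ok", "reasoning_content": "fine"})

# A list-valued reasoning trace is rejected.
try:
    validate_message({"role": "assistant", "reasoning_content": ["step 1", "step 2"]})
except ValueError as error:
    print(error)  # received an entry for reasoning_content which was not a string
```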

## Integration with Existing Features

### Unmasking Policies

The `reasoning_content` field respects all existing unmasking policies:

- When `unmask=true` is set on a sample, both fields are unmasked for non-system roles
- When `unmask=false` (default), only assistant role messages are unmasked
- Custom unmask role configurations work with both fields
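
The first two rules can be sketched as a small role-selection function. This assumes `unmask` is a top-level boolean key on the sample, and `roles_to_unmask` is a hypothetical name; custom role configurations are not modeled here.

```python
def roles_to_unmask(sample):
    """Pick the roles whose `content` and `reasoning_content` get unmasked."""
    if sample.get("unmask", False):
        # unmask=true: every role present in the sample except system
        return {m["role"] for m in sample["messages"] if m["role"] != "system"}
    # unmask=false (default): assistant messages only
    return {"assistant"}


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 15 * 23?"},
    {"role": "assistant", "reasoning_content": "300 + 45 = 345", "content": "345"},
]
print(sorted(roles_to_unmask({"messages": messages, "unmask": True})))  # ['assistant', 'user']
print(sorted(roles_to_unmask({"messages": messages})))                  # ['assistant']
```

Whichever roles are selected, the same decision applies to both fields of a message.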

### Chat Templates

The `reasoning_content` field is not supported by the legacy chat templates and will not be rendered.

### Backward Compatibility

The feature is fully backward compatible:

- Existing datasets without `reasoning_content` continue to work unchanged
- All existing training configurations and arguments remain valid

## Testing

The library includes comprehensive tests for reasoning content functionality:

- Unit tests for message wrapping and processing
- Integration tests with real tokenizers
- Validation tests for error conditions
- Backward compatibility tests

## Important Notes

### Automatic Processing Behavior

1. **Always processed when present**: If `reasoning_content` exists in a message, it will always be processed and unmasked as long as the message role is targeted for unmasking. This ensures that reasoning traces are properly included in the training data without requiring additional configuration.

2. **DeepSeek R1 and Qwen3 compatibility**: Models using the DeepSeek R1 thought processor (such as Qwen3) **must** supply their thinking traces in the `reasoning_content` field to be processed correctly. Failure to do so may result in improper handling of reasoning tokens and suboptimal training performance.

3. **Separate token handling**: The library uses distinct unmask tokens for reasoning content (`<|UNMASK_REASONING_BEGIN|>` and `<|UNMASK_REASONING_END|>`) versus regular content (`<|UNMASK_BEGIN|>` and `<|UNMASK_END|>`), allowing for proper differentiation during training.

## Best Practices

1. **Consistent Usage**: When applicable, use `reasoning_content` consistently within a dataset for best results
2. **Clear Separation**: Keep reasoning traces separate from final outputs for clarity
3. **Template Compatibility**: Ensure your chat template properly handles both fields
4. **Validation**: Test your data processing pipeline with small samples before full training

## Migration Guide

To add reasoning content support to existing datasets:

1. Add `reasoning_content` fields to relevant messages
2. Ensure content is in string format
3. Test with a small sample using the data processing pipeline
4. Verify that unmask tokens are properly processed

No changes to training arguments or configuration are required.
