
Commit d51a6c4

Merge pull request #1 from HeRunming/main
Update quickstart REASONING pipeline.
2 parents 9240e9f + 5a3121a commit d51a6c4

6 files changed: +558 -9 lines changed

docs/.vuepress/notes/en/guide.ts

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ export const Guide: ThemeNote = defineNoteConfig({
   prefix: 'quickstart',
   items: [
     'install',
+    'ReasoningPipeline',
     // 'usage',
     // 'project-structure',
     // 'write',

docs/.vuepress/notes/zh/guide.ts

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ export const Guide: ThemeNote = defineNoteConfig({
   prefix: 'quickstart',
   items: [
     'install',
+    'ReasoningPipeline',
   ],
 },
 {
Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
---
title: Reasoning Data Synthesis Pipeline
icon: mdi:brain
createTime: 2025/06/16 13:08:42
permalink: /en/guide/reasoningpipeline/
---

# Reasoning Data Synthesis Pipeline

## 1. Overview

The **Reasoning Data Synthesis Pipeline** is an end-to-end framework that:
- Cleans and augments existing math QA datasets
- Generates high-quality answers complete with chain-of-thought (CoT) rationales

This pipeline natively handles four input scenarios:
1. Question only
2. Question + Golden Answer
3. Question + Existing Solution
4. Any combination of the above

Under the hood, it is split into three stages (**Question Processing**, **Branch Scheduling**, and **Answer Processing**), all orchestrated by configurable YAML specs and a unified `pipeline_step.py`. A single command triggers the entire pipeline and writes intermediate outputs at every stage.

---

## 2. One-Click Execution

Run the full pipeline with one of these scripts:

If **all samples include** golden answers:
```bash
bash ReasoningPipeline/pipeline_GT.sh
```

If **no samples include** golden answers:
```bash
bash ReasoningPipeline/pipeline_withoutGT.sh
```

For **mixed** scenarios:
```bash
bash ReasoningPipeline/pipeline_full.sh
```

> Each script loads its corresponding YAML config, invokes each operator in sequence, and writes intermediate files to the designated directories.
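
The choice between the three scripts can be sketched as a small helper. This is only an illustration: `has_golden_answer` and `choose_script` are hypothetical names that do not exist in the repository, and the sketch assumes each sample is a dict whose `golden_answer` field is either absent, empty, or filled in.

```python
from typing import Iterable, Mapping


def has_golden_answer(record: Mapping) -> bool:
    """True if the record carries a non-empty golden_answer (assumed criterion)."""
    value = record.get("golden_answer")
    return value is not None and str(value).strip() != ""


def choose_script(records: Iterable[Mapping]) -> str:
    """Return the one-click script that matches the dataset (hypothetical helper)."""
    flags = [has_golden_answer(r) for r in records]
    if flags and all(flags):
        # Every sample has a golden answer.
        return "ReasoningPipeline/pipeline_GT.sh"
    if not any(flags):
        # No sample has a golden answer (also covers an empty dataset).
        return "ReasoningPipeline/pipeline_withoutGT.sh"
    # Some samples have a golden answer, some do not.
    return "ReasoningPipeline/pipeline_full.sh"
```

The returned path can then be passed to `bash` exactly as in the commands above.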

---

## 3. Data Format

### 3.1 Input Data

- Supported formats: `json`, `jsonl`
- Fields:
  - `instruction` (required): the math problem prompt
  - `golden_answer`: the golden answer (if available)
  - `solution`: any existing solution or CoT (if available)
- Other fields are ignored; keep only what you need to avoid conflicts.
- Example (`json`):
  ```json
  {
    "instruction": "…For this super-ellipse… (a) find d; (b) write the hyperbola equation.",
    "golden_answer": "8",
    "source": "Bigmath_synth"
  }
  ```
- Demo dataset for quick testing:
  `demos/text_process/reasoners/pipeline_math.json`
  (contains question + golden answer)
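
Before running the pipeline, it can help to check that an input file matches the format described above. The following is a minimal sketch, not part of the repository: `parse_records` and `check_record` are hypothetical helpers, and the sketch assumes a `json` file holds either a single object or an array of objects, which this guide does not specify.

```python
import json


def parse_records(text: str, fmt: str) -> list:
    """Parse input text as 'json' (object or array, assumed) or 'jsonl'."""
    if fmt == "jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


def check_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    instruction = record.get("instruction")
    if not isinstance(instruction, str) or not instruction.strip():
        problems.append("missing or empty 'instruction'")
    return problems
```

For example, `parse_records(open(path, encoding="utf-8").read(), "jsonl")` returns a list of dicts that can each be passed to `check_record`.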

### 3.2 Output Data

- Format: `jsonl` (one file per pipeline stage)
- Key fields:
  - `instruction`: the question
  - `generated_cot`: model-generated chain-of-thought
  - `output`: model answer
  - `golden_answer`: ground-truth answer
  - `Synth_or_Input`: `input` (original data) or `synth` (synthesized by the pipeline)
  - `Difficulty`: score from 0 to 10
  - `primary_category`: primary category of math problem
  - `secondary_category`: secondary category of math problem
- Example:
  ```json
  {
    "instruction": "Given … find δ?",
    "generated_cot": "…detailed derivation…",
    "output": "δ = 30°",
    "golden_answer": "30",
    "Synth_or_Input": "input",
    "Difficulty": 4.0,
    "primary_category": "Geometry and Topology",
    "secondary_category": "Euclidean Geometry"
  }
  ```
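
A quick way to inspect a stage's output is to aggregate the key fields listed above. The sketch below is an illustration only: `summarize` is a hypothetical helper, and it assumes that each line of the output file is one JSON object and that some fields may be missing at early stages.

```python
import json
from collections import Counter


def summarize(lines) -> dict:
    """Summarize output records given an iterable of JSONL lines."""
    categories = Counter()
    difficulties = []
    synthesized = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        categories[rec.get("primary_category", "unknown")] += 1
        difficulty = rec.get("Difficulty")
        if isinstance(difficulty, (int, float)):
            difficulties.append(float(difficulty))
        if rec.get("Synth_or_Input") == "synth":
            synthesized += 1
    mean = sum(difficulties) / len(difficulties) if difficulties else None
    return {
        "by_category": dict(categories),
        "mean_difficulty": mean,
        "synthesized": synthesized,
    }
```

Passing an open file handle works as well as a list of strings, since both yield one line at a time.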

---

## 4. Pipeline & Operators

All steps are implemented as operators driven by `pipeline_step.py` and configured via YAML.

### 4.1 Question Processing Operators

1. **MathProblemFilter**
   - Function: Remove non-math questions
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/MathProblemFilter.yaml \
       --step_name MathProblemFilter \
       --step_type process
     ```

2. **QuestionGenerator**
   - Function: Prompt a large model to synthesize new math questions
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/QuestionGenerator.yaml \
       --step_name QuestionGenerator \
       --step_type generator
     ```

3. **QuestionVerify**
   - Function: Filter out questions with incorrect or missing conditions using [MathQ-Verify](https://arxiv.org/abs/2505.13903)
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/QuestionVerify.yaml \
       --step_name QuestionVerify \
       --step_type process
     ```

4. **QuestionDifficultyClassifier**
   - Function: Score and classify difficulty according to [Omni-Math](https://arxiv.org/abs/2410.07985) prompts
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/QuestionDifficultyClassifier.yaml \
       --step_name QuestionDifficultyClassifier \
       --step_type generator
     ```

5. **QuestionCategoryClassifier**
   - Function: Classify problems into seven primary categories and several secondary categories, following the [MSC-2020](https://msc2020.org/) classification
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/QuestionCategoryClassifier.yaml \
       --step_name QuestionCategoryClassifier \
       --step_type generator
     ```
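
All of the commands above share one shape, so they can be assembled programmatically. The helper below is a sketch under stated assumptions: `build_step_command` is a hypothetical name, and the only flags and step types it uses are the ones that appear in this guide (`--yaml_path`, `--step_name`, `--step_type`, with `process` or `generator`). The optional `yaml_name` argument covers steps whose YAML file is named differently from the step.

```python
from typing import List, Optional


def build_step_command(step_name: str, step_type: str,
                       yaml_name: Optional[str] = None) -> List[str]:
    """Build the argv list for one pipeline step (hypothetical helper)."""
    if step_type not in ("process", "generator"):
        # Only these two step types appear in this guide.
        raise ValueError(f"unexpected step type: {step_type}")
    yaml_file = f"ReasoningPipeline/yaml/{yaml_name or step_name}.yaml"
    return [
        "python", "pipeline_step.py",
        "--yaml_path", yaml_file,
        "--step_name", step_name,
        "--step_type", step_type,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root to execute the step.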

### 4.2 Pipeline Brancher (AnswerPipelineRoot)

- Function: Split data into two streams based on the presence of golden answers
- Output files:
  - `*_withGT.jsonl` (with golden answers)
  - `*_withoutGT.jsonl` (without golden answers)
- Command:
  ```bash
  python pipeline_step.py \
    --yaml_path ReasoningPipeline/yaml/AnswerPipelineRoot.yaml \
    --step_name AnswerPipelineRoot \
    --step_type generator
  ```
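
The branching idea can be sketched in a few lines. This is not the code of `AnswerPipelineRoot`: `split_by_golden_answer` is a hypothetical function, and treating a missing or blank `golden_answer` as "without golden answer" is an assumption about the criterion.

```python
def split_by_golden_answer(records):
    """Split records into (with_gt, without_gt) lists (assumed criterion)."""
    with_gt, without_gt = [], []
    for rec in records:
        answer = rec.get("golden_answer")
        if answer is not None and str(answer).strip():
            with_gt.append(rec)      # would go to *_withGT.jsonl
        else:
            without_gt.append(rec)   # would go to *_withoutGT.jsonl
    return with_gt, without_gt
```

Each list would then be written as one JSON object per line to the corresponding output file.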

### 4.3 Golden Answer Processing Operators

These operators run only on the “with golden answer” branch.

1. **AnswerGenerator**
   - Function: Generate answers with a detailed chain of thought (CoT)
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/AnswerGenerator.yaml \
       --step_name AnswerGenerator \
       --step_type generator
     ```

2. **AnswerFormatFilter**
   - Function: Discard answers that don’t match the expected format
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/AnswerFormatFilter.yaml \
       --step_name AnswerFormatFilter \
       --step_type process
     ```

3. **AnswerLengthFilter**
   - Function: Remove answers that are too long or too short
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/AnswerLengthFilter.yaml \
       --step_name AnswerLengthFilter \
       --step_type process
     ```

4. **AnswerGroundTruthFilter**
   - Function: Use [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) and [Math-Verify](https://github.com/huggingface/Math-Verify) to extract and verify the final answer
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/ReasonerAnsSelection.yaml \
       --step_name AnswerGroundTruthFilter \
       --step_type process
     ```

5. **AnswerNgramFilter**
   - Function: Deduplicate QA pairs with an n-gram similarity filter
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/ReasonerNgramFilter.yaml \
       --step_name AnswerNgramFilter \
       --step_type process
     ```
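
To make the n-gram deduplication idea concrete, here is one possible sketch. It is not the implementation of `AnswerNgramFilter`: the word-level trigrams, the Jaccard similarity measure, and the `0.8` threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of word-level n-grams of a text (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts are represented by a single shorter gram.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(texts, n: int = 3, threshold: float = 0.8) -> list:
    """Keep a text only if it is not too similar to any text kept so far."""
    kept, kept_grams = [], []
    for text in texts:
        grams = ngrams(text, n)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept
```

In this sketch each text would be a question concatenated with its answer. The pairwise comparison is quadratic in the number of kept texts, which is fine for a demonstration but would need an index (for example MinHash) on large datasets.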

### 4.4 No Golden Answer Processing Operators

These operators run only on the “without golden answer” branch.

1. **PseudoAnswerGenerator**
   - Function: Generate multiple candidate answers, then vote to select the majority as a pseudo-answer
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/PseudoAnswerGenerator.yaml \
       --step_name PseudoAnswerGenerator \
       --step_type generator
     ```

2. **AnswerFormatFilter**
   - Function: Discard answers that don’t match the expected format
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/ReasonerFormatFilter_withoutGT.yaml \
       --step_name AnswerFormatFilter \
       --step_type process
     ```

3. **AnswerLengthFilter**
   - Function: Remove answers that are too long or too short
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/ReasonerLengthFilter_withoutGT.yaml \
       --step_name AnswerLengthFilter \
       --step_type process
     ```

4. **AnswerNgramFilter**
   - Function: Deduplicate QA pairs with an n-gram similarity filter
   - Command:
     ```bash
     python pipeline_step.py \
       --yaml_path ReasoningPipeline/yaml/ReasonerNgramFilter_withoutGT.yaml \
       --step_name AnswerNgramFilter \
       --step_type process
     ```
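
The voting step described for **PseudoAnswerGenerator** can be illustrated with a short sketch. It assumes that candidate answers have already been extracted as strings and that exact string equality after trimming whitespace is an acceptable notion of agreement. The real operator may normalize or compare answers differently, and `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent non-empty candidate answer, or None."""
    normalized = [c.strip() for c in candidates if c and c.strip()]
    if not normalized:
        return None
    # most_common orders equal counts by first occurrence, so ties
    # resolve to the candidate that appeared first.
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer
```

A production version would likely compare answers for mathematical equivalence (for example, treating `0.5` and `1/2` as the same) rather than by exact string match.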
