| 1 | +--- |
| 2 | +title: Reasoning Data Synthesis Pipeline |
| 3 | +icon: mdi:brain |
| 4 | +createTime: 2025/06/16 13:08:42 |
| 5 | +permalink: /en/guide/reasoningpipeline/ |
| 6 | +--- |
| 7 | + |
| 8 | +# Reasoning Data Synthesis Pipeline |
| 9 | + |
| 10 | +## 1. Overview |
| 11 | + |
| 12 | +The **Reasoning Data Synthesis Pipeline** is an end-to-end framework to: |
| 13 | +- Clean and augment existing math QA datasets |
| 14 | +- Generate high-quality answers complete with chain-of-thought (CoT) rationales |
| 15 | + |
| 16 | +This pipeline natively handles four input scenarios: |
| 17 | +1. Question only |
| 18 | +2. Question + Golden Answer |
| 19 | +3. Question + Existing Solution |
| 20 | +4. Any combination of the above |
| 21 | + |
| 22 | +Under the hood, the pipeline is split into three stages (**Question Processing**, **Branch Scheduling**, and **Answer Processing**), all orchestrated by configurable YAML specs and a unified `pipeline_step.py` script. A single command triggers the entire pipeline and writes intermediate outputs at every stage. |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## 2. One-Click Execution |
| 27 | + |
| 28 | +Run the full pipeline with one of these scripts: |
| 29 | + |
| 30 | +If **all samples include** golden answers: |
| 31 | +```bash |
| 32 | +bash ReasoningPipeline/pipeline_GT.sh |
| 33 | +``` |
| 34 | + |
| 35 | +If **no samples include** golden answers: |
| 36 | +```bash |
| 37 | +bash ReasoningPipeline/pipeline_withoutGT.sh |
| 38 | +``` |
| 39 | + |
| 40 | +For **mixed** scenarios: |
| 41 | +```bash |
| 42 | +bash ReasoningPipeline/pipeline_full.sh |
| 43 | +``` |
| 44 | + |
| 45 | +> Each script loads its corresponding YAML config, invokes each operator in sequence, and writes intermediate files to the designated directories. |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +## 3. Data Format |
| 50 | + |
| 51 | +### 3.1 Input Data |
| 52 | + |
| 53 | +- Supported formats: `json`, `jsonl` |
| 54 | +- Fields: |
| 55 | + • `instruction` (required): the math problem prompt |
| 56 | + • `golden_answer` (optional): golden answer, if available |
| 57 | + • `solution` (optional): any existing solution or CoT |
| 58 | +- Any other fields are ignored; keep only what you need to avoid conflicts. |
| 59 | +- Example (`json`): |
| 60 | + ```json |
| 61 | + { |
| 62 | + "instruction": "…For this super-ellipse… (a) find d; (b) write the hyperbola equation.", |
| 63 | + "golden_answer": "8", |
| 64 | + "source": "Bigmath_synth" |
| 65 | + } |
| 66 | + ``` |
| 67 | +- Demo dataset for quick testing: |
| 68 | + `demos/text_process/reasoners/pipeline_math.json` |
| 69 | + (contains question + golden answer) |
| 70 | + |
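Before launching the pipeline, it can help to check that every record carries the fields listed above. The following is a minimal sketch, not part of the pipeline itself: the helper names `load_records` and `validate_record` are hypothetical, and it assumes that a `json` file holds either a single object or a list of objects and that `golden_answer` and `solution` are strings when present.

```python
import json


def load_records(text: str, fmt: str) -> list[dict]:
    """Parse `jsonl` (one object per line) or `json` (one object or a list)."""
    if fmt == "jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def validate_record(record: dict) -> list[str]:
    """Return the problems found in one input record (empty list if none)."""
    problems = []
    instruction = record.get("instruction")
    if not isinstance(instruction, str) or not instruction.strip():
        problems.append("missing or empty 'instruction'")
    # Assumption: the optional fields are plain strings, as in the examples.
    for key in ("golden_answer", "solution"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"'{key}' should be a string")
    return problems


sample = (
    '{"instruction": "Compute 2 + 2.", "golden_answer": "4"}\n'
    '{"golden_answer": "8"}'
)
records = load_records(sample, "jsonl")
report = [validate_record(r) for r in records]
print(report)
```

Records with a non-empty report can be fixed or dropped before the data is handed to the pipeline scripts.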
| 71 | +### 3.2 Output Data |
| 72 | + |
| 73 | +- Format: `jsonl` (one file per pipeline stage) |
| 74 | +- Key fields: |
| 75 | + • `instruction`: the question |
| 76 | + • `generated_cot`: model-generated chain-of-thought |
| 77 | + • `output`: model answer |
| 78 | + • `golden_answer`: ground-truth answer |
| 79 | + • `Synth_or_Input`: `input` (original data) or `synth` (synthesized by the pipeline) |
| 80 | + • `Difficulty`: score from 0 to 10 |
| 81 | + • `primary_category`: primary category of math problem |
| 82 | + • `secondary_category`: secondary category of math problem |
| 83 | +- Example: |
| 84 | + ```json |
| 85 | + { |
| 86 | + "instruction": "Given … find δ?", |
| 87 | + "generated_cot": "…detailed derivation…", |
| 88 | + "output": "δ = 30°", |
| 89 | + "golden_answer": "30", |
| 90 | + "Synth_or_Input": "input", |
| 91 | + "Difficulty": 4.0, |
| 92 | + "primary_category": "Geometry and Topology", |
| 93 | + "secondary_category": "Euclidean Geometry" |
| 94 | + } |
| 95 | + ``` |
| 96 | + |
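Because each stage writes `jsonl`, the output can be inspected with a few lines of standard-library Python. The sketch below is illustrative only: `summarize` is a hypothetical helper that counts records per `primary_category` and averages `Difficulty`, and the demo lines stand in for a real stage output file.

```python
import json
from collections import Counter


def summarize(lines):
    """Count records per primary category and average the Difficulty score."""
    categories = Counter()
    difficulties = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        categories[record.get("primary_category", "unknown")] += 1
        if isinstance(record.get("Difficulty"), (int, float)):
            difficulties.append(float(record["Difficulty"]))
    mean = sum(difficulties) / len(difficulties) if difficulties else None
    return categories, mean


# Stand-in for the lines of a stage output file.
demo_lines = [
    json.dumps({"instruction": "q1", "Difficulty": 4.0,
                "primary_category": "Geometry and Topology"}),
    json.dumps({"instruction": "q2", "Difficulty": 6.0,
                "primary_category": "Geometry and Topology"}),
]
categories, mean_difficulty = summarize(demo_lines)
print(categories, mean_difficulty)
```

For a real file, pass an open file handle instead of `demo_lines`, since iterating over a file yields one line at a time.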
| 97 | +--- |
| 98 | + |
| 99 | +## 4. Pipeline & Operators |
| 100 | + |
| 101 | +All steps are implemented as operators driven by `pipeline_step.py` and configured via YAML. |
| 102 | + |
| 103 | +### 4.1 Question Processing Operators |
| 104 | + |
| 105 | +1. **MathProblemFilter** |
| 106 | + - Function: Remove non-math questions |
| 107 | + - Command: |
| 108 | + ```bash |
| 109 | + python pipeline_step.py \ |
| 110 | + --yaml_path ReasoningPipeline/yaml/MathProblemFilter.yaml \ |
| 111 | + --step_name MathProblemFilter \ |
| 112 | + --step_type process |
| 113 | + ``` |
| 114 | + |
| 115 | +2. **QuestionGenerator** |
| 116 | + - Function: Prompt a large model to synthesize new math questions |
| 117 | + - Command: |
| 118 | + ```bash |
| 119 | + python pipeline_step.py \ |
| 120 | + --yaml_path ReasoningPipeline/yaml/QuestionGenerator.yaml \ |
| 121 | + --step_name QuestionGenerator \ |
| 122 | + --step_type generator |
| 123 | + ``` |
| 124 | + |
| 125 | +3. **QuestionVerify** |
| 126 | + - Function: Filter out questions with incorrect or missing conditions using [MathQ-Verify](https://arxiv.org/abs/2505.13903) |
| 127 | + - Command: |
| 128 | + ```bash |
| 129 | + python pipeline_step.py \ |
| 130 | + --yaml_path ReasoningPipeline/yaml/QuestionVerify.yaml \ |
| 131 | + --step_name QuestionVerify \ |
| 132 | + --step_type process |
| 133 | + ``` |
| 134 | + |
| 135 | +4. **QuestionDifficultyClassifier** |
| 136 | + - Function: Score and classify difficulty according to [Omni-Math](https://arxiv.org/abs/2410.07985) prompts |
| 137 | + - Command: |
| 138 | + ```bash |
| 139 | + python pipeline_step.py \ |
| 140 | + --yaml_path ReasoningPipeline/yaml/QuestionDifficultyClassifier.yaml \ |
| 141 | + --step_name QuestionDifficultyClassifier \ |
| 142 | + --step_type generator |
| 143 | + ``` |
| 144 | + |
| 145 | +5. **QuestionCategoryClassifier** |
| 146 | + - Function: Classify problems into seven primary categories and several secondary categories, following the [MSC-2020](https://msc2020.org/) classification |
| 147 | + - Command: |
| 148 | + ```bash |
| 149 | + python pipeline_step.py \ |
| 150 | + --yaml_path ReasoningPipeline/yaml/QuestionCategoryClassifier.yaml \ |
| 151 | + --step_name QuestionCategoryClassifier \ |
| 152 | + --step_type generator |
| 153 | + ``` |
| 154 | + |
| 155 | +### 4.2 Pipeline Brancher (AnswerPipelineRoot) |
| 156 | + |
| 157 | +- Function: Split data into two streams based on the presence of golden answers |
| 158 | +- Output files: |
| 159 | + - `*_withGT.jsonl` (with golden answers) |
| 160 | + - `*_withoutGT.jsonl` (without golden answers) |
| 161 | +- Command: |
| 162 | +```bash |
| 163 | +python pipeline_step.py \ |
| 164 | + --yaml_path ReasoningPipeline/yaml/AnswerPipelineRoot.yaml \ |
| 165 | + --step_name AnswerPipelineRoot \ |
| 166 | + --step_type generator |
| 167 | +``` |
| 168 | + |
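The branching rule can be pictured as a simple partition of the records. The sketch below is an assumption about the behavior, not the operator's actual code: `has_golden_answer` and `split_by_golden_answer` are hypothetical names, and a record counts as "with golden answer" when its `golden_answer` field is present and non-empty.

```python
def has_golden_answer(record: dict) -> bool:
    """True when the record carries a non-empty golden answer (assumed rule)."""
    value = record.get("golden_answer")
    return value is not None and str(value).strip() != ""


def split_by_golden_answer(records):
    """Partition records into (with_gt, without_gt), preserving order."""
    with_gt, without_gt = [], []
    for record in records:
        (with_gt if has_golden_answer(record) else without_gt).append(record)
    return with_gt, without_gt


demo = [
    {"instruction": "q1", "golden_answer": "8"},
    {"instruction": "q2"},
    {"instruction": "q3", "golden_answer": "  "},
]
with_gt, without_gt = split_by_golden_answer(demo)
print(len(with_gt), len(without_gt))
```

Each list would then be written to its own file, corresponding to the `*_withGT.jsonl` and `*_withoutGT.jsonl` outputs listed above.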
| 169 | +### 4.3 Golden Answer Processing Operators |
| 170 | + |
| 171 | +These operators run only on the “with golden answer” branch. |
| 172 | + |
| 173 | +1. **AnswerGenerator** |
| 174 | + - Function: Generate answers with detailed CoT (Chain of Thought) |
| 175 | + - Command: |
| 176 | + ```bash |
| 177 | + python pipeline_step.py \ |
| 178 | + --yaml_path ReasoningPipeline/yaml/AnswerGenerator.yaml \ |
| 179 | + --step_name AnswerGenerator \ |
| 180 | + --step_type generator |
| 181 | + ``` |
| 182 | + |
| 183 | +2. **AnswerFormatFilter** |
| 184 | + - Function: Discard answers that don’t match the expected format |
| 185 | + - Command: |
| 186 | + ```bash |
| 187 | + python pipeline_step.py \ |
| 188 | + --yaml_path ReasoningPipeline/yaml/AnswerFormatFilter.yaml \ |
| 189 | + --step_name AnswerFormatFilter \ |
| 190 | + --step_type process |
| 191 | + ``` |
| 192 | + |
| 193 | +3. **AnswerLengthFilter** |
| 194 | + - Function: Remove answers that are too long or too short |
| 195 | + - Command: |
| 196 | + ```bash |
| 197 | + python pipeline_step.py \ |
| 198 | + --yaml_path ReasoningPipeline/yaml/AnswerLengthFilter.yaml \ |
| 199 | + --step_name AnswerLengthFilter \ |
| 200 | + --step_type process |
| 201 | + ``` |
| 202 | + |
| 203 | +4. **AnswerGroundTruthFilter** |
| 204 | + - Function: Use [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) and [Math-Verify](https://github.com/huggingface/Math-Verify) to extract and verify the final answer |
| 205 | + - Command: |
| 206 | + ```bash |
| 207 | + python pipeline_step.py \ |
| 208 | + --yaml_path ReasoningPipeline/yaml/ReasonerAnsSelection.yaml \ |
| 209 | + --step_name AnswerGroundTruthFilter \ |
| 210 | + --step_type process |
| 211 | + ``` |
| 212 | + |
| 213 | +5. **AnswerNgramFilter** |
| 214 | + - Function: Deduplicate QA pairs with an n-gram similarity filter |
| 215 | + - Command: |
| 216 | + ```bash |
| 217 | + python pipeline_step.py \ |
| 218 | + --yaml_path ReasoningPipeline/yaml/ReasonerNgramFilter.yaml \ |
| 219 | + --step_name AnswerNgramFilter \ |
| 220 | + --step_type process |
| 221 | + ``` |
| 222 | + |
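To make the n-gram step in the list above more concrete, here is one possible reading of "n-gram similarity" as a sketch. It is not the `AnswerNgramFilter` implementation: the word-level trigrams, the Jaccard measure, the `0.8` threshold, and the helper names are all assumptions chosen for illustration.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams; short texts yield one shorter tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(records, n: int = 3, threshold: float = 0.8):
    """Keep a QA pair only if it is not too similar to any pair kept so far."""
    kept, kept_grams = [], []
    for record in records:
        text = f"{record.get('instruction', '')} {record.get('output', '')}"
        grams = word_ngrams(text, n)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(record)
            kept_grams.append(grams)
    return kept


pairs = [
    {"instruction": "What is 2 + 2 ?", "output": "The answer is 4"},
    {"instruction": "What is 2 + 2 ?", "output": "The answer is 4"},
    {"instruction": "Find the area of a unit square .", "output": "The area is 1"},
]
unique = dedup(pairs)
print(len(unique))
```

This pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would need indexing or hashing on large datasets.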
| 223 | +### 4.4 No Golden Answer Processing Operators |
| 224 | + |
| 225 | +These operators run only on the “without golden answer” branch. |
| 226 | + |
| 227 | +1. **PseudoAnswerGenerator** |
| 228 | + - Function: Generate multiple candidate answers, then vote to select the majority as a pseudo-answer |
| 229 | + - Command: |
| 230 | + ```bash |
| 231 | + python pipeline_step.py \ |
| 232 | + --yaml_path ReasoningPipeline/yaml/PseudoAnswerGenerator.yaml \ |
| 233 | + --step_name PseudoAnswerGenerator \ |
| 234 | + --step_type generator |
| 235 | + ``` |
| 236 | + |
| 237 | +2. **AnswerFormatFilter** |
| 238 | + - Function: Discard answers not conforming to the expected format |
| 239 | + - Command: |
| 240 | + ```bash |
| 241 | + python pipeline_step.py \ |
| 242 | + --yaml_path ReasoningPipeline/yaml/ReasonerFormatFilter_withoutGT.yaml \ |
| 243 | + --step_name AnswerFormatFilter \ |
| 244 | + --step_type process |
| 245 | + ``` |
| 246 | + |
| 247 | +3. **AnswerLengthFilter** |
| 248 | + - Function: Remove answers that are too long or too short |
| 249 | + - Command: |
| 250 | + ```bash |
| 251 | + python pipeline_step.py \ |
| 252 | + --yaml_path ReasoningPipeline/yaml/ReasonerLengthFilter_withoutGT.yaml \ |
| 253 | + --step_name AnswerLengthFilter \ |
| 254 | + --step_type process |
| 255 | + ``` |
| 256 | + |
| 257 | +4. **AnswerNgramFilter** |
| 258 | + - Function: Deduplicate QA pairs with an n-gram similarity filter |
| 259 | + - Command: |
| 260 | + ```bash |
| 261 | + python pipeline_step.py \ |
| 262 | + --yaml_path ReasoningPipeline/yaml/ReasonerNgramFilter_withoutGT.yaml \ |
| 263 | + --step_name AnswerNgramFilter \ |
| 264 | + --step_type process |
| 265 | + ``` |
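The voting idea behind **PseudoAnswerGenerator** in the list above can be sketched in a few lines. This is an illustrative assumption rather than the operator's code: `majority_vote` is a hypothetical helper, candidates are compared as whitespace-stripped strings, and ties go to the answer that appeared first.

```python
from collections import Counter


def majority_vote(candidates):
    """Return (most common answer, share of votes) or (None, 0.0) if empty."""
    normalized = [c.strip() for c in candidates if c and c.strip()]
    if not normalized:
        return None, 0.0
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Four hypothetical candidate answers sampled for the same question.
pseudo_answer, agreement = majority_vote(["30", "30", " 45 ", "30"])
print(pseudo_answer, agreement)
```

The agreement ratio is a natural by-product of the vote and could be used to discard questions where the candidates disagree too much, although the source does not say whether the pipeline does so.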