
Commit 3401f19

fix reasoning pipeline doc (add dataflow init guidance) (#130)
1 parent 8cb3a61 commit 3401f19

2 files changed: +100 −36 lines changed

docs/en/notes/guide/pipelines/ReasoningPipeline.md

Lines changed: 50 additions & 18 deletions
@@ -24,7 +24,54 @@ The main processes of the pipeline include:
 2. **Answer Generation and Processing**: Processing based on standard answers or model-generated answers for problems, including format filtering, length filtering, and correctness verification.
 3. **Data Deduplication**: Deduplicating generated question-answer data to ensure dataset quality.
 
-## 2. Data Flow and Pipeline Logic
+## 2. Quick Start
+
+### Step 1: Install the DataFlow Environment
+```shell
+pip install open-dataflow
+```
+
+### Step 2: Create a New DataFlow Working Directory
+```shell
+mkdir run_dataflow
+cd run_dataflow
+```
+
+### Step 3: Initialize DataFlow
+```shell
+dataflow init
+```
+You will see:
+```shell
+run_dataflow/pipelines/api_pipelines/reasoning_math_pipeline.py
+```
+
+### Step 4: Configure the API Key and API URL
+For Linux and macOS:
+```shell
+export DF_API_KEY="sk-xxxxx"
+```
+
+For Windows:
+```powershell
+$env:DF_API_KEY = "sk-xxxxx"
+```
+Configure the api_url in `reasoning_general_pipeline.py` as follows:
+```python
+self.llm_serving = APILLMServing_request(
+    api_url="https://api.openai.com/v1/chat/completions",
+    model_name="gpt-4o",
+    max_workers=100
+)
+```
+
+### Step 5: One-Click Execution
+```bash
+python pipelines/api_pipelines/reasoning_math_pipeline.py
+```
+You can also run any other pipeline script in the same way. The following sections introduce the operators used in the pipeline and how to configure their parameters.
+
+## 3. Data Flow and Pipeline Logic
 
 ### 1. **Input Data**
 
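Step 4 above wires the credential through the `DF_API_KEY` environment variable. As a minimal sketch of how a pipeline script can resolve that variable before constructing the serving object — the helper name `resolve_api_key` and its error message are hypothetical, not part of DataFlow:

```python
import os

def resolve_api_key(env_var: str = "DF_API_KEY") -> str:
    """Read the API key exported in Step 4; fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Point the user back at the export / $env: commands from Step 4.
        raise RuntimeError(
            f"{env_var} is not set; export it (Linux/macOS) or set $env:{env_var} (Windows) first"
        )
    return key
```

Failing fast like this surfaces a missing key immediately instead of as an authentication error deep inside the pipeline run.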

@@ -240,21 +287,6 @@ Finally, the output data generated by the pipeline will contain the following columns:
 * **primary\_category**: Primary category of the problem
 * **secondary\_category**: Secondary category of the problem
 
-## 3. Execution Methods
-
-The pipeline executes different configurations through simple Python commands to meet different data needs:
-
-* **Strong reasoning instruction fine-tuning data synthesis**:
-
-```bash
-python test/test_reasoning.py
-```
-
-* **Large-scale pretraining data synthesis**:
-
-```bash
-python test/test_reasoning_pretrain.py
-```
 
 ## 4. Pipeline Example
 
@@ -371,7 +403,7 @@ class ReasoningPipeline():
         )
 
 if __name__ == "__main__":
-    model = ReasoningPipeline()
-    model.forward()
+    pl = ReasoningPipeline()
+    pl.forward()
 ```
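The hunk above only renames the entry-point variable from `model` to `pl`. The surrounding pattern — a pipeline object constructed once and then driven by `forward()` — can be sketched with a toy stand-in; the class and stage names here are illustrative placeholders, not DataFlow's real operators:

```python
class MiniReasoningPipeline:
    """Toy stand-in for ReasoningPipeline: runs a fixed sequence of stages."""

    def __init__(self):
        # Illustrative stage names only; the real pipeline wires DataFlow operators.
        self.stages = ["question_filter", "answer_verify", "ngram_dedup"]
        self.completed = []

    def forward(self):
        # The real forward() would transform the dataset at each stage;
        # here we just record the order in which stages run.
        for stage in self.stages:
            self.completed.append(stage)
        return self.completed

if __name__ == "__main__":
    pl = MiniReasoningPipeline()
    pl.forward()
```

Keeping all configuration in `__init__` and all execution in `forward()` is what makes the one-click `python pipelines/api_pipelines/reasoning_math_pipeline.py` invocation from Step 5 possible.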

docs/zh/notes/guide/pipelines/ReasoningPipeline.md

Lines changed: 50 additions & 18 deletions
@@ -24,7 +24,54 @@ permalink: /zh/guide/reasoningpipeline/
 2. **Answer Generation and Processing**: Process questions based on their standard answers or model-generated answers, including format filtering, length filtering, and correctness verification.
 3. **Data Deduplication**: Deduplicate the generated question-answer data to ensure dataset quality.
 
-## 2. Data Flow and Pipeline Logic
+## 2. Quick Start
+
+### Step 1: Install the DataFlow Environment
+```shell
+pip install open-dataflow
+```
+
+### Step 2: Create a New DataFlow Working Directory
+```shell
+mkdir run_dataflow
+cd run_dataflow
+```
+
+### Step 3: Initialize DataFlow
+```shell
+dataflow init
+```
+You will then see:
+```shell
+run_dataflow/pipelines/api_pipelines/reasoning_math_pipeline.py
+```
+
+### Step 4: Fill in Your API Key and api_url
+For Linux and macOS:
+```shell
+export DF_API_KEY="sk-xxxxx"
+```
+
+For Windows:
+```powershell
+$env:DF_API_KEY = "sk-xxxxx"
+```
+Fill in the api_url in `reasoning_general_pipeline.py` as follows:
+```python
+self.llm_serving = APILLMServing_request(
+    api_url="https://api.openai.com/v1/chat/completions",
+    model_name="gpt-4o",
+    max_workers=100
+)
+```
+
+### Step 5: One-Click Execution
+```bash
+python pipelines/api_pipelines/reasoning_math_pipeline.py
+```
+You can also run any other pipeline script in the same way. The following sections introduce the operators used in the pipeline and how to configure their parameters.
+
+## 3. Data Flow and Pipeline Logic
 
 ### 1. **Input Data**

@@ -240,21 +287,6 @@ ngram_filter = ReasoningAnswerNgramFilter(
 * **primary\_category**: Primary category of the problem
 * **secondary\_category**: Secondary category of the problem
 
-## 3. Execution Methods
-
-The pipeline executes different configurations through simple Python commands to meet different data needs:
-
-* **Strong reasoning instruction fine-tuning data synthesis**
-
-```bash
-python test/test_reasoning.py
-```
-
-* **Large-scale pretraining data synthesis**
-
-```bash
-python test/test_reasoning_pretrain.py
-```
 
 ## 4. Pipeline Example
 The following example pipeline demonstrates how to use multiple operators for reasoning data processing. It shows how to initialize a reasoning data processing pipeline and execute each filtering and cleaning step in sequence.
@@ -370,7 +402,7 @@ class ReasoningPipeline():
         )
 
 if __name__ == "__main__":
-    model = ReasoningPipeline()
-    model.forward()
+    pl = ReasoningPipeline()
+    pl.forward()
 ```
