
Commit 217652f

pdf2model&text2model-v2 (#109)
Co-authored-by: Ma, Xiaochen <mxch1122@126.com>
1 parent dde6f7d commit 217652f

File tree

4 files changed (+300, -8 lines)


docs/en/notes/guide/pipelines/Pdf2ModelPipeline.md

Lines changed: 7 additions & 3 deletions
@@ -121,10 +121,14 @@ Project Root/
-## Step 6: Chat with Fine-tuned Model
+## **Step 6: Chat with Fine-tuned Model**

```bash
-# --model can specify the path location of the chat model (optional)
-# Default value is .cache/saves/qwen2.5_7b_sft_model
+# Method 1: Specify model path with --model flag (optional)
+# Default path: .cache/saves/pdf2model_cache_{timestamp}
dataflow chat --model ./custom_model_path
+
+# Method 2: Navigate to model directory and run dataflow chat
+cd .cache/saves/pdf2model_cache_20250901_143022
+dataflow chat
```
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
---
title: Text2ModelPipeline
createTime: 2025/08/31 03:42:49
permalink: /en/guide/uw6hfcwp/
---
# DataFlow-text2model & LlamaFactory

A complete text processing and training pipeline with intelligent Text2QA generation capabilities.

## Quick Start

```bash
# Environment setup
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e .
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]

# Model download
# At the first prompt, either option works; at the second prompt, select "all"
mineru-models-download

# Run the pipeline
cd ..
mkdir test
cd test

# Initialize
dataflow text2model init

# Train
dataflow text2model train

# Chat with the trained model (a locally trained model also works)
dataflow chat
```

## Step 1: Install DataFlow Environment

```bash
# Create environment
conda create -n dataflow python=3.10

# Activate environment
conda activate dataflow

# Enter root directory
cd DataFlow

# Install mineru base environment
pip install -e .

# Install llamafactory environment
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]
mineru-models-download
```

## Step 2: Create New DataFlow Working Folder

```bash
mkdir run_dataflow
cd run_dataflow
```

## Step 3: Setup Dataset

Place an appropriately sized dataset (data files in JSON or JSONL format) into the working folder.
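As a minimal sketch of what such a file might look like (the file name `corpus.jsonl` and its contents are hypothetical), each record only needs the field that training will read, which defaults to `text` (see `--input-keys` in Step 5):

```bash
# Hypothetical example: create a tiny JSONL dataset in the working folder.
# Each line is one JSON object; the default field read during training is "text".
cat > corpus.jsonl << 'EOF'
{"text": "DataFlow is a system for preparing training data for LLMs."}
{"text": "Text2QA turns raw text passages into question-answer pairs for SFT."}
EOF
```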

## Step 4: Initialize DataFlow-text2model

```bash
# Initialize
# --cache can specify the .cache directory location (optional)
# Default value is the current working directory
dataflow text2model init
```
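For instance, to keep the cache somewhere other than the working folder, the `--cache` flag described above can be passed explicitly. This is a sketch assuming `--cache` takes the directory path as its value; the path itself is just an illustration:

```bash
# Hypothetical usage: place the cache directory at a custom location.
dataflow text2model init --cache ./my_dataflow_cache
```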

After initialization, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py       # Pipeline execution file
├── text_2_qa_pipeline.py      # Text2QA generation pipeline
├── merge_filter_qa_pairs.py   # QA format conversion script
└── .cache/                    # Cache directory
    └── train_config.yaml      # Default configuration file for llamafactory training
```

## Step 5: One-Click Fine-tuning

```bash
# --lf_yaml can specify the path of the llamafactory YAML parameter file used for training (optional)
# Default value is .cache/train_config.yaml
# --input-keys can specify which fields of the input JSON files to read
# Default value is text
dataflow text2model train
```

After fine-tuning completes, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py       # Pipeline execution file
├── text_2_qa_pipeline.py      # Text2QA generation pipeline
├── merge_filter_qa_pairs.py   # QA format conversion script
└── .cache/                    # Cache directory
    ├── train_config.yaml      # Default configuration file for llamafactory training
    ├── pt_input.jsonl         # Merged input data
    ├── data/
    │   ├── dataset_info.json
    │   └── qa.json
    ├── gpu/
    │   ├── text_input.jsonl                  # Text2QA input file (if using Text2QA)
    │   ├── text2qa_step_step1.json
    │   ├── text2qa_step_step2.json
    │   ├── text2qa_step_step3.json           # Text2QA output
    │   └── sft_dataflow_cache_step_*.jsonl   # SFT processing files
    └── saves/
        └── text2model_cache_{time}/
```

## **Step 6: Chat with Fine-tuned Model**

```bash
# Method 1: Specify the model path with the --model flag (optional)
# Default path: .cache/saves/text2model_cache_{timestamp}
dataflow chat --model ./custom_model_path

# Method 2: Navigate to the model directory and run dataflow chat
cd .cache/saves/text2model_cache_{timestamp}
dataflow chat
```

docs/zh/notes/guide/pipelines/Pdf2ModelPipeline.md

Lines changed: 4 additions & 5 deletions
@@ -114,16 +114,15 @@ dataflow pdf2model train
├── mineru/
│   └── sample-1-7/auto/
└── saves/
-    └── qwen2.5_7b_sft_model/
+    └── pdf2model_cache_{timestamp}/
```



## Step 6: Chat with the Fine-tuned Model

```
-# --model can specify the path of the chat model (optional)
-# Default value is .cache/saves/qwen2.5_7b_sft_model
-dataflow chat --model ./custom_model_path
+# Method 1: --model can specify the path of the chat model (optional)
+# Default value is .cache/saves/pdf2model_cache_{timestamp}
+# Method 2: go to the model directory and run dataflow chat
```
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
---
title: Text2ModelPipeline
createTime: 2025/08/31 03:42:26
permalink: /zh/guide/ndyvouo2/
---
# DataFlow-text2model & LlamaFactory

## Quick Start

```
# Environment setup
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e .
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]

# Model download
# At the first prompt, either option works; at the second prompt, select "all"
mineru-models-download

# Run the pipeline
cd ..
mkdir test
cd test

# Initialize
dataflow text2model init

# Train
dataflow text2model train

# Chat with the trained model (a locally trained model also works)
dataflow chat
```

## Step 1: Install the DataFlow Environment

```
# Create the environment
conda create -n dataflow python=3.10

# Activate the environment
conda activate dataflow

# Enter the root directory
cd DataFlow

# Install the mineru base environment
pip install -e .

# Install the llamafactory environment
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]
mineru-models-download
```

## Step 2: Create a New DataFlow Working Folder

```
mkdir run_dataflow
cd run_dataflow
```

## Step 3: Set Up the Dataset

Place an appropriately sized dataset (data files in JSON or JSONL format) into the working folder.

## Step 4: Initialize DataFlow-text2model

```
# Initialize
# --cache can specify the location of the .cache directory (optional)
# Default value is the current working directory
dataflow text2model init
```

After initialization, the project directory becomes:

```shell
Project Root/
├── sft_data_pipeline.py    # Pipeline execution file
└── .cache/                 # Cache directory
    └── train_config.yaml   # Default configuration file for llamafactory training
```

## Step 5: One-Click Fine-tuning

```
# --lf_yaml can specify the path of the llamafactory YAML parameter file used for training (optional)
# Default value is .cache/train_config.yaml
# --input-keys can specify which fields of the JSON files to read
# Default value is text
dataflow text2model train
```

After fine-tuning completes, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py    # Pipeline execution file
└── .cache/                 # Cache directory
    ├── train_config.yaml   # Default configuration file for llamafactory training
    ├── data/
    │   ├── dataset_info.json
    │   └── qa.json
    ├── gpu/
    │   ├── batch_cleaning_step_step1.json
    │   ├── batch_cleaning_step_step2.json
    │   ├── batch_cleaning_step_step3.json
    │   ├── batch_cleaning_step_step4.json
    │   └── text_list.jsonl
    ├── mineru/
    │   └── text_name/auto/
    └── saves/
        └── text2model_cache_{timestamp}/
```

## Step 6: Chat with the Fine-tuned Model

```
# Method 1: --model can specify the path of the chat model (optional)
# Default value is .cache/saves/text2model_cache_{timestamp}
# Method 2: go to the model directory and run dataflow chat
dataflow chat --model ./custom_model_path
```
