
Commit 217652f

pdf2model&text2model-v2 (#109)
Co-authored-by: Ma, Xiaochen <mxch1122@126.com>
1 parent dde6f7d commit 217652f

File tree

4 files changed (+300, -8 lines)


docs/en/notes/guide/pipelines/Pdf2ModelPipeline.md

Lines changed: 7 additions & 3 deletions
@@ -121,10 +121,14 @@ Project Root/
-## Step 6: Chat with Fine-tuned Model
+## **Step 6: Chat with Fine-tuned Model**

```bash
-# --model can specify the path location of the chat model (optional)
-# Default value is .cache/saves/qwen2.5_7b_sft_model
+# Method 1: Specify model path with --model flag (optional)
+# Default path: .cache/saves/pdf2model_cache_{timestamp}
dataflow chat --model ./custom_model_path
+
+# Method 2: Navigate to model directory and run dataflow chat
+cd .cache/saves/pdf2model_cache_20250901_143022
+dataflow chat
```
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
---
title: Text2ModelPipeline
createTime: 2025/08/31 03:42:49
permalink: /en/guide/uw6hfcwp/
---
# DataFlow-text2model & LlamaFactory

A complete text processing and training pipeline with intelligent Text2QA generation capabilities.

## Quick Start

```bash
# Environment setup
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e .
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]

# Model download
# At the first prompt, either option works; at the second prompt, select "all"
mineru-models-download

# Run the pipeline
cd ..
mkdir test
cd test

# Initialize
dataflow text2model init

# Train
dataflow text2model train

# Chat with the trained model (a locally trained model also works)
dataflow chat
```

## Step 1: Install DataFlow Environment

```bash
# Create environment
conda create -n dataflow python=3.10

# Activate environment
conda activate dataflow

# Enter root directory
cd DataFlow

# Install mineru base environment
pip install -e .

# Install llamafactory environment
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]
mineru-models-download
```

## Step 2: Create New DataFlow Working Folder

```bash
mkdir run_dataflow
cd run_dataflow
```

## Step 3: Setup Dataset

Place an appropriately sized dataset (data files in JSON or JSONL format) into the working folder.
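As a minimal sketch of what such a file might look like (the file name `corpus.jsonl` and its contents are hypothetical), each record only needs the field that training will read, which defaults to `text` (see `--input-keys` in Step 5):

```bash
# Hypothetical example: create a tiny JSONL dataset in the working folder.
# Each line is one JSON object; the default field read during training is "text".
cat > corpus.jsonl << 'EOF'
{"text": "DataFlow is a system for preparing training data for LLMs."}
{"text": "Text2QA turns raw text passages into question-answer pairs for SFT."}
EOF
```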

## Step 4: Initialize DataFlow-text2model

```bash
# Initialize
# --cache can specify the .cache directory location (optional)
# Default value is the current working directory
dataflow text2model init
```
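For instance, to keep the cache somewhere other than the working folder, the `--cache` flag described above can be passed explicitly. This is a sketch assuming `--cache` takes the directory path as its value; the path itself is just an illustration:

```bash
# Hypothetical usage: place the cache directory at a custom location.
dataflow text2model init --cache ./my_dataflow_cache
```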

After initialization, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py       # Pipeline execution file
├── text_2_qa_pipeline.py      # Text2QA generation pipeline
├── merge_filter_qa_pairs.py   # QA format conversion script
└── .cache/                    # Cache directory
    └── train_config.yaml      # Default configuration file for llamafactory training
```

## Step 5: One-Click Fine-tuning

```bash
# --lf_yaml can specify the path of the llamafactory YAML parameter file used for training (optional)
# Default value is .cache/train_config.yaml
# --input-keys can specify which fields of the input JSON files to read
# Default value is text
dataflow text2model train
```

After fine-tuning completes, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py       # Pipeline execution file
├── text_2_qa_pipeline.py      # Text2QA generation pipeline
├── merge_filter_qa_pairs.py   # QA format conversion script
└── .cache/                    # Cache directory
    ├── train_config.yaml      # Default configuration file for llamafactory training
    ├── pt_input.jsonl         # Merged input data
    ├── data/
    │   ├── dataset_info.json
    │   └── qa.json
    ├── gpu/
    │   ├── text_input.jsonl                  # Text2QA input file (if using Text2QA)
    │   ├── text2qa_step_step1.json
    │   ├── text2qa_step_step2.json
    │   ├── text2qa_step_step3.json           # Text2QA output
    │   └── sft_dataflow_cache_step_*.jsonl   # SFT processing files
    └── saves/
        └── text2model_cache_{time}/
```

## **Step 6: Chat with Fine-tuned Model**

```bash
# Method 1: Specify the model path with the --model flag (optional)
# Default path: .cache/saves/text2model_cache_{timestamp}
dataflow chat --model ./custom_model_path

# Method 2: Navigate to the model directory and run dataflow chat
cd .cache/saves/text2model_cache_{timestamp}
dataflow chat
```

docs/zh/notes/guide/pipelines/Pdf2ModelPipeline.md

Lines changed: 4 additions & 5 deletions
@@ -114,16 +114,15 @@ dataflow pdf2model train
├── mineru/
│   └── sample-1-7/auto/
└── saves/
-    └── qwen2.5_7b_sft_model/
+    └── pdf2model_cache_{timestamp}/
```



## Step 6: Chat with the Fine-tuned Model

```
-# --model can specify the path of the chat model (optional)
-# Default value is .cache/saves/qwen2.5_7b_sft_model
-dataflow chat --model ./custom_model_path
+# Method 1: --model can specify the path of the chat model (optional)
+# Default value is .cache/saves/pdf2model_cache_{timestamp}
+# Method 2: go to the model directory and run dataflow chat
```
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
---
title: Text2ModelPipeline
createTime: 2025/08/31 03:42:26
permalink: /zh/guide/ndyvouo2/
---
# DataFlow-text2model & LlamaFactory

## Quick Start

```
# Environment setup
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e .
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]

# Model download
# At the first prompt, either option works; at the second prompt, select "all"
mineru-models-download

# Run the pipeline
cd ..
mkdir test
cd test

# Initialize
dataflow text2model init

# Train
dataflow text2model train

# Chat with the trained model (a locally trained model also works)
dataflow chat
```

## Step 1: Install the DataFlow Environment

```
# Create the environment
conda create -n dataflow python=3.10

# Activate the environment
conda activate dataflow

# Enter the root directory
cd DataFlow

# Install the mineru base environment
pip install -e .

# Install the llamafactory environment
pip install llamafactory[torch,metrics]
pip install open-dataflow[vllm]
mineru-models-download
```

## Step 2: Create a New DataFlow Working Folder

```
mkdir run_dataflow
cd run_dataflow
```

## Step 3: Set Up the Dataset

Place an appropriately sized dataset (data files in JSON or JSONL format) into the working folder.

## Step 4: Initialize DataFlow-text2model

```
# Initialize
# --cache can specify the location of the .cache directory (optional)
# Default value is the current working directory
dataflow text2model init
```

After initialization, the project directory becomes:

```shell
Project Root/
├── sft_data_pipeline.py    # Pipeline execution file
└── .cache/                 # Cache directory
    └── train_config.yaml   # Default configuration file for llamafactory training
```

## Step 5: One-Click Fine-tuning

```
# --lf_yaml can specify the path of the llamafactory YAML parameter file used for training (optional)
# Default value is .cache/train_config.yaml
# --input-keys can specify which fields of the JSON files to read
# Default value is text
dataflow text2model train
```

After fine-tuning completes, the project directory becomes:

```
Project Root/
├── sft_data_pipeline.py    # Pipeline execution file
└── .cache/                 # Cache directory
    ├── train_config.yaml   # Default configuration file for llamafactory training
    ├── data/
    │   ├── dataset_info.json
    │   └── qa.json
    ├── gpu/
    │   ├── batch_cleaning_step_step1.json
    │   ├── batch_cleaning_step_step2.json
    │   ├── batch_cleaning_step_step3.json
    │   ├── batch_cleaning_step_step4.json
    │   └── text_list.jsonl
    ├── mineru/
    │   └── text_name/auto/
    └── saves/
        └── text2model_cache_{timestamp}/
```

## Step 6: Chat with the Fine-tuned Model

```
# Method 1: --model can specify the path of the chat model (optional)
# Default value is .cache/saves/text2model_cache_{timestamp}
# Method 2: go to the model directory and run dataflow chat
dataflow chat --model ./custom_model_path
```
