
Commit c1159e7

change kbc op param schema (#104)

Authored by ZhaoyangHan04
Co-authored-by: ZhaoyangHan04 <319926404@qq.com>

1 parent 49d8d5f · commit c1159e7

File tree: 4 files changed (+94, -98 lines)


docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md

Lines changed: 10 additions & 8 deletions
@@ -76,7 +76,7 @@ For each operator, the following sections will detail its invocation methods and
 
 ## Detailed Operator Specifications
 
-### 1. knowledge_extractor
+### 1. FileOrURLToMarkdownConverter
 
 **Functional Description**:
 
@@ -87,10 +87,12 @@ The Knowledge Extractor operator is a versatile document processing tool that su
 - `__init__()`
   - `intermediate_dir`: Intermediate file output directory (default: "intermediate")
   - `lang`: Document language (default: "ch" for Chinese)
-- `run()`
-  - `storage`: Data flow storage interface object (required)
   - `raw_file`: Local file path (mutually exclusive with url)
   - `url`: Web URL address (mutually exclusive with raw_file)
+
+- `run()`
+  - `storage`: Data flow storage interface object (required)
+
 
 **Key Features**:
 
@@ -116,14 +118,14 @@ The Knowledge Extractor operator is a versatile document processing tool that su
 **Usage Example**:
 
 ```python
-knowledge_extractor = KnowledgeExtractor(
+file_to_markdown_converter = FileOrURLToMarkdownConverter(
     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en"
+    lang="en",
+    mineru_backend="vlm-sglang-engine",
+    raw_file = raw_file,
 )
-extracted=knowledge_extractor.run(
+extracted=file_to_markdown_converter.run(
     storage=self.storage,
-    raw_file=raw_file,
-    url=url,
 )
 ```
 
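The net effect of this diff is a parameter-schema change: the input source (`raw_file`/`url`) and the MinerU backend choice now belong to the operator's constructor, while `run()` takes only `storage`. A minimal stand-in sketch of the new calling convention (a plain-Python mock, not the real DataFlow operator — the body just echoes a target path instead of performing any conversion):

```python
# Stand-in mock mirroring the new schema: source selection happens in
# __init__, and run() receives only the storage interface.
class FileOrURLToMarkdownConverter:
    def __init__(self, intermediate_dir="intermediate", lang="ch",
                 mineru_backend="vlm-sglang-engine",
                 raw_file=None, url=None):
        # raw_file and url are mutually exclusive; exactly one is required.
        if (raw_file is None) == (url is None):
            raise ValueError("exactly one of raw_file or url must be given")
        self.intermediate_dir = intermediate_dir
        self.lang = lang
        self.mineru_backend = mineru_backend
        self.source = raw_file if raw_file is not None else url

    def run(self, storage):
        # The real operator converts self.source to markdown via MinerU and
        # records it through `storage`; this mock just echoes the target path.
        return f"{self.intermediate_dir}/{self.source}.md"

converter = FileOrURLToMarkdownConverter(
    intermediate_dir="raw", lang="en", raw_file="test.pdf",
)
extracted = converter.run(storage=None)  # run() now takes only storage
```

Under the old schema, `raw_file`/`url` would have been arguments to `run()` instead; the move makes the operator's input fixed at construction time, matching the pipeline rewrite below.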

docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md

Lines changed: 38 additions & 41 deletions
@@ -24,7 +24,7 @@ The main workflow of the pipeline includes:
 
 ### 1. Information Extraction
 
-The first step of the pipeline is to extract textual knowledge from users' original documents or URLs using knowledge_extractor. This step is crucial as it converts various formats of raw documents into unified markdown text, facilitating subsequent cleaning processes.
+The first step of the pipeline is to extract textual knowledge from users' original documents or URLs using FileOrURLToMarkdownConverter. This step is crucial as it converts various formats of raw documents into unified markdown text, facilitating subsequent cleaning processes.
 
 > *Since `MinerU` is primarily deployed based on `SGLang`, the `open-dataflow[minerU]` environment mainly operates on `Dataflow[SGLang]`. Currently, there is no tutorial available for processing based on `Dataflow[vllm]`.*
 
@@ -138,16 +138,17 @@ PDF file extraction in this system is based on [MinerU](https://github.com/opend
 >
 > #### 5. Tool Usage
 >
-> The `KnowledgeExtractor` operator allows you to choose the desired backend engine of MinerU.
+> The `FileOrURLToMarkdownConverter` operator allows you to choose the desired backend engine of MinerU.
 >
 > * If using `MinerU1`: set the `MinerU_Backend` parameter to `"pipeline"`, which uses the traditional pipeline approach.
 > * If using `MinerU2` **(recommended by default)**: set the `MinerU_Backend` parameter to `"vlm-sglang-engine"` to enable the vision-language model engine.
 >
 > ```python
-> KnowledgeExtractor(
->     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
->     lang="en",
->     MinerU_Backend="vlm-sglang-engine",
+> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter(
+>     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
+>     lang="en",
+>     mineru_backend="vlm-sglang-engine",
+>     raw_file = raw_file,
 > )
 > ```
 >
@@ -160,15 +161,14 @@ PDF file extraction in this system is based on [MinerU](https://github.com/opend
 **Example**:
 
 ```python
-knowledge_extractor = KnowledgeExtractor(
+file_to_markdown_converter = FileOrURLToMarkdownConverter(
     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en"
-    MinerU_Backend="vlm-sglang-engine",
+    lang="en",
+    mineru_backend="vlm-sglang-engine",
+    raw_file = raw_file,
 )
-extracted=knowledge_extractor.run(
+extracted=file_to_markdown_converter.run(
     storage=self.storage,
-    raw_file=raw_file,
-    url=url,
 )
 ```
 
@@ -283,22 +283,29 @@ from dataflow.operators.generate import (
     MultiHopQAGenerator,
 )
 from dataflow.utils.storage import FileStorage
-from dataflow.serving import LocalModelLLMServing_vllm
+from dataflow.serving import APILLMServing_request
 
-class KBCleaningPipeline():
-    def __init__(self):
+class KBCleaningPDF_APIPipeline():
+    def __init__(self, url:str=None, raw_file:str=None):
 
         self.storage = FileStorage(
             first_entry_file_name="../example_data/KBCleaningPipeline/kbc_placeholder.json",
-            cache_path="./.cache/gpu",
+            cache_path="./.cache/api",
             file_name_prefix="pdf_cleaning_step",
             cache_type="json",
         )
 
+        self.llm_serving = APILLMServing_request(
+            api_url="https://api.openai.com/v1/chat/completions",
+            model_name="gpt-4o",
+            max_workers=100
+        )
+
         self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter(
             intermediate_dir="../example_data/KBCleaningPipeline/raw/",
             lang="en",
             mineru_backend="vlm-sglang-engine",
+            raw_file = raw_file,
         )
 
         self.knowledge_cleaning_step2 = CorpusTextSplitter(
@@ -307,37 +314,27 @@ class KBCleaningPipeline():
             tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
         )
 
-    def forward(self, url:str=None, raw_file:str=None):
+        self.knowledge_cleaning_step3 = KnowledgeCleaner(
+            llm_serving=self.llm_serving,
+            lang="en"
+        )
+
+        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
+            llm_serving=self.llm_serving,
+            lang="en"
+        )
+
+    def forward(self):
         extracted=self.knowledge_cleaning_step1.run(
             storage=self.storage,
-            raw_file=raw_file,
-            url=url,
         )
-
+
         self.knowledge_cleaning_step2.run(
             storage=self.storage.step(),
             input_file=extracted,
             output_key="raw_content",
         )
 
-        local_llm_serving = LocalModelLLMServing_vllm(
-            hf_model_name_or_path="Qwen/Qwen2.5-7B-Instruct",
-            vllm_max_tokens=2048,
-            vllm_tensor_parallel_size=4,
-            vllm_gpu_memory_utilization=0.6,
-            vllm_repetition_penalty=1.2
-        )
-
-        self.knowledge_cleaning_step3 = KnowledgeCleaner(
-            llm_serving=local_llm_serving,
-            lang="en"
-        )
-
-        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
-            llm_serving=local_llm_serving,
-            lang="en"
-        )
-
         self.knowledge_cleaning_step3.run(
             storage=self.storage.step(),
             input_key= "raw_content",
@@ -348,9 +345,9 @@ class KBCleaningPipeline():
         input_key="cleaned",
         output_key="MultiHop_QA"
     )
-
+
 if __name__ == "__main__":
-    model = KBCleaningPipeline()
-    model.forward(raw_file="../example_data/KBCleaningPipeline/test.pdf")
+    model = KBCleaningPDF_APIPipeline(raw_file="../example_data/KBCleaningPipeline/test.pdf")
+    model.forward()
 ```
 
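Structurally, this commit moves every operator object (and the shared API serving instance) into the pipeline's `__init__`, so `forward()` no longer takes `raw_file`/`url` and is reduced to chaining the stage `run()` calls in order. A toy mock of that new entry-point shape (not the real DataFlow classes — stages here are just recorded names):

```python
# Toy mock of the restructured pipeline: the input source and all stage
# objects are fixed at construction time; forward() only sequences them.
class KBCleaningPDF_APIPipeline:
    def __init__(self, url=None, raw_file=None):
        self.source = raw_file if raw_file is not None else url
        # In the real pipeline these are operator objects sharing one
        # LLM-serving instance; here each stage is just a name.
        self.stages = ["convert", "split", "clean", "multihop_qa"]

    def forward(self):
        # No arguments: everything forward() needs was set in __init__.
        return [f"{stage}:{self.source}" for stage in self.stages]

pipeline = KBCleaningPDF_APIPipeline(raw_file="test.pdf")
results = pipeline.forward()
```

Compared with the old shape (`forward(url=..., raw_file=...)` plus operators built mid-`forward`), this makes each pipeline instance bound to one input and lets the same serving object be constructed once and reused by the cleaning and QA stages.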

docs/zh/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md

Lines changed: 9 additions & 9 deletions
@@ -8,7 +8,7 @@ permalink: /zh/guide/Knowledgebase_QA_operators/
 
 ## Overview
 
-The knowledge base cleaning operators are designed for extracting, organizing, and refining knowledge bases for downstream tasks such as RAG, RARE, and RAFT. They mainly include the **Knowledge Extraction operator (KnowledgeExtractor)**, the **Corpus Chunking operator (CorpusTextSpliiter)**, the **Knowledge Cleaning operator (KnowledgeCleaner)**, and the **Multi-Hop QA Generation Operator**. These operators can organize files in a variety of raw formats, crawl the web content of specified URLs, and turn this textual knowledge into a readable, easy-to-use, and safe RAG knowledge base.
+The knowledge base cleaning operators are designed for extracting, organizing, and refining knowledge bases for downstream tasks such as RAG, RARE, and RAFT. They mainly include the **Knowledge Extraction operator (FileOrURLToMarkdownConverter)**, the **Corpus Chunking operator (CorpusTextSpliiter)**, the **Knowledge Cleaning operator (KnowledgeCleaner)**, and the **Multi-Hop QA Generation Operator**. These operators can organize files in a variety of raw formats, crawl the web content of specified URLs, and turn this textual knowledge into a readable, easy-to-use, and safe RAG knowledge base.
 
 The operator badges in this document are inherited from the [strong reasoning operators](https://opendcai.github.io/DataFlow-Doc/zh/guide/Reasoning_operators/).
 
@@ -72,7 +72,7 @@ self.storage = FileStorage(
 
 ## Detailed Operator Descriptions
 
-### 1. KnowledgeExtractor
+### 1. FileOrURLToMarkdownConverter
 
 **Functional Description**:
 
@@ -83,11 +83,11 @@ self.storage = FileStorage(
 - `__init__()`
   - `intermediate_dir`: Intermediate file output directory (default: "intermediate")
   - `lang`: Document language (default: "ch" for Chinese)
+  - `raw_file`: Local file path (mutually exclusive with url)
+  - `url`: Web URL address (mutually exclusive with raw_file)
 
 - `run()`
   - `storage`: Data flow storage interface object (required)
-  - `raw_file`: Local file path (mutually exclusive with url)
-  - `url`: Web URL address (mutually exclusive with raw_file)
 
 **Key Features**:
 
@@ -116,14 +116,14 @@ self.storage = FileStorage(
 **Usage Example:**
 
 ```python
-knowledge_extractor = KnowledgeExtractor(
+file_to_markdown_converter = FileOrURLToMarkdownConverter(
     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en"
+    lang="en",
+    mineru_backend="vlm-sglang-engine",
+    raw_file = raw_file,
 )
-extracted=knowledge_extractor.run(
+extracted=file_to_markdown_converter.run(
     storage=self.storage,
-    raw_file=raw_file,
-    url=url,
 )
 ```
 

docs/zh/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md

Lines changed: 37 additions & 40 deletions
@@ -134,16 +134,17 @@ pip install -e .[mineru]
 >
 > #### 5. Tool Usage
 >
-> The `KnowledgeExtractor` operator exposes a MinerU version selection interface, letting users pick the backend engine that fits their needs.
+> The `FileOrURLToMarkdownConverter` operator exposes a MinerU version selection interface, letting users pick the backend engine that fits their needs.
 >
 > * If you use `MinerU1`: set the `MinerU_Backend` parameter to `"pipeline"`. This enables the traditional pipeline processing mode.
 > * If you use `MinerU2` **(recommended by default)**: set the `MinerU_Backend` parameter to `"vlm-sglang-engine"`. This enables the new engine based on a multimodal language model.
 >
 > ```bash
-> KnowledgeExtractor(
->     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
->     lang="en",
->     MinerU_Backend="vlm-sglang-engine",
+> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter(
+>     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
+>     lang="en",
+>     mineru_backend="vlm-sglang-engine",
+>     raw_file = raw_file,
 > )
 > ```
 >
@@ -154,15 +155,14 @@ pip install -e .[mineru]
 **Example**:
 
 ```python
-knowledge_extractor = KnowledgeExtractor(
+file_to_markdown_converter = FileOrURLToMarkdownConverter(
     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en"
-    MinerU_Backend="vlm-sglang-engine",
+    lang="en",
+    mineru_backend="vlm-sglang-engine",
+    raw_file = raw_file,
 )
-extracted=knowledge_extractor.run(
+extracted=file_to_markdown_converter.run(
     storage=self.storage,
-    raw_file=raw_file,
-    url=url,
 )
 ```
 
@@ -273,22 +273,29 @@ from dataflow.operators.generate import (
     MultiHopQAGenerator,
 )
 from dataflow.utils.storage import FileStorage
-from dataflow.serving import LocalModelLLMServing_vllm
+from dataflow.serving import APILLMServing_request
 
-class KBCleaningPipeline():
-    def __init__(self):
+class KBCleaningPDF_APIPipeline():
+    def __init__(self, url:str=None, raw_file:str=None):
 
         self.storage = FileStorage(
             first_entry_file_name="../example_data/KBCleaningPipeline/kbc_placeholder.json",
-            cache_path="./.cache/gpu",
+            cache_path="./.cache/api",
             file_name_prefix="pdf_cleaning_step",
             cache_type="json",
         )
 
+        self.llm_serving = APILLMServing_request(
+            api_url="https://api.openai.com/v1/chat/completions",
+            model_name="gpt-4o",
+            max_workers=100
+        )
+
         self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter(
             intermediate_dir="../example_data/KBCleaningPipeline/raw/",
             lang="en",
             mineru_backend="vlm-sglang-engine",
+            raw_file = raw_file,
         )
 
         self.knowledge_cleaning_step2 = CorpusTextSplitter(
@@ -297,37 +304,27 @@ class KBCleaningPipeline():
             tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
         )
 
-    def forward(self, url:str=None, raw_file:str=None):
+        self.knowledge_cleaning_step3 = KnowledgeCleaner(
+            llm_serving=self.llm_serving,
+            lang="en"
+        )
+
+        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
+            llm_serving=self.llm_serving,
+            lang="en"
+        )
+
+    def forward(self):
         extracted=self.knowledge_cleaning_step1.run(
             storage=self.storage,
-            raw_file=raw_file,
-            url=url,
         )
-
+
         self.knowledge_cleaning_step2.run(
             storage=self.storage.step(),
             input_file=extracted,
             output_key="raw_content",
         )
 
-        local_llm_serving = LocalModelLLMServing_vllm(
-            hf_model_name_or_path="Qwen/Qwen2.5-7B-Instruct",
-            vllm_max_tokens=2048,
-            vllm_tensor_parallel_size=4,
-            vllm_gpu_memory_utilization=0.6,
-            vllm_repetition_penalty=1.2
-        )
-
-        self.knowledge_cleaning_step3 = KnowledgeCleaner(
-            llm_serving=local_llm_serving,
-            lang="en"
-        )
-
-        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
-            llm_serving=local_llm_serving,
-            lang="en"
-        )
-
         self.knowledge_cleaning_step3.run(
             storage=self.storage.step(),
             input_key= "raw_content",
@@ -338,8 +335,8 @@ class KBCleaningPipeline():
         input_key="cleaned",
         output_key="MultiHop_QA"
     )
-
+
 if __name__ == "__main__":
-    model = KBCleaningPipeline()
-    model.forward(raw_file="../example_data/KBCleaningPipeline/test.pdf")
+    model = KBCleaningPDF_APIPipeline(raw_file="../example_data/KBCleaningPipeline/test.pdf")
+    model.forward()
 ```
