
Commit 7dbd55f

ZhaoyangHan04 and co-author

update kbc ops name and param schema (#113)

* change kbc op param schema
* kbc rename doc update

Co-authored-by: ZhaoyangHan04 <319926404@qq.com>
1 parent 217e633 commit 7dbd55f

File tree

6 files changed: +86 -74 lines

docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md

Lines changed: 14 additions & 14 deletions
@@ -21,10 +21,10 @@ The Knowledge Base Cleaning Operator can perform extraction, organization, and c
 | Name | Applicable Type | Description | Official Repository/Paper |
 | --------------------- | :-------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
-| KnowledgeExtractor🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - |
-| CorpusTextSplitter | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - |
-| KnowledgeCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - |
-| MultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD) |
+| FileOrURLToMarkdownConverter🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - |
+| KBCChunkGenerator | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - |
+| KBCTextCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - |
+| KBCMultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD) |
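For readers updating existing scripts against this commit, the table above amounts to a four-entry rename map. The mapping below uses only names taken from the table; the `migrate_source` helper itself is a hypothetical convenience for migration, not part of DataFlow.

```python
# Old (pre-#113) operator names mapped to their new KBC-prefixed names.
RENAMED_OPERATORS = {
    "KnowledgeExtractor": "FileOrURLToMarkdownConverter",
    "CorpusTextSplitter": "KBCChunkGenerator",
    "KnowledgeCleaner": "KBCTextCleaner",
    "MultiHopQAGenerator": "KBCMultiHopQAGenerator",
}

def migrate_source(src: str) -> str:
    """Rewrite old operator names in a pipeline script that still uses
    the pre-rename API. Hypothetical helper; apply only to old scripts."""
    for old, new in RENAMED_OPERATORS.items():
        src = src.replace(old, new)
    return src
```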
@@ -80,7 +80,7 @@ For each operator, the following sections will detail its invocation methods and
 **Functional Description**:

-The Knowledge Extractor operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion.
+The FileOrURLToMarkdownConverter operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion. Code: [FileOrURLToMarkdownConverter](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/file_or_url_to_markdown_converter.py)

 **Input Parameters**:
@@ -130,10 +130,10 @@ extracted=file_to_markdown_converter.run(
 ```

-### 2. CorpusTextSplitter
+### 2. KBCChunkGenerator

 **Functional Description**:
-CorpusTextSplitter is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications.
+KBCChunkGenerator is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications. Code: [KBCChunkGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_chunk_generator.py)

 **Input Parameters**:
@@ -169,7 +169,7 @@ extracted=file_to_markdown_converter.run(
 **Usage Example**:

 ```python
-text_splitter = CorpusTextSplitter(
+text_splitter = KBCChunkGenerator(
     split_method="token",
     chunk_size=512,
     tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
@@ -182,10 +182,10 @@ text_splitter.run(
 ```

-### 3. KnowledgeCleaner
+### 3. KBCTextCleaner

 **Functional Description**:
-KnowledgeCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases.
+KBCTextCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases. Code: [KBCTextCleaner](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_text_cleaner.py)

 **Input Parameters**:
@@ -217,7 +217,7 @@ KnowledgeCleaner is a professional knowledge cleaning operator specifically desi
 **Usage Example**:

 ```python
-knowledge_cleaner = KnowledgeCleaner(
+knowledge_cleaner = KBCTextCleaner(
     llm_serving=api_llm_serving,
     lang="en"
 )
@@ -228,9 +228,9 @@ extracted_path = knowledge_cleaner.run(
 )
 ```

-### 4. MultiHopQAGenerator
+### 4. KBCMultiHopQAGenerator

-**Function Description**: MultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets.
+**Function Description**: KBCMultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets. Code: [KBCMultiHopQAGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_multihop_qa_generator.py)

 **Input Parameters**:
@@ -268,7 +268,7 @@ extracted_path = knowledge_cleaner.run(
 **Usage Example**:

 ```python
-multi_hop_qa_generator = MultiHopQAGenerator(
+multi_hop_qa_generator = KBCMultiHopQAGenerator(
     llm_serving=local_llm_serving,
     lang="en"
 )

docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md

Lines changed: 22 additions & 17 deletions
@@ -176,15 +176,15 @@ extracted=file_to_markdown_converter.run(
 ### 2. Text Chunking

-After document extraction, the text chunking step (CorpusTextSplitter) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.
+After document extraction, the text chunking step (KBCChunkGenerator) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.

 **Input**: Extracted Markdown text
 **Output**: Chunked JSON file

 **Example**:

 ```python
-text_splitter = CorpusTextSplitter(
+text_splitter = KBCChunkGenerator(
     split_method="token",
     chunk_size=512,
     tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
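The chunking step above splits text by token with `chunk_size=512`. As a rough illustration of fixed-size token chunking (a simplified sketch only, not the actual KBCChunkGenerator implementation):

```python
def chunk_by_tokens(tokens, chunk_size=512):
    """Split a token list into consecutive fixed-size chunks; the last
    chunk may be shorter. Illustrative sketch, not DataFlow code."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
```

In the real operator, `tokenizer_name` selects the tokenizer that produces these tokens before splitting.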
@@ -198,13 +198,13 @@ text_splitter.run(
 ### 3. Knowledge Cleaning

-After text chunking, the Knowledge Cleaning step (KnowledgeCleaner) specializes in standardizing raw knowledge content for RAG (Retrieval-Augmented Generation) systems. This process utilizes large language model interfaces to intelligently clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.
+After text chunking, the Knowledge Cleaning step (KBCTextCleaner) specializes in standardizing raw knowledge content for RAG (Retrieval-Augmented Generation) systems. This process utilizes large language model interfaces to intelligently clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.

 **Input**: Chunked JSON file
 **Output**: Cleaned JSON file

 ```python
-knowledge_cleaner = KnowledgeCleaner(
+knowledge_cleaner = KBCTextCleaner(
     llm_serving=api_llm_serving,
     lang="en"
 )
@@ -217,15 +217,15 @@ extracted_path = knowledge_cleaner.run(
 ### 4. QA Generation

-After knowledge cleaning, the MultiHop-QA Generation step (MultiHopQAGenerator) specializes in automatically generating multi-step reasoning question-answer pairs from text data. This process uses large language model interfaces for intelligent text analysis and complex question construction, suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.
+After knowledge cleaning, the MultiHop-QA Generation step (KBCMultiHopQAGenerator) specializes in automatically generating multi-step reasoning question-answer pairs from text data. This process uses large language model interfaces for intelligent text analysis and complex question construction, suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.

 **Input**: JSON-formatted plain text
 **Output**: For each text segment, generates a set of multi-hop QAs (output in JSON format)

 **Usage Example**:

 ```python
-multi_hop_qa_generator = MultiHopQAGenerator(
+multi_hop_qa_generator = KBCMultiHopQAGenerator(
     llm_serving=local_llm_serving,
     lang="en"
 )
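The operator docs describe a three-sentence sliding window over the cleaned text as the basis for multi-hop QA generation. A minimal sketch of that windowing idea (an illustration only; the real operator's logic lives in the linked DataFlow source):

```python
def sliding_windows(sentences, size=3):
    """Yield overlapping windows of `size` consecutive sentences,
    mirroring the three-sentence sliding window described above.
    Inputs shorter than `size` yield a single (shorter) window."""
    for i in range(max(len(sentences) - size + 1, 1)):
        yield sentences[i:i + size]
```

Each window would then be handed to the LLM to construct one multi-step reasoning QA pair.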
@@ -261,26 +261,31 @@ Users can execute the following scripts to meet different data requirements. Not
 - Knowledge base cleaning and construction for PDF files:

 ```shell
-python gpu_pipelines/kbcleaning_pipeline_pdf_vllm.py
-python gpu_pipelines/kbcleaning_pipeline_pdf_sglang.py
+python gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_vllm.py
+python gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_sglang.py
 ```
+[kbcleaning_pipeline_pdf_vllm.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_vllm.py)
+[kbcleaning_pipeline_pdf_sglang.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_sglang.py)

 - Knowledge base cleaning and construction after URL crawling:

 ```shell
-python gpu_pipelines/kbcleaning_pipeline_url_vllm.py
-python gpu_pipelines/kbcleaning_pipeline_url_sglang.py
+python gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_vllm.py
+python gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_sglang.py
 ```
+[kbcleaning_pipeline_url_vllm.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_vllm.py)
+[kbcleaning_pipeline_url_sglang.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_sglang.py)

 ## 4. Pipeline Example

 The following provides an example pipeline configured for the `Dataflow[vllm]` environment, demonstrating how to use multiple operators for knowledge base cleaning. This example shows how to initialize a knowledge base cleaning pipeline and sequentially execute each extraction and cleaning step.

 ```python
-from dataflow.operators.generate import (
-    CorpusTextSplitter,
+from dataflow.operators.knowledge_cleaning import (
+    KBCChunkGenerator,
     FileOrURLToMarkdownConverter,
-    KnowledgeCleaner,
-    MultiHopQAGenerator,
+    KBCTextCleaner,
+    KBCMultiHopQAGenerator,
 )
 from dataflow.utils.storage import FileStorage
 from dataflow.serving import APILLMServing_request
@@ -308,18 +313,18 @@ class KBCleaningPDF_APIPipeline():
             raw_file = raw_file,
         )

-        self.knowledge_cleaning_step2 = CorpusTextSplitter(
+        self.knowledge_cleaning_step2 = KBCChunkGenerator(
             split_method="token",
             chunk_size=512,
             tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
         )

-        self.knowledge_cleaning_step3 = KnowledgeCleaner(
+        self.knowledge_cleaning_step3 = KBCTextCleaner(
             llm_serving=self.llm_serving,
             lang="en"
         )

-        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
+        self.knowledge_cleaning_step4 = KBCMultiHopQAGenerator(
             llm_serving=self.llm_serving,
             lang="en"
         )
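The pipeline class above wires four steps in order: convert, chunk, clean, and generate QA, with each stage consuming the previous stage's output. A minimal, hypothetical sketch of that sequential-stage pattern (the real `run()` signatures and FileStorage wiring are in the DataFlow repository, not shown here):

```python
class SequentialPipelineSketch:
    """Hypothetical illustration of the four-stage knowledge-base
    cleaning flow: each stage consumes the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages  # ordered: convert -> chunk -> clean -> QA

    def forward(self, data):
        for stage in self.stages:
            data = stage(data)
        return data
```

In the actual pipeline, the four `knowledge_cleaning_step*` operators play the role of the stages, passing intermediate files through shared storage instead of return values.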

docs/en/notes/guide/quickstart/knowledge_cleaning.md

Lines changed: 6 additions & 5 deletions
@@ -46,21 +46,22 @@ dataflow init
 Enter the script directory:

 ```shell
-cd gpu_pipelines/
+cd gpu_pipelines/kbcleaning
 ```

 ## Step 4: One-click execution

 ```bash
 python kbcleaning_pipeline_batch_sglang.py
 ```
+[code](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_batch_sglang.py)

 During execution, this pipeline will sequentially call:

 1. FileOrURLToMarkdownConverter Converts original files/URLs into Markdown
-2. CorpusTextSplitter Segments the text into chunks
-3. KnowledgeCleaner Performs comprehensive cleaning on the segmented text
-4. MultiHopQAGenerator Synthesizes QA data based on the cleaned knowledge
+2. KBCChunkGenerator Segments the text into chunks
+3. KBCTextCleaner Performs comprehensive cleaning on the segmented text
+4. KBCMultiHopQAGenerator Synthesizes QA data based on the cleaned knowledge

 For detailed descriptions of each operator, refer to the "Knowledge Base Cleaning and QA Generation" section. Once executed, a JSON file will be generated in the `.cache` directory with contents as shown below.
