-|KnowledgeExtractor🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - |
-|CorpusTextSplitter✨ | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - |
-|KnowledgeCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - |
-|MultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. |[MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD)|
+|FileOrURLToMarkdownConverter🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - |
+|KBCChunkGenerator✨ | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - |
+|KBCTextCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - |
+|KBCMultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. |[MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD)|
@@ -80,7 +80,7 @@ For each operator, the following sections will detail its invocation methods and
**Functional Description**:

-The Knowledge Extractor operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion.
+The FileOrURLToMarkdownConverter operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion. Code: [FileOrURLToMarkdownConverter](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/file_or_url_to_markdown_converter.py)
-CorpusTextSplitter is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications.
+KBCChunkGenerator is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications. Code: [KBCChunkGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_chunk_generator.py)
-KnowledgeCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases.
+KBCTextCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases. Code: [KBCTextCleaner](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_text_cleaner.py)

**Input Parameters**:
@@ -217,7 +217,7 @@ KnowledgeCleaner is a professional knowledge cleaning operator specifically desi
-**Function Description**: MultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets.
+**Function Description**: KBCMultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets. Code: [KBCMultiHopQAGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_multihop_qa_generator.py)
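
The operator table in this change describes the multi-hop QA operator as using a three-sentence sliding window. The following is a minimal sketch of that windowing idea only, not the code in `kbc_multihop_qa_generator.py`. The function name `sentence_windows`, the regex-based sentence splitter, and the stride of one sentence are all assumptions made for illustration; the LLM prompting step is omitted.

```python
import re
from typing import List


def sentence_windows(text: str, window: int = 3, stride: int = 1) -> List[str]:
    """Return overlapping groups of `window` consecutive sentences.

    Hypothetical helper: sentences are split naively after '.', '!' or '?'
    followed by whitespace, which is an assumption, not the operator's logic.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= window:
        # Too short to slide: return the whole text as one window (or nothing).
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + window])
        for i in range(0, len(sentences) - window + 1, stride)
    ]


if __name__ == "__main__":
    for w in sentence_windows("A. B. C. D."):
        print(w)
```

Each window could then be passed to an LLM to draft a question whose answer needs more than one of the sentences in that window.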
-After document extraction, the text chunking step (CorpusTextSplitter) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.
+After document extraction, the text chunking step (KBCChunkGenerator) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.
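
To make the chunking dimensions concrete, here is a minimal, library-free sketch of character, token, and sentence chunking. It is not the KBCChunkGenerator implementation: the function names are hypothetical, whitespace splitting stands in for a real tokenizer, and semantic chunking is left out because it needs an embedding model.

```python
import re
from typing import List


def chunk_by_chars(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_by_tokens(text: str, chunk_size: int) -> List[str]:
    """Group whitespace-separated tokens (a stand-in for a real tokenizer)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def chunk_by_sentences(text: str, sentences_per_chunk: int) -> List[str]:
    """Split after '.', '!' or '?' plus whitespace, then group sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


if __name__ == "__main__":
    sample = "RAG needs chunks. Chunks should be small. Small chunks index well."
    print(chunk_by_chars(sample, 20))
    print(chunk_by_tokens(sample, 4))
    print(chunk_by_sentences(sample, 2))
```

Token-based chunking keeps each chunk within a model's context budget, while sentence-based chunking avoids cutting a statement in half; which trade-off matters more depends on the downstream index.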

**Input**: Extracted Markdown text
**Output**: Chunked JSON file

**Example**:

```python
-text_splitter = CorpusTextSplitter(
+text_splitter = KBCChunkGenerator(
     split_method="token",
     chunk_size=512,
     tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
@@ -198,13 +198,13 @@ text_splitter.run(

### 3. Knowledge Cleaning

-After text chunking, the Knowledge Cleaning (KnowledgeCleaner) specializes in standardizing raw knowledge content for RAG (Retrieval-Augmented Generation) systems. This process utilizes large language model interfaces to intelligently clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.
+After text chunking, the Knowledge Cleaning (KBCTextCleaner) specializes in standardizing raw knowledge content for RAG (Retrieval-Augmented Generation) systems. This process utilizes large language model interfaces to intelligently clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.
-After knowledge cleaning, the MultiHop-QA Generation (MultiHopQAGenerator) specializes in automatically generating multi-step reasoning question-answer pairs from text data. This process uses large language model interfaces for intelligent text analysis and complex question construction, suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.
+After knowledge cleaning, the MultiHop-QA Generation (KBCMultiHopQAGenerator) specializes in automatically generating multi-step reasoning question-answer pairs from text data. This process uses large language model interfaces for intelligent text analysis and complex question construction, suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.

**Input**: JSON-formatted plain text
**Output**: For each text segment, generates a set of multi-hop QAs (output in JSON format)

**Usage Example**:

```python
-multi_hop_qa_generator = MultiHopQAGenerator(
+multi_hop_qa_generator = KBCMultiHopQAGenerator(
     llm_serving=local_llm_serving,
     lang="en"
 )
@@ -261,26 +261,31 @@ Users can execute the following scripts to meet different data requirements. Not
- Knowledge base cleaning and construction for PDF files:

The following provides an example pipeline configured for the `Dataflow[vllm]` environment, demonstrating how to use multiple operators for knowledge base cleaning. This example shows how to initialize a knowledge base cleaning pipeline and sequentially execute each extraction and cleaning step.

```python
-from dataflow.operators.generate import (
-    CorpusTextSplitter,
+from dataflow.operators.knowledge_cleaning import (
+    KBCChunkGenerator,
     FileOrURLToMarkdownConverter,
-    KnowledgeCleaner,
-    MultiHopQAGenerator,
+    KBCTextCleaner,
+    KBCMultiHopQAGenerator,
 )
 from dataflow.utils.storage import FileStorage
 from dataflow.serving import APILLMServing_request
@@ -308,18 +313,18 @@ class KBCleaningPDF_APIPipeline():
During execution, this pipeline will sequentially call:

1. FileOrURLToMarkdownConverter Converts original files/URLs into Markdown
-2. CorpusTextSplitter Segments the text into chunks
-3. KnowledgeCleaner Performs comprehensive cleaning on the segmented text
-4. MultiHopQAGenerator Synthesizes QA data based on the cleaned knowledge
+2. KBCChunkGenerator Segments the text into chunks
+3. KBCTextCleaner Performs comprehensive cleaning on the segmented text
+4. KBCMultiHopQAGenerator Synthesizes QA data based on the cleaned knowledge

For detailed descriptions of each operator, refer to the "Knowledge Base Cleaning and QA Generation" section. Once executed, a JSON file will be generated in the `.cache` directory with contents as shown below.