OpenDCAI
diff --git a/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md b/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md
Lines changed: 14 additions & 14 deletions
diff --git a/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md b/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md
Lines changed: 22 additions & 17 deletions
diff --git a/docs/en/notes/guide/quickstart/knowledge_cleaning.md b/docs/en/notes/guide/quickstart/knowledge_cleaning.md
Lines changed: 6 additions & 5 deletions
@@ -21,10 +21,10 @@ The Knowledge Base Cleaning Operator can perform extraction, organization, and c
 
 | Name                  | Applicable Type | Description                                                  | Official Repository/Paper                              |
 | --------------------- | :-------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
-| KnowledgeExtractor🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | -                                                      |
-| CorpusTextSplitter✨   | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | -                                                      |
-| KnowledgeCleaner🚀✨    | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | -                                                      |
-| MultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRAID](https://github.com/eth-medical-ai-lab/MIRIAD) |
+| FileOrURLToMarkdownConverter🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | -                                                      |
+| KBCChunkGenerator✨   | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | -                                                      |
+| KBCTextCleaner🚀✨    | Knowledge Cleaning | This operator uses an LLM to clean organized raw text, including but not limited to normalization and privacy removal. | -                                                      |
+| KBCMultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD) |
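
The three-sentence sliding window mentioned in the table can be illustrated with a minimal sketch. This is not the KBCMultiHopQAGenerator implementation: the naive regex sentence splitter and the helper names `split_sentences` and `sliding_windows` are assumptions made for illustration, and the real operator may segment and overlap text differently.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sliding_windows(sentences: List[str], size: int = 3) -> List[str]:
    """Return every run of `size` consecutive sentences, advancing one sentence at a time."""
    if not sentences:
        return []
    if len(sentences) <= size:
        # Too short for a full window: keep the whole text as a single window.
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + size]) for i in range(len(sentences) - size + 1)]


text = "A causes B. B causes C. C causes D. D causes E."
windows = sliding_windows(split_sentences(text))
for window in windows:
    print(window)
```

Each window would then be handed to an LLM prompt that asks for a question whose answer requires combining facts from more than one sentence in the window.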
 
 
 
@@ -80,7 +80,7 @@ For each operator, the following sections will detail its invocation methods and
 
 **Functional Description**:
 
-The Knowledge Extractor operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion.
+The FileOrURLToMarkdownConverter operator is a versatile document processing tool that supports extracting structured content from multiple file formats and converting it to standard Markdown format. This operator integrates multiple professional parsing engines to achieve high-precision document content conversion. Code: [FileOrURLToMarkdownConverter](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/file_or_url_to_markdown_converter.py)
 
 **Input Parameters**:
 
@@ -130,10 +130,10 @@ extracted=file_to_markdown_converter.run(
 ```
 
 
-### 2. CorpusTextSplitter
+### 2. KBCChunkGenerator
 
 **Functional Description**:
- CorpusTextSplitter is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications.
+ KBCChunkGenerator is an efficient and flexible text chunking tool specifically designed for processing large-scale text corpora. This operator supports multiple chunking strategies to intelligently segment text for various NLP tasks, with special optimization for RAG (Retrieval-Augmented Generation) applications. Code: [KBCChunkGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_chunk_generator.py)
 
 **Input Parameters**:
 
@@ -169,7 +169,7 @@ extracted=file_to_markdown_converter.run(
 **Usage Example**:
 
 ```python
-text_splitter = CorpusTextSplitter(
+text_splitter = KBCChunkGenerator(
     split_method="token",
     chunk_size=512,
     tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
@@ -182,10 +182,10 @@ text_splitter.run(
 ```
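
To make the effect of `chunk_size` concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only, not the KBCChunkGenerator code: whitespace-separated words stand in for the tokens a real tokenizer (such as the one named by `tokenizer_name`) would produce, and the function names are hypothetical.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_size: int) -> List[str]:
    """Group whitespace-separated words into chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()  # stand-in for a real tokenizer (assumption)
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def chunk_by_chars(text: str, chunk_size: int) -> List[str]:
    """Cut the text into consecutive slices of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "one two three four five six seven"
print(chunk_by_tokens(sample, 3))
print(chunk_by_chars(sample, 10))
```

Token-based chunks keep words intact, while character-based chunks can cut a word in half, which is one reason a token or sentence strategy is often preferred when the chunks feed a retriever.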
 
 
-### 3. KnowledgeCleaner
+### 3. KBCTextCleaner
 
 **Functional Description**:
-KnowledgeCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases.
+KBCTextCleaner is a professional knowledge cleaning operator specifically designed for standardizing raw content in RAG (Retrieval-Augmented Generation) systems. By leveraging large language model interfaces, it intelligently cleans and formats unstructured knowledge to enhance the accuracy and readability of knowledge bases. Code: [KBCTextCleaner](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_text_cleaner.py)
 
 **Input Parameters**:
 
@@ -217,7 +217,7 @@ KnowledgeCleaner is a professional knowledge cleaning operator specifically desi
 **Usage Example**:
 
 ```python
-knowledge_cleaner = KnowledgeCleaner(
+knowledge_cleaner = KBCTextCleaner(
     llm_serving=api_llm_serving,
     lang="en"
 )
@@ -228,9 +228,9 @@ extracted_path = knowledge_cleaner.run(
 )
 ```
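
As a rough illustration of what "normalization and privacy removal" can mean for a single chunk, the sketch below uses plain regular expressions. This is an assumption-laden stand-in: KBCTextCleaner delegates cleaning to an LLM, whereas this example only redacts e-mail addresses and collapses whitespace, and `clean_chunk` is a hypothetical helper.

```python
import re

# Simple e-mail pattern (assumption): good enough for illustration, not RFC-complete.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_chunk(text: str) -> str:
    """Redact e-mail addresses, then collapse whitespace runs into single spaces."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "Contact:   alice@example.com \n\n for   details."
print(clean_chunk(raw))
```

Rule-based passes like this are cheap and deterministic, while an LLM-based cleaner can handle cases that rules miss, at the cost of latency and nondeterminism.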
 
-###    4. MultiHopQAGenerator
+### 4. KBCMultiHopQAGenerator
 
-**Function Description**: MultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets.
+**Functional Description**: KBCMultiHopQAGenerator is a professional multi-hop QA pair generation operator, specifically designed to automatically generate question-answer pairs requiring multi-step reasoning from text data. Through its large language model interface, this operator performs intelligent text analysis and complex question construction, making it suitable for building high-quality multi-hop QA datasets. Code: [KBCMultiHopQAGenerator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/kbc_multihop_qa_generator.py)
 
 **Input Parameters**:
 
@@ -268,7 +268,7 @@ extracted_path = knowledge_cleaner.run(
   **Usage Example**:
 
     ```python
-  multi_hop_qa_generator = MultiHopQAGenerator(
+  multi_hop_qa_generator = KBCMultiHopQAGenerator(
       llm_serving=local_llm_serving,
       lang="en"
   )
 
@@ -176,15 +176,15 @@ extracted=file_to_markdown_converter.run(
 
 ### 2. Text Chunking
 
-After document extraction, the text chunking step(CorpusTextSplitter) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.
+After document extraction, the text chunking step (KBCChunkGenerator) divides the extracted long text into chunks. The system supports chunking by token, character, sentence, or semantic dimensions.
 
 **Input**: Extracted Markdown text
  **Output**: Chunked JSON file
 
 **Example**:
 
 ```python
-text_splitter = CorpusTextSplitter(
+text_splitter = KBCChunkGenerator(
     split_method="token",
     chunk_size=512,
     tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
@@ -198,13 +198,13 @@ text_splitter.run(
 
 ### 3. Knowledge Cleaning
 
-After text chunking, the Knowledge Cleaning(KnowledgeCleaner) specializes in standardizing raw knowledge content for RAG (Retrieval-Augmented Generation) systems. This process utilizes large language model interfaces to intelligently clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.
+After text chunking, the knowledge cleaning step (KBCTextCleaner) standardizes raw knowledge content for RAG (Retrieval-Augmented Generation) systems. It uses a large language model interface to clean and format unstructured knowledge, improving the accuracy and readability of the knowledge base.
 
 **Input**: Chunked JSON file
  **Output**: Cleaned JSON file
 
 ```python
-knowledge_cleaner = KnowledgeCleaner(
+knowledge_cleaner = KBCTextCleaner(
     llm_serving=api_llm_serving,
     lang="en"
 )
@@ -217,15 +217,15 @@ extracted_path = knowledge_cleaner.run(
 
 ### 4. QA Generation
 
-After knowledge cleaning, the MultiHop-QA Generation(MultiHopQAGenerator) specializes in automatically generating multi-step reasoning question-answer pairs from text data. This process uses large language model interfaces for intelligent text analysis and complex question construction, suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.
+After knowledge cleaning, the multi-hop QA generation step (KBCMultiHopQAGenerator) automatically generates question-answer pairs that require multi-step reasoning from the text data. It uses a large language model interface for text analysis and question construction, which makes it suitable for building high-quality multi-hop QA datasets. According to experiments from [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD), this QA-formatted knowledge significantly enhances RAG reasoning accuracy.
 
 **Input**: JSON-formatted plain text
  **Output**: For each text segment, generates a set of multi-hop QAs (output in JSON format)
 
 **Usage Example**:
 
 ```python
-  multi_hop_qa_generator = MultiHopQAGenerator(
+  multi_hop_qa_generator = KBCMultiHopQAGenerator(
       llm_serving=local_llm_serving,
       lang="en"
   )
@@ -261,26 +261,31 @@ Users can execute the following scripts to meet different data requirements. Not
 - Knowledge base cleaning and construction for PDF files:
 
   ```shell
-  python gpu_pipelines/kbcleaning_pipeline_pdf_vllm.py
-  python gpu_pipelines/kbcleaning_pipeline_pdf_sglang.py
+  python gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_vllm.py 
+  python gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_sglang.py
   ```
+    [kbcleaning_pipeline_pdf_vllm.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_vllm.py) 
+    [kbcleaning_pipeline_pdf_sglang.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_pdf_sglang.py)
+
 - Knowledge base cleaning and construction after URL crawling:
 
   ```shell
-  python gpu_pipelines/kbcleaning_pipeline_url_vllm.py
-  python gpu_pipelines/kbcleaning_pipeline_url_sglang.py
+  python gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_vllm.py 
+  python gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_sglang.py 
   ```
+    [kbcleaning_pipeline_url_vllm.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_vllm.py)
+    [kbcleaning_pipeline_url_sglang.py](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_url_sglang.py)
 
 ## 4. Pipeline Example
 
 The following provides an example pipeline configured for the `Dataflow[vllm]` environment, demonstrating how to use multiple operators for knowledge base cleaning. This example shows how to initialize a knowledge base cleaning pipeline and sequentially execute each extraction and cleaning step.
 
 ```python
-from dataflow.operators.generate import (
-    CorpusTextSplitter,
+from dataflow.operators.knowledge_cleaning import (
+    KBCChunkGenerator,
     FileOrURLToMarkdownConverter,
-    KnowledgeCleaner,
-    MultiHopQAGenerator,
+    KBCTextCleaner,
+    KBCMultiHopQAGenerator,
 )
 from dataflow.utils.storage import FileStorage
 from dataflow.serving import APILLMServing_request
@@ -308,18 +313,18 @@ class KBCleaningPDF_APIPipeline():
             raw_file = raw_file,
         )
 
-        self.knowledge_cleaning_step2 = CorpusTextSplitter(
+        self.knowledge_cleaning_step2 = KBCChunkGenerator(
             split_method="token",
             chunk_size=512,
             tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
         )
 
-        self.knowledge_cleaning_step3 = KnowledgeCleaner(
+        self.knowledge_cleaning_step3 = KBCTextCleaner(
             llm_serving=self.llm_serving,
             lang="en"
         )
 
-        self.knowledge_cleaning_step4 = MultiHopQAGenerator(
+        self.knowledge_cleaning_step4 = KBCMultiHopQAGenerator(
             llm_serving=self.llm_serving,
             lang="en"
         )
 
@@ -46,21 +46,22 @@ dataflow init
 Enter the script directory:
 
 ```shell
-cd gpu_pipelines/
+cd gpu_pipelines/kbcleaning
 ```
 
 ## Step 4: One-click execution
 
 ```bash
-python kbcleaning_pipeline_batch_sglang.py
+python kbcleaning_pipeline_batch_sglang.py 
 ```
+[code](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/statics/pipelines/gpu_pipelines/kbcleaning/kbcleaning_pipeline_batch_sglang.py)
 
 During execution, this pipeline will sequentially call:
 
 1. FileOrURLToMarkdownConverter: converts original files/URLs into Markdown
-2. CorpusTextSplitter  Segments the text into chunks
-3. KnowledgeCleaner  Performs comprehensive cleaning on the segmented text
-4. MultiHopQAGenerator  Synthesizes QA data based on the cleaned knowledge
+2. KBCChunkGenerator: segments the text into chunks
+3. KBCTextCleaner: performs comprehensive cleaning on the segmented text
+4. KBCMultiHopQAGenerator: synthesizes QA data based on the cleaned knowledge
 
 For detailed descriptions of each operator, refer to the "Knowledge Base Cleaning and QA Generation" section. Once executed, a JSON file will be generated in the `.cache` directory with contents as shown below.