PDF VQA Extract Pipeline docs

wongzhenhao · HeRunming · commit b18e855c5829 · 2025-11-17T14:57:35.000+08:00
diff --git a/docs/.vuepress/notes/en/guide.ts b/docs/.vuepress/notes/en/guide.ts
@@ -67,7 +67,8 @@ export const Guide: ThemeNote = defineNoteConfig({
                 "AgenticRAGPipeline",
                 "KnowledgeBaseCleaningPipeline",
                 "FuncCallPipeline",
-                "Pdf2ModelPipeline"
+                "Pdf2ModelPipeline",
+                "PDFVQAExtractPipeline",
             ]
         },
         {
diff --git a/docs/.vuepress/notes/zh/guide.ts b/docs/.vuepress/notes/zh/guide.ts
@@ -66,7 +66,8 @@ export const Guide: ThemeNote = defineNoteConfig({
                 "AgenticRAGPipeline",
                 "KnowledgeBaseCleaningPipeline",
                 "FuncCallPipeline",
-                "Pdf2ModelPipeline"
+                "Pdf2ModelPipeline",
+                "PDFVQAExtractPipeline",
             ]
         },
         {
diff --git a/docs/en/notes/guide/pipelines/PDFVQAExtractPipeline.md b/docs/en/notes/guide/pipelines/PDFVQAExtractPipeline.md
@@ -0,0 +1,203 @@
+---
+title: PDF VQA Extraction Pipeline
+createTime: 2025/11/17 14:01:55
+permalink: /en/guide/vqa_extract_optimized/
+icon: heroicons:document-text
+---
+
+# PDF VQA Extraction Pipeline
+
+## 1. Overview
+
+The **PDF VQA Extraction Pipeline** automatically extracts high-quality Q&A pairs from textbook-style PDFs. It supports both separated question/answer PDFs and interleaved PDFs, and chains together layout parsing (MinerU), subject-aware LLM extraction, and structured post-processing. Typical use cases:
+
+- Building math/physics/chemistry QA corpora from scanned books
+- Exporting QA pairs as Markdown/JSONL with figure references preserved
+
+Major stages:
+
+1. **Document layout extraction**: call MinerU to dump structured JSON + rendered page images.
+2. **LLM-based QA extraction**: prompt the `VQAExtractor` operator with subject-specific rules.
+3. **Merging & filtering**: consolidate question/answer streams, filter invalid entries, emit JSONL/Markdown plus copied images.
+
+## 2. Quick Start
+
+### Step 1: Install Dataflow (and MinerU)
+```shell
+pip install open-dataflow
+pip install "mineru[pipeline]"
+mineru-models-download
+```
+The `vlm-vllm-engine` backend requires a CUDA-capable GPU; `vlm-transformers` runs on either CPU or GPU.
+
+### Step 2: Create a workspace
+```shell
+mkdir run_dataflow
+cd run_dataflow
+```
+
+### Step 3: Initialize Dataflow
+```shell
+dataflow init
+```
+You can then add your pipeline script under `pipelines/` or any custom path.
+
+### Step 4: Configure API credentials
+Linux / macOS:
+```shell
+export DF_API_KEY="sk-xxxxx"
+```
+Windows PowerShell:
+```powershell
+$env:DF_API_KEY = "sk-xxxxx"
+```
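+In the pipeline script, set your API endpoint and model, and name the environment variable that holds the key:
+```python
+self.llm_serving = APILLMServing_request(
+    api_url="https://api.openai.com/v1/chat/completions",
+    key_name_of_api_key="DF_API_KEY",  # environment variable exported above
+    model_name="gpt-4o",               # must be a model served by api_url
+    max_workers=100,
+)
+```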
+
+### Step 5: One-click run
+```bash
+python pipelines/vqa_extract_optimized_pipeline.py
+```
+You can also import the operators into other workflows; the remainder of this doc explains the data flow in detail.
+
+## 3. Data Flow and Pipeline Logic
+
+### 1. Input data
+
+Each job is defined by a JSONL row. Two modes are supported:
+
+- **Separate PDFs**
+  ```jsonl
+  {"question_pdf_path": "/abs/path/questions.pdf", "answer_pdf_path": "/abs/path/answers.pdf", "subject": "math", "output_dir": "./output/math"}
+  ```
+- **Interleaved PDFs**
+  ```jsonl
+  {"pdf_path": "/abs/path/qa_mixed.pdf", "subject": "physics", "output_dir": "./output/physics"}
+  ```
+
+`FileStorage` handles batching/cache management:
+```python
+self.storage = FileStorage(
+    first_entry_file_name="./examples/VQA/vqa_extract_interleaved_test.jsonl",
+    cache_path="./vqa_extract_optimized_cache",
+    file_name_prefix="vqa",
+    cache_type="jsonl",
+)
+```
+
+### 2. Document layout extraction (MinerU)
+
+For each PDF (question, answer, or mixed), the pipeline calls `_extract_doc_layout` inside `VQAExtractor`. MinerU outputs:
+
+- `<book>/<backend>/<book>_content_list.json`: structured layout tokens (texts, figures, tables, IDs)
+- `<book>/<backend>/images/`: cropped page images
+
+The backend can be:
+
+- `vlm-transformers`: CPU/GPU compatible
+- `vlm-vllm-engine`: high-throughput GPU mode (requires CUDA)
+
+### 3. QA extraction (VQAExtractor)
+
+`VQAExtractor` chunks the layout JSON to respect token limits, builds subject-aware prompts (`QAExtractPrompt`), and batches LLM calls via `APILLMServing_request`. Key behaviors:
+
+- Accepts either `question_pdf_path` + `answer_pdf_path` (separate mode) or a single `pdf_path` (interleaved mode, detected automatically).
+- Copies rendered images into `output_dir/question_images` and/or `answer_images`.
+- Parses `<qa_pair>`, `<question>`, `<answer>`, `<solution>`, `<chapter>` tags from the LLM response, with figure references preserved as `<pic>tag:box</pic>`.
+
+### 4. Post-processing and outputs
+
+For each `output_dir`, the pipeline writes:
+
+1. `vqa_extracted_questions.jsonl`
+2. `vqa_extracted_answers.jsonl` (separate mode only)
+3. `vqa_merged_qa_pairs.jsonl`
+4. `vqa_filtered_qa_pairs.jsonl`
+5. `vqa_filtered_qa_pairs.md`
+6. `question_images/`, `answer_images/` (depending on mode)
+
+Filtering keeps entries where the question exists and either `answer` or `solution` is non-empty. Markdown conversion (`jsonl_to_md`) provides a human-readable summary.
+
+## 4. Output Data
+
+Each filtered record includes:
+
+- `question`: question text (with inline `<pic>` tags if figures are referenced)
+- `answer`: answer text (if extracted from answer PDF)
+- `solution`: optional worked solution (if present)
+- `label`: original numbering (e.g., “Example 3”, “习题2”)
+- `chapter_title`: chapter/section header detected on the same page
+
+Example:
+```json
+{
+  "question": "Solve for x in x^2 - 1 = 0.",
+  "answer": "x = 1 or x = -1",
+  "solution": "Factor as (x-1)(x+1)=0.",
+  "label": "Example 1",
+  "chapter_title": "Chapter 1 Quadratic Equations"
+}
+```
+
+## 5. Pipeline Example
+
+```python
+import os
+import sys
+
+parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
+if parent_dir not in sys.path:
+    sys.path.insert(0, parent_dir)
+
+from dataflow.serving import APILLMServing_request
+from dataflow.utils.storage import FileStorage
+from operators.vqa_extractor import VQAExtractor
+
+class VQA_extract_optimized_pipeline:
+    def __init__(self):
+        self.storage = FileStorage(
+            first_entry_file_name="./examples/VQA/vqa_extract_interleaved_test.jsonl",
+            cache_path="./vqa_extract_optimized_cache",
+            file_name_prefix="vqa",
+            cache_type="jsonl",
+        )
+
+        self.llm_serving = APILLMServing_request(
+            api_url="https://api.openai.com/v1/chat/completions",
+            key_name_of_api_key="DF_API_KEY",
+            model_name="gpt-4o",
+            max_workers=100,
+        )
+
+        self.vqa_extractor = VQAExtractor(
+            llm_serving=self.llm_serving
+        )
+
+    def forward(self):
+        self.vqa_extractor.run(
+            storage=self.storage.step(),
+            question_pdf_path_key="question_pdf_path",
+            answer_pdf_path_key="answer_pdf_path",
+            pdf_path_key="pdf_path",
+            subject_key="subject",
+            output_dir_key="output_dir",
+            output_jsonl_key="output_jsonl_path",
+            mineru_backend='vlm-vllm-engine',
+        )
+
+if __name__ == "__main__":
+    pipeline = VQA_extract_optimized_pipeline()
+    pipeline.forward()
+```
+
+---
+
+Pipeline source: `DataFlow/pipelines/vqa_extract_optimized_pipeline.py`
+
+Use this pipeline whenever you need structured QA data distilled directly from PDF textbooks with figure references intact.
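
The mode selection and filtering rules described in section 3 of the new English page (subsections 3 and 4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code inside `VQAExtractor`: the helper names `detect_mode` and `keep_record` are hypothetical, and treating whitespace-only strings as empty is an assumption the page does not confirm.

```python
import json


def detect_mode(job: dict) -> str:
    """Classify a job row as 'separate' or 'interleaved' from its keys (hypothetical helper)."""
    if job.get("question_pdf_path") and job.get("answer_pdf_path"):
        return "separate"
    if job.get("pdf_path"):
        return "interleaved"
    raise ValueError("job row needs question/answer PDF paths or a single pdf_path")


def keep_record(record: dict) -> bool:
    """Documented filter: a question plus a non-empty answer or solution (hypothetical helper)."""
    def has_text(key: str) -> bool:
        value = record.get(key)
        # Assumption: whitespace-only strings count as empty.
        return isinstance(value, str) and bool(value.strip())

    return has_text("question") and (has_text("answer") or has_text("solution"))


# The two input rows shown in the page, one per mode.
job_lines = [
    '{"question_pdf_path": "/abs/path/questions.pdf", "answer_pdf_path": "/abs/path/answers.pdf", "subject": "math", "output_dir": "./output/math"}',
    '{"pdf_path": "/abs/path/qa_mixed.pdf", "subject": "physics", "output_dir": "./output/physics"}',
]
modes = [detect_mode(json.loads(line)) for line in job_lines]

# Made-up merged records, used only to exercise the filter.
records = [
    {"question": "Solve for x in x^2 - 1 = 0.", "answer": "x = 1 or x = -1", "solution": ""},
    {"question": "State Newton's second law.", "answer": "", "solution": "F = ma."},
    {"question": "Orphan question", "answer": "", "solution": ""},
    {"question": "", "answer": "42", "solution": ""},
]
kept = [r for r in records if keep_record(r)]
print(modes, len(kept))
```

In this sketch the first two records survive because each has a question and at least one of `answer` or `solution`; the third has no answer or solution and the fourth has no question, so both are dropped.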
diff --git a/docs/zh/notes/guide/pipelines/PDFVQAExtractPipeline.md b/docs/zh/notes/guide/pipelines/PDFVQAExtractPipeline.md
