Commit b18e855

wongzhenhao authored and HeRunming committed
PDF VQA Extract Pipeline docs
1 parent 25915a3 commit b18e855

File tree: 4 files changed, +410 -2 lines changed
docs/.vuepress/notes/en/guide.ts

Lines changed: 2 additions & 1 deletion
```diff
@@ -67,7 +67,8 @@ export const Guide: ThemeNote = defineNoteConfig({
       "AgenticRAGPipeline",
       "KnowledgeBaseCleaningPipeline",
       "FuncCallPipeline",
-      "Pdf2ModelPipeline"
+      "Pdf2ModelPipeline",
+      "PDFVQAExtractPipeline",
     ]
   },
   {
```

docs/.vuepress/notes/zh/guide.ts

Lines changed: 2 additions & 1 deletion
```diff
@@ -66,7 +66,8 @@ export const Guide: ThemeNote = defineNoteConfig({
       "AgenticRAGPipeline",
       "KnowledgeBaseCleaningPipeline",
       "FuncCallPipeline",
-      "Pdf2ModelPipeline"
+      "Pdf2ModelPipeline",
+      "PDFVQAExtractPipeline",
     ]
   },
   {
```
Lines changed: 203 additions & 0 deletions
---
title: PDF VQA Extraction Pipeline
createTime: 2025/11/17 14:01:55
permalink: /en/guide/vqa_extract_optimized/
icon: heroicons:document-text
---

# PDF VQA Extraction Pipeline

## 1. Overview

The **PDF VQA Extraction Pipeline** automatically extracts high-quality Q&A pairs from textbook-style PDFs. It supports both separated question/answer PDFs and interleaved PDFs, and chains together layout parsing (MinerU), subject-aware LLM extraction, and structured post-processing. Typical use cases:

- Building math/physics/chemistry QA corpora from scanned books
- Exporting QA pairs to Markdown/JSONL while preserving figure references

Major stages:

1. **Document layout extraction**: call MinerU to dump structured JSON plus rendered page images.
2. **LLM-based QA extraction**: prompt the `VQAExtractor` operator with subject-specific rules.
3. **Merging & filtering**: consolidate question/answer streams, filter invalid entries, and emit JSONL/Markdown plus copied images.
## 2. Quick Start

### Step 1: Install Dataflow (and MinerU)
```shell
pip install open-dataflow
pip install "mineru[pipeline]"
mineru-models-download
```
The `vlm-vllm-engine` backend requires GPU support.

### Step 2: Create a workspace
```shell
mkdir run_dataflow
cd run_dataflow
```

### Step 3: Initialize Dataflow
```shell
dataflow init
```
You can then add your pipeline script under `pipelines/` or any custom path.
### Step 4: Configure API credentials
Linux / macOS:
```shell
export DF_API_KEY="sk-xxxxx"
```
Windows PowerShell:
```powershell
$env:DF_API_KEY = "sk-xxxxx"
```
In the pipeline script, set your API endpoint and model:
```python
self.llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
    key_name_of_api_key="DF_API_KEY",
    model_name="gemini-2.5-pro",
    max_workers=100,
)
```

### Step 5: One-click run
```bash
python pipelines/vqa_extract_optimized_pipeline.py
```
You can also import the operators into other workflows; the remainder of this doc explains the data flow in detail.
## 3. Data Flow and Pipeline Logic

### 1. Input data

Each job is defined by a JSONL row. Two modes are supported:

- **Separate PDFs**
  ```jsonl
  {"question_pdf_path": "/abs/path/questions.pdf", "answer_pdf_path": "/abs/path/answers.pdf", "subject": "math", "output_dir": "./output/math"}
  ```
- **Interleaved PDFs**
  ```jsonl
  {"pdf_path": "/abs/path/qa_mixed.pdf", "subject": "physics", "output_dir": "./output/physics"}
  ```

`FileStorage` handles batching and cache management:
```python
self.storage = FileStorage(
    first_entry_file_name="./examples/VQA/vqa_extract_interleaved_test.jsonl",
    cache_path="./vqa_extract_optimized_cache",
    file_name_prefix="vqa",
    cache_type="jsonl",
)
```
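
The first-entry file is plain JSONL with one job object per line. A minimal sketch for generating it (the paths and subjects below are placeholders, not files shipped with DataFlow):
```python
import json

# Hypothetical jobs; replace the PDF paths, subjects, and output dirs with your own.
jobs = [
    {"question_pdf_path": "/abs/path/questions.pdf",
     "answer_pdf_path": "/abs/path/answers.pdf",
     "subject": "math", "output_dir": "./output/math"},
    {"pdf_path": "/abs/path/qa_mixed.pdf",
     "subject": "physics", "output_dir": "./output/physics"},
]

# Write one JSON object per line, matching the format FileStorage reads.
with open("./examples/VQA/vqa_extract_interleaved_test.jsonl", "w", encoding="utf-8") as f:
    for job in jobs:
        f.write(json.dumps(job, ensure_ascii=False) + "\n")
```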
### 2. Document layout extraction (MinerU)

For each PDF (question, answer, or mixed), the pipeline calls `_extract_doc_layout` inside `VQAExtractor`. MinerU outputs:

- `<book>/<backend>/<book>_content_list.json`: structured layout tokens (texts, figures, tables, IDs)
- `<book>/<backend>/images/`: cropped page images

The backend can be either of the following (a quick way to inspect the resulting layout JSON is sketched after this list):

- `vlm-transformers`: CPU/GPU compatible
- `vlm-vllm-engine`: high-throughput GPU mode (requires CUDA)
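
If you want to sanity-check the MinerU output before running extraction, the content list is plain JSON. A minimal inspection sketch; the path and the `type`/`text` field names are assumptions based on typical MinerU output and may differ between versions:
```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical location; substitute your book name and backend.
content_list = Path("./output/math/book/vlm-vllm-engine/book_content_list.json")
blocks = json.loads(content_list.read_text(encoding="utf-8"))

# Count block types (text, image, table, ...) to get a feel for the layout.
print(Counter(block.get("type", "unknown") for block in blocks))

# Preview the first few text blocks.
for block in blocks[:5]:
    if block.get("type") == "text":
        print(block.get("text", "")[:80])
```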
### 3. QA extraction (VQAExtractor)

`VQAExtractor` chunks the layout JSON to respect token limits, builds subject-aware prompts (`QAExtractPrompt`), and batches LLM calls via `APILLMServing_request`. Key behaviors:

- Supports `question_pdf_path` + `answer_pdf_path`, or a single `pdf_path` (interleaved mode is auto-detected).
- Copies rendered images into `output_dir/question_images` and/or `answer_images`.
- Parses `<qa_pair>`, `<question>`, `<answer>`, `<solution>`, and `<chapter>` tags from the LLM response, with figure references preserved as `<pic>tag:box</pic>` (see the parsing sketch below).
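
To illustrate the tag format, here is a minimal parsing sketch, assuming the LLM response wraps each pair in `<qa_pair>` blocks with the nested tags listed above; it is not the operator's actual implementation, and the `<pic>` value is a made-up example:
```python
import re

def parse_qa_pairs(response: str) -> list[dict]:
    """Pull <question>/<answer>/<solution>/<chapter> fields out of each <qa_pair> block."""
    pairs = []
    for block in re.findall(r"<qa_pair>(.*?)</qa_pair>", response, flags=re.DOTALL):
        record = {}
        for field in ("question", "answer", "solution", "chapter"):
            match = re.search(rf"<{field}>(.*?)</{field}>", block, flags=re.DOTALL)
            record[field] = match.group(1).strip() if match else ""
        pairs.append(record)
    return pairs

sample = (
    "<qa_pair><question>Solve x^2 - 1 = 0. <pic>fig1:0,0,100,80</pic></question>"
    "<answer>x = 1 or x = -1</answer><chapter>Quadratic Equations</chapter></qa_pair>"
)
print(parse_qa_pairs(sample))
```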
### 4. Post-processing and outputs

For each `output_dir`, the pipeline writes:

1. `vqa_extracted_questions.jsonl`
2. `vqa_extracted_answers.jsonl` (separate mode only)
3. `vqa_merged_qa_pairs.jsonl`
4. `vqa_filtered_qa_pairs.jsonl`
5. `vqa_filtered_qa_pairs.md`
6. `question_images/`, `answer_images/` (depending on mode)

Filtering keeps entries where the question exists and either `answer` or `solution` is non-empty; a sketch of this rule is shown below. Markdown conversion (`jsonl_to_md`) provides a human-readable summary.
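
A minimal sketch of that filtering rule, reading the merged file and writing the filtered one (paths are placeholders and the field names follow Section 4 below; this is an illustration, not the pipeline's code):
```python
import json

def keep(record: dict) -> bool:
    """Keep a QA pair only if it has a question and at least an answer or a solution."""
    has_question = bool(record.get("question", "").strip())
    has_answer_or_solution = bool(record.get("answer", "").strip()) or bool(record.get("solution", "").strip())
    return has_question and has_answer_or_solution

with open("./output/math/vqa_merged_qa_pairs.jsonl", encoding="utf-8") as src, \
     open("./output/math/vqa_filtered_qa_pairs.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        if keep(record):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```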
## 4. Output Data

Each filtered record includes:

- `question`: question text (with inline `<pic>` tags if figures are referenced)
- `answer`: answer text (if extracted from the answer PDF)
- `solution`: optional worked solution (if present)
- `label`: original numbering (e.g., "Example 3", "习题2")
- `chapter_title`: chapter/section header detected on the same page

Example:
```json
{
  "question": "Solve for x in x^2 - 1 = 0.",
  "answer": "x = 1 or x = -1",
  "solution": "Factor as (x-1)(x+1)=0.",
  "label": "Example 1",
  "chapter_title": "Chapter 1 Quadratic Equations"
}
```
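
The `vqa_filtered_qa_pairs.md` export is essentially these records rendered as Markdown. A rough sketch of what a `jsonl_to_md`-style conversion does (paths are placeholders; this is not the pipeline's actual helper):
```python
import json

# Hypothetical paths; adjust to your output_dir.
with open("./output/math/vqa_filtered_qa_pairs.jsonl", encoding="utf-8") as src, \
     open("./output/math/vqa_filtered_qa_pairs.md", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src, start=1):
        record = json.loads(line)
        title = record.get("label") or f"QA {i}"
        dst.write(f"## {title}\n\n")
        if record.get("chapter_title"):
            dst.write(f"*{record['chapter_title']}*\n\n")
        dst.write(f"**Question:** {record['question']}\n\n")
        if record.get("answer"):
            dst.write(f"**Answer:** {record['answer']}\n\n")
        if record.get("solution"):
            dst.write(f"**Solution:** {record['solution']}\n\n")
```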
## 5. Pipeline Example

```python
import os
import sys

parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage
from operators.vqa_extractor import VQAExtractor

class VQA_extract_optimized_pipeline:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./examples/VQA/vqa_extract_interleaved_test.jsonl",
            cache_path="./vqa_extract_optimized_cache",
            file_name_prefix="vqa",
            cache_type="jsonl",
        )

        self.llm_serving = APILLMServing_request(
            api_url="https://api.openai.com/v1/chat/completions",
            key_name_of_api_key="DF_API_KEY",
            model_name="gpt-4o",
            max_workers=100,
        )

        self.vqa_extractor = VQAExtractor(
            llm_serving=self.llm_serving
        )

    def forward(self):
        self.vqa_extractor.run(
            storage=self.storage.step(),
            question_pdf_path_key="question_pdf_path",
            answer_pdf_path_key="answer_pdf_path",
            pdf_path_key="pdf_path",
            subject_key="subject",
            output_dir_key="output_dir",
            output_jsonl_key="output_jsonl_path",
            mineru_backend='vlm-vllm-engine',
        )

if __name__ == "__main__":
    pipeline = VQA_extract_optimized_pipeline()
    pipeline.forward()
```

---

Pipeline source: `DataFlow/pipelines/vqa_extract_optimized_pipeline.py`

Use this pipeline whenever you need structured QA data distilled directly from PDF textbooks with figure references intact.
