Extract content grouped by PDF pages #4546
I'm trying MinerU to extract content from PDFs, and I'm impressed by its quality. However, I would like to extract the content split (grouped) by page (page 1 content, page 2 content, etc.), because this is how I'll index the content afterwards. MinerU returns a single merged blob containing the content of all pages, which doesn't work for me. So, would it be possible to get the content split by page somehow? I know I can split the PDF into separate PDFs (one per page) and process them separately, but that increases the processing time 5-10 times, which is too much for me. Another option I'm considering is creating a copy of the PDF with interleaved pages that carry special markers, and then splitting the output on those markers. Is there a better way to do this? Thanks!
Replies: 2 comments
Hi @pablohoffman! I'm Dosu and I’m helping the MinerU team.

MinerU already preserves page boundaries in its output, so you don't need to split the PDF. Use the following command:

    mineru -p input.pdf -o output_dir --dump-content-list

Then you can group content by page in Python:

    import json
    from collections import defaultdict

    # Load the list of content blocks produced by MinerU
    with open("content_list.json", "r", encoding="utf-8") as f:
        content = json.load(f)

    # Collect each block's text under the page it came from
    pages = defaultdict(list)
    for block in content:
        pages[block.get("page_idx")].append(block.get("text", ""))

Note on cross-page tables: By default, MinerU merges tables spanning multiple pages (keeping only the first page's entry). If you need per-page table data, set […].
You can use the results from content_list.json to identify which specific page a block belongs to. For detailed documentation on this format, please refer to https://opendatalab.github.io/MinerU/reference/output_files/#content-list-content_listjson_1
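
For indexing one document per page, the grouping idea from this thread can be sketched a little more defensively. This is only a rough sketch under assumptions, not MinerU's own API: it assumes each block in content_list.json is a JSON object with a `page_idx` field, that text blocks carry a `text` field, and that table blocks may carry their content in a `table_body` field. Check the linked output-files documentation for the exact schema of your MinerU version. The helper names `group_text_by_page` and `load_pages` and the sample data are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_text_by_page(blocks):
    """Return {page_idx: page text} from content_list.json-style blocks."""
    pages = defaultdict(list)
    for block in blocks:
        page_idx = block.get("page_idx")
        if page_idx is None:
            continue  # skip blocks that carry no page information
        # Assumption: text is in "text"; table blocks may use "table_body".
        text = block.get("text") or block.get("table_body") or ""
        if text.strip():
            pages[page_idx].append(text.strip())
    # Keep block order within a page and return pages in ascending order.
    return {idx: "\n\n".join(parts) for idx, parts in sorted(pages.items())}


def load_pages(path):
    """Load a content_list.json file and group its text by page."""
    blocks = json.loads(Path(path).read_text(encoding="utf-8"))
    return group_text_by_page(blocks)


# Hypothetical sample data shaped like the assumed schema.
sample_blocks = [
    {"type": "text", "text": "Title", "page_idx": 0},
    {"type": "text", "text": "First paragraph.", "page_idx": 0},
    {"type": "image", "img_path": "images/a.jpg", "page_idx": 0},
    {"type": "table", "table_body": "<table><tr><td>1</td></tr></table>", "page_idx": 1},
    {"type": "text", "text": "orphan"},
]
pages_by_idx = group_text_by_page(sample_blocks)
```

With a real file, `load_pages("output_dir/.../content_list.json")` would return the same kind of mapping (the path here is a placeholder). Blocks without text, such as images, are skipped, so a page containing only figures will not appear in the result.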