Extract content grouped by PDF pages #4546

pablohoffman · 2026-02-25T01:12:02Z

pablohoffman
Feb 25, 2026

I'm trying MinerU to extract content from PDFs, and I'm impressed by its quality.

However, I would like the extracted content grouped by page (page 1 content, page 2 content, etc.), because that is how I'll index it afterwards.

MinerU returns a single (merged) blob of content of all pages, which doesn't work for me.

So, would it be possible to get the content split by pages somehow?

I know I can split the PDF into separate PDFs (one per page) and process them separately, but that increases the processing time 5-10 times, which is too much for me.

Another option I'm considering is creating a copy of the PDF with special marker pages interleaved between the original pages, and then splitting the output on those markers.

Is there a better way to do this? Thanks!

Answered by myhloli

Feb 25, 2026

You can use the results from content_list.json to identify which specific page a block belongs to. For detailed documentation on this format, please refer to https://opendatalab.github.io/MinerU/reference/output_files/#content-list-content_listjson_1


@pablohoffman · 2026-02-25T01:14:01Z

dosubot[bot]
bot Feb 25, 2026

Hi @pablohoffman! I'm Dosu and I’m helping the MinerU team.

MinerU already preserves page boundaries in its output! You don't need to split the PDF. Use the --dump-content-list flag to generate a content_list.json where every content block includes a page_idx field:

mineru -p input.pdf -o output_dir --dump-content-list

Then you can group content by page in Python:

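import json
from collections import defaultdict

# Load the list of content blocks from content_list.json.
with open("content_list.json", "r", encoding="utf-8") as f:
    content = json.load(f)

# Map each 0-based page index to the text of the blocks on that page.
pages = defaultdict(list)
for block in content:
    # Blocks without a "text" field contribute an empty string.
    pages[block.get("page_idx")].append(block.get("text", ""))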

The page_idx is 0-indexed and included in all content blocks.

Note on cross-page tables: By default, MinerU merges tables spanning multiple pages (keeping only the first page's entry). If you need per-page table data, set MINERU_TABLE_MERGE_ENABLE=0 to disable this behavior.
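
If MinerU is launched from a Python script, the environment variable mentioned above can be set for that one run only. The following is a minimal sketch, not an official recipe: `build_mineru_run` and `run_mineru` are hypothetical helper names, only the `-p` and `-o` options are taken from the command quoted earlier, and any other flags would have to be passed through `extra_args` after checking `mineru --help`.

```python
import os
import subprocess


def build_mineru_run(pdf_path, out_dir, merge_tables=False, extra_args=()):
    """Return (command, environment) for one MinerU run over the whole PDF."""
    # Only -p and -o come from the command quoted in this thread.
    cmd = ["mineru", "-p", str(pdf_path), "-o", str(out_dir), *extra_args]
    env = dict(os.environ)
    if not merge_tables:
        # "0" is the value this thread gives for disabling cross-page table merging.
        env["MINERU_TABLE_MERGE_ENABLE"] = "0"
    return cmd, env


def run_mineru(pdf_path, out_dir, **kwargs):
    """Run MinerU once; raises CalledProcessError on a non-zero exit code."""
    cmd, env = build_mineru_run(pdf_path, out_dir, **kwargs)
    subprocess.run(cmd, env=env, check=True)
```

Passing a copied environment to `subprocess.run` keeps the setting local to that process, so other MinerU runs started from the same shell keep the default behavior.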

To reply, just mention @dosu.


myhloli · 2026-02-25T08:53:05Z

myhloli
Feb 25, 2026
Maintainer

You can use the results from content_list.json to identify which specific page a block belongs to. For detailed documentation on this format, please refer to https://opendatalab.github.io/MinerU/reference/output_files/#content-list-content_listjson_1
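
For indexing, the blocks can be collapsed into one text string per page. The sketch below is a rough illustration under assumptions, not MinerU's own API: `block_to_text`, `group_by_page` and `load_pages` are hypothetical helpers, and the field names other than `page_idx` and `text` (`type`, `table_caption`, `table_body`, `image_caption`) are my reading of the linked documentation and should be checked against it.

```python
import json
from collections import defaultdict


def block_to_text(block):
    """Return a plain-text rendering of one content_list.json block."""
    kind = block.get("type")
    if kind == "table":
        # Assumed fields: a list of caption lines plus the table body.
        parts = list(block.get("table_caption", [])) + [block.get("table_body", "")]
    elif kind == "image":
        # Assumed field: a list of caption lines; the image itself is skipped.
        parts = list(block.get("image_caption", []))
    else:
        parts = [block.get("text", "")]
    return "\n".join(p for p in parts if p)


def group_by_page(blocks):
    """Map 1-based page number -> page text, keeping the file's block order."""
    pages = defaultdict(list)
    for block in blocks:
        text = block_to_text(block)
        if text:
            # page_idx is 0-based, so add 1 for human-readable page numbers.
            pages[block["page_idx"] + 1].append(text)
    return {page: "\n\n".join(texts) for page, texts in sorted(pages.items())}


def load_pages(path):
    """Read a content_list.json file and group its blocks by page."""
    with open(path, "r", encoding="utf-8") as f:
        return group_by_page(json.load(f))
```

Calling `load_pages("content_list.json")` would then return a dictionary keyed by page number. Pages that produce no text (for example a page holding only an uncaptioned image) are absent from the result, so the indexing step should not assume the keys are contiguous.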
