@ryyhan ryyhan commented Jan 5, 2026

Issue resolved by this Pull Request:
Resolves #2654

Description

Standardizes Page.page_no to use 1-based indexing throughout the pipeline, ensuring consistency with DoclingDocument.pages keys and backend conventions.

Changes:

  • Updated StandardPdfPipeline to initialize Page objects with 1-based page_no.
  • Removed redundant page number adjustments/offsets in ReadingOrderModel and other pipeline stages to respect the canonical 1-based page_no.
  • Ensures that Page.page_no aligns with DoclingDocument.pages keys (e.g., Page 1 is key 1).
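The alignment in the last bullet can be sketched without Docling itself. In the sketch below, `Page` is a hypothetical stand-in dataclass and `build_pages` is an illustrative helper; neither is taken from the codebase. The only assumption carried over from the description is that `page_no` starts at 1 and matches the keys of the pages mapping.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a pipeline page; not the real Docling class."""

    page_no: int  # 1-based by convention


def build_pages(page_count: int) -> dict[int, Page]:
    """Create pages numbered 1..page_count, keyed by their own page_no."""
    return {i + 1: Page(page_no=i + 1) for i in range(page_count)}


pages = build_pages(3)
# Every key equals the page_no of the page stored under it.
assert all(key == page.page_no for key, page in pages.items())
assert sorted(pages) == [1, 2, 3]
```

With this convention, a consumer can look up `pages[page.page_no]` directly, with no offset on either side.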

Checklist:

  • Documentation has been updated, if necessary. (Pending maintainer feedback on whether/where to document this change)
  • Examples have been added, if necessary.
  • Tests have been added, if necessary. (Verified locally; ready to add formal test case if requested)

github-actions bot commented Jan 5, 2026

DCO Check Passed

Thanks @ryyhan, all your commits are properly signed off. 🎉

dosubot bot commented Jan 5, 2026

Related Documentation

Checked 7 published document(s) in 0 knowledge base(s). No updates required.

mergify bot commented Jan 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
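As a rough illustration of this rule, the pattern can be tried against the two titles this PR carried (the branch-derived title and the renamed one). Treating Mergify's `~=` operator as an anchored regex search via Python's `re.search` is an assumption made for this sketch, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("Fix/issue 2654 page indexing"))  # prints False
print(is_conventional("fix: standardize page_no to 1-based indexing"))  # prints True
```

The first title fails because the pattern is case-sensitive and requires a colon directly after the type (and optional scope or `!`).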

@ryyhan ryyhan force-pushed the fix/issue-2654-page-indexing branch from f848670 to 4858d71 Compare January 5, 2026 19:10
@ryyhan ryyhan changed the title from "Fix/issue 2654 page indexing" to "fix: standardize page_no to 1-based indexing" Jan 5, 2026

@Coderxrohan Coderxrohan left a comment

Issue:
page_no indexing was inconsistent across the pipeline and model layers. Some components treated page_no as 0-based, while others assumed 1-based indexing. This led to off-by-one errors when resolving page sizes, loading backends, building provenance metadata, assembling documents, and reporting errors—especially visible in reading order processing and PDF pipelines.

Fix:
Standardized page_no to be 1-based everywhere:

  • Removed +1 adjustments in readingorder_model.py and aligned provenance generation with 1-based indexing.
  • Updated pipeline document construction to create pages using Page(page_no=i + 1).
  • Adjusted PDF backend access to compensate correctly (load_page(page.page_no - 1)).
  • Normalized page image assembly and error reporting to use the same 1-based convention.
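The backend adjustment in the list above amounts to converting the index exactly once, at the boundary with the 0-based API. A minimal sketch follows; `ToyBackend` and `load_by_page_no` are hypothetical names invented for illustration and are not Docling's backend classes.

```python
class ToyBackend:
    """Hypothetical 0-based backend; stands in for a PDF backend's load_page."""

    def __init__(self, contents: list[str]) -> None:
        self._contents = contents

    def load_page(self, index: int) -> str:
        # Reject out-of-range indices so an off-by-one cannot silently wrap around.
        if not 0 <= index < len(self._contents):
            raise IndexError(f"page index {index} out of range")
        return self._contents[index]


def load_by_page_no(backend: ToyBackend, page_no: int) -> str:
    """Translate a 1-based page_no to the backend's 0-based index at the boundary."""
    return backend.load_page(page_no - 1)


backend = ToyBackend(["first page", "second page"])
assert load_by_page_no(backend, 1) == "first page"
assert load_by_page_no(backend, 2) == "second page"
```

Keeping the subtraction in one place means every other stage can treat `page_no` as the canonical 1-based value.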

Impact:
Page numbering is now consistent and predictable throughout the system. This eliminates off-by-one bugs, aligns internal data with user-facing page numbers, and improves correctness in OCR, reading order extraction, provenance tracking, error messages, and image generation.

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/pipeline/standard_pdf_pipeline.py | 60.00% | 2 Missing ⚠️ |
| docling/pipeline/legacy_standard_pdf_pipeline.py | 50.00% | 1 Missing ⚠️ |

@ryyhan ryyhan force-pushed the fix/issue-2654-page-indexing branch from 7e223c7 to 937b876 Compare January 9, 2026 13:47
@ryyhan ryyhan requested a review from PeterStaar-IBM January 9, 2026 13:50

cau-git commented Jan 14, 2026

@ryyhan I would love to get this alignment into main. I just cloned your branch and re-ran the test data generation. It surfaced some unexpected test failures in the picture classifier and code-formula conversion tests. Could you please check how they may relate to page indexing? These tests do not fail on main. It may be that the actual values in some assertions now come from a page that is off by one, so they no longer match the expected values?

FAILED tests/test_code_formula.py::test_code_and_formula_conversion - AssertionError: mismatch in text predicted='volumuta.  At vero eos et accusam :\n    kasd gubergren, no sea takimata saam\n    ipsum dolor sit amet, consetetur\n    temper invidunt ut labore et dolore\n    vereo eos et accusam et justo duo do', gt='function add(a, b) {\n    return a + b;\n}\nconsole.log(add(3, 5));'
assert 'volumuta.  At vero eos et accusam :\n    kasd gubergren, no sea takimata saam\n    ipsum dolor sit amet, consetetur\n    temper invidunt ut labore et dolore\n    vereo eos et accusam et justo duo do' == 'function add(a, b) {\n    return a + b;\n}\nconsole.log(add(3, 5));'
  
  - function add(a, b) {
  -     return a + b;
  - }
  - console.log(add(3, 5));
  + volumuta.  At vero eos et accusam :
  +     kasd gubergren, no sea takimata saam
  +     ipsum dolor sit amet, consetetur
  +     temper invidunt ut labore et dolore
  +     vereo eos et accusam et justo duo do
FAILED tests/test_document_picture_classifier.py::test_picture_classifier - AssertionError: The prediction is wrong for the bar chart image.
assert 'map' == 'bar_chart'
  
  - bar_chart
  + map

Also, I would still like to understand whether we also need internal alignment in places other than the PDF pipelines. There are several backends, e.g. for PowerPoint and some XML dialects, that support pagination and that we should check in this regard. There are also enrichment pipelines for pictures, code, etc. which may now suffer from indexing problems.
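One possible way to audit other paginated backends is to assert that the page numbers they produce form a contiguous 1-based sequence. The helper below is a hypothetical sketch, not part of Docling; it assumes the caller can collect the page numbers of a converted document, for example from the keys of its pages mapping.

```python
def check_one_based(page_numbers: list[int]) -> None:
    """Raise ValueError unless the page numbers are exactly 1, 2, ..., n."""
    expected = list(range(1, len(page_numbers) + 1))
    if sorted(page_numbers) != expected:
        raise ValueError(f"expected pages {expected}, got {sorted(page_numbers)}")


check_one_based([1, 2, 3])  # passes silently

try:
    check_one_based([0, 1, 2])  # a 0-based sequence is rejected
except ValueError as err:
    print(err)
```

A check of this kind could run in each backend's tests, so that a 0-based sequence fails loudly instead of shifting content by one page.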

Copilot AI left a comment

Pull request overview

This pull request standardizes Page.page_no to use 1-based indexing throughout the Docling pipeline, resolving inconsistencies where page numbers were sometimes 0-indexed and sometimes 1-indexed. This ensures that Page.page_no aligns with DoclingDocument.pages keys and user-facing page references.

Changes:

  • Updated page initialization across pipelines to create Page objects with 1-based page_no values
  • Removed redundant +1 and -1 adjustments in reading order model and document assembly stages
  • Updated backend page loading calls to subtract 1 from the now-1-based page_no when calling 0-based backend APIs

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| docling/pipeline/standard_pdf_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() calls to subtract 1; removed page_no adjustments in error reporting and image generation |
| docling/pipeline/legacy_standard_pdf_pipeline.py | Updated backend.load_page() to subtract 1; removed page_no adjustments in image generation and page lookup |
| docling/pipeline/extraction_vlm_pipeline.py | Changed page processing loop to 1-based indexing; updated backend.load_page() calls and log messages |
| docling/pipeline/base_pipeline.py | Changed page initialization loop in _build_document to use 1-based indexing |
| docling/models/readingorder_model.py | Removed all +1 adjustments when accessing doc.pages or creating ProvenanceItem objects with page_no |
| docling/experimental/pipeline/threaded_layout_vlm_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() call to subtract 1 |

cau-git commented Jan 16, 2026

@ryyhan I analyzed the problem more deeply and found that the only remaining fix is to remove the -1 here: https://github.com/ryyhan/docling/blob/fix/issue-2654-page-indexing/docling/models/base_model.py#L216

Could you please apply that to your PR and also rebase to main? Then I think this is ready!
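To show why a single leftover `- 1` can produce failures like the ones reported earlier in this thread, here is a toy sketch. The code at the linked line is not reproduced in this conversation, so everything below (`page_text`, `lookup_stale`, `lookup_fixed`) is hypothetical and only models the general failure mode: an offset applied to a `page_no` that is already 1-based.

```python
# Toy page contents keyed by 1-based page number (illustrative data only).
page_text = {1: "lorem ipsum", 2: "function add(a, b)"}


def lookup_stale(page_no: int) -> str:
    """Old-style lookup that still subtracts 1 from an already 1-based page_no."""
    return page_text[page_no - 1]


def lookup_fixed(page_no: int) -> str:
    """Lookup that trusts page_no as the canonical 1-based key."""
    return page_text[page_no]


print(lookup_fixed(2))  # prints: function add(a, b)
print(lookup_stale(2))  # prints: lorem ipsum
```

In this toy model the stale lookup returns the neighbouring page's content for page 2 and raises `KeyError` for page 1, since key 0 does not exist.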

@ryyhan ryyhan force-pushed the fix/issue-2654-page-indexing branch from 937b876 to b80aaa4 Compare January 16, 2026 13:50

ryyhan commented Jan 16, 2026

Thanks for the review @cau-git! I've removed the -1 offset from docling/models/base_model.py as requested and rebased the branch onto the latest main. I also squashed the commits into a single signed-off commit. It should be ready for another look.

@cau-git cau-git self-assigned this Jan 16, 2026
@ryyhan ryyhan force-pushed the fix/issue-2654-page-indexing branch 2 times, most recently from 59191a2 to 923cb9c Compare January 16, 2026 17:56
@ryyhan ryyhan force-pushed the fix/issue-2654-page-indexing branch from 923cb9c to 1bcbe7b Compare January 16, 2026 18:10

@cau-git cau-git left a comment

LGTM 🏆

@cau-git cau-git merged commit 1b4d82d into docling-project:main Jan 19, 2026
25 of 26 checks passed

Development

Successfully merging this pull request may close these issues.

page_no is sometimes 0 indexed and sometimes 1 indexed
