-
Notifications
You must be signed in to change notification settings - Fork 3.5k
fix: standardize page_no to 1-based indexing #2847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: standardize page_no to 1-based indexing #2847
Conversation
|
✅ DCO Check Passed Thanks @ryyhan, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
f848670 to
4858d71
Compare
Coderxrohan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue:
page_no indexing was inconsistent across the pipeline and model layers. Some components treated page_no as 0-based, while others assumed 1-based indexing. This led to off-by-one errors when resolving page sizes, loading backends, building provenance metadata, assembling documents, and reporting errors—especially visible in reading order processing and PDF pipelines.
Fix:
Standardized page_no to be 1-based everywhere:
Removed +1 adjustments in readingorder_model.py and aligned provenance generation with 1-based indexing.
Updated pipeline document construction to create pages using Page(page_no=i + 1).
Adjusted PDF backend access to compensate correctly (load_page(page.page_no - 1)).
Normalized page image assembly and error reporting to use the same 1-based convention.
Impact:
Page numbering is now consistent and predictable throughout the system. This eliminates off-by-one bugs, aligns internal data with user-facing page numbers, and improves correctness in OCR, reading order extraction, provenance tracking, error messages, and image generation.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
7e223c7 to
937b876
Compare
|
@ryyhan I would love to get this alignment into main. I just cloned your branch and ran the test data re-generation. There I find some strange test errors which happen on the picture classifier and code-formula conversion. Could you please check how they may relate to page indexing? These do not fail on Also, I would still like to understand if we need also internal alignment in other places than the PDF pipelines. There are several backends e.g. for Powerpoint, some XML dialects, etc which do support pagination, that we should check with regard to this. Also, there are enrichment pipelines for pictures, code etc. which may suffer from indexing problems now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request standardizes Page.page_no to use 1-based indexing throughout the Docling pipeline, resolving inconsistencies where page numbers were sometimes 0-indexed and sometimes 1-indexed. This ensures that Page.page_no aligns with DoclingDocument.pages keys and user-facing page references.
Changes:
- Updated page initialization across pipelines to create
Pageobjects with 1-basedpage_novalues - Removed redundant
+1and-1adjustments in reading order model and document assembly stages - Updated backend page loading calls to subtract 1 from the now-1-based
page_nowhen calling 0-based backend APIs
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| docling/pipeline/standard_pdf_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() calls to subtract 1; removed page_no adjustments in error reporting and image generation |
| docling/pipeline/legacy_standard_pdf_pipeline.py | Updated backend.load_page() to subtract 1; removed page_no adjustments in image generation and page lookup |
| docling/pipeline/extraction_vlm_pipeline.py | Changed page processing loop to 1-based indexing; updated backend.load_page() calls and log messages |
| docling/pipeline/base_pipeline.py | Changed page initialization loop in _build_document to use 1-based indexing |
| docling/models/readingorder_model.py | Removed all +1 adjustments when accessing doc.pages or creating ProvenanceItem objects with page_no |
| docling/experimental/pipeline/threaded_layout_vlm_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() call to subtract 1 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@ryyhan I analyzed the problem more deeply and found that the only place which needs fixing still is to remove the Could you please apply that to your PR and also rebase to main? Then I think this is ready! |
937b876 to
b80aaa4
Compare
Thanks for the review @cau-git! I've removed the |
59191a2 to
923cb9c
Compare
…2654) Signed-off-by: ryyhan <dayel.rehan@gmail.com>
923cb9c to
1bcbe7b
Compare
cau-git
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🏆
Issue resolved by this Pull Request:
Resolves #2654
Description
Standardizes
Page.page_noto use 1-based indexing throughout the pipeline, ensuring consistency withDoclingDocument.pageskeys and backend conventions.Changes:
page_no.page_no.Page.page_noaligns withDoclingDocument.pageskeys (e.g., Page 1 is key 1).Checklist: