fix: standardize page_no to 1-based indexing #2847

ryyhan · 2026-01-05T19:08:09Z

Issue resolved by this Pull Request:
Resolves #2654

Description

Standardizes Page.page_no to use 1-based indexing throughout the pipeline, ensuring consistency with DoclingDocument.pages keys and backend conventions.

Changes:

- Updated StandardPdfPipeline to initialize Page objects with 1-based page_no.
- Removed redundant page number adjustments/offsets in ReadingOrderModel and other pipeline stages to respect the canonical 1-based page_no.
- Ensured that Page.page_no aligns with DoclingDocument.pages keys (e.g., Page 1 is key 1).
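
As a rough illustration of the convention described above, the sketch below creates pages with a 1-based `page_no` and looks them up under the same key. `SketchPage`, `build_pages`, and the plain dict are hypothetical stand-ins for illustration only; they are not Docling's actual `Page` or `DoclingDocument` classes.

```python
from dataclasses import dataclass


@dataclass
class SketchPage:
    """Hypothetical stand-in for a pipeline page; not Docling's Page class."""

    page_no: int  # 1-based by convention


def build_pages(page_count: int) -> list[SketchPage]:
    # range() is 0-based, so add 1 exactly once, at creation time.
    return [SketchPage(page_no=i + 1) for i in range(page_count)]


pages = build_pages(3)
# A dict keyed by page number, mirroring the "Page 1 is key 1" convention.
pages_by_no = {p.page_no: p for p in pages}
print(sorted(pages_by_no))  # [1, 2, 3]
```

With the offset applied once at construction, later stages can use `page_no` directly as the lookup key without further adjustment.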

Checklist:

Documentation has been updated, if necessary. (Pending maintainer feedback on whether/where to document this change)
Examples have been added, if necessary.
Tests have been added, if necessary. (Verified locally; ready to add formal test case if requested)

github-actions · 2026-01-05T19:08:21Z

✅ DCO Check Passed

Thanks @ryyhan, all your commits are properly signed off. 🎉

dosubot · 2026-01-05T19:08:22Z

Related Documentation

Checked 7 published document(s) in 0 knowledge base(s). No updates required.


mergify · 2026-01-05T19:08:44Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
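
To see why the retitled pull request satisfies this rule, the pattern can be tried locally. This is only a sketch: it assumes Mergify's `~=` operator behaves like an unanchored regular-expression search with Python-compatible syntax, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    # The pattern carries its own ^ anchor, so search() only matches at the start.
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix: standardize page_no to 1-based indexing"))  # True
print(is_conventional("Fix/issue 2654 page indexing"))  # False
```

The original branch-style title fails because the type prefix is case-sensitive and must be followed by a colon.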

Coderxrohan

Issue:
page_no indexing was inconsistent across the pipeline and model layers. Some components treated page_no as 0-based, while others assumed 1-based indexing. This led to off-by-one errors when resolving page sizes, loading backends, building provenance metadata, assembling documents, and reporting errors—especially visible in reading order processing and PDF pipelines.

Fix:
Standardized page_no to be 1-based everywhere:

- Removed +1 adjustments in readingorder_model.py and aligned provenance generation with 1-based indexing.
- Updated pipeline document construction to create pages using Page(page_no=i + 1).
- Adjusted PDF backend access to compensate correctly (load_page(page.page_no - 1)).
- Normalized page image assembly and error reporting to use the same 1-based convention.
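
The backend adjustment in the list above can be sketched as a single conversion at the boundary to a 0-based API. `SketchBackend` and `load_by_page_no` are hypothetical names used only for illustration; a real PDF backend will have a different interface.

```python
class SketchBackend:
    """Hypothetical 0-based backend; stands in for a real PDF backend."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = texts

    def page_count(self) -> int:
        return len(self._texts)

    def load_page(self, index: int) -> str:
        # 0-based, like a Python list.
        return self._texts[index]


def load_by_page_no(backend: SketchBackend, page_no: int) -> str:
    # Reject 0 explicitly: otherwise page_no - 1 == -1 would silently
    # return the last page through negative indexing.
    if not 1 <= page_no <= backend.page_count():
        raise ValueError(f"page_no {page_no} out of range")
    # Convert to 0-based only here, at the backend boundary.
    return backend.load_page(page_no - 1)


backend = SketchBackend(["first", "second", "third"])
print(load_by_page_no(backend, 1))  # first
```

Keeping the `- 1` in one place makes it easier to audit than offsets scattered across pipeline stages.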

Impact:
Page numbering is now consistent and predictable throughout the system. This eliminates off-by-one bugs, aligns internal data with user-facing page numbers, and improves correctness in OCR, reading order extraction, provenance tracking, error messages, and image generation.

codecov · 2026-01-08T07:50:46Z

Codecov Report

❌ Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/pipeline/standard_pdf_pipeline.py | 60.00% | 2 Missing ⚠️ |
| docling/pipeline/legacy_standard_pdf_pipeline.py | 50.00% | 1 Missing ⚠️ |


docling/pipeline/base_pipeline.py

cau-git · 2026-01-14T14:04:17Z

@ryyhan I would love to get this alignment into main. I just cloned your branch and regenerated the test data. Doing so, I see some strange test failures in the picture classifier and code-formula conversion tests. Could you please check how they may relate to page indexing? These tests do not fail on main. Possibly the actual values in some assertions now come from an off-by-one page, so they no longer match the expected values.

FAILED tests/test_code_formula.py::test_code_and_formula_conversion - AssertionError: mismatch in text predicted='volumuta.  At vero eos et accusam :\n    kasd gubergren, no sea takimata saam\n    ipsum dolor sit amet, consetetur\n    temper invidunt ut labore et dolore\n    vereo eos et accusam et justo duo do', gt='function add(a, b) {\n    return a + b;\n}\nconsole.log(add(3, 5));'
assert 'volumuta.  At vero eos et accusam :\n    kasd gubergren, no sea takimata saam\n    ipsum dolor sit amet, consetetur\n    temper invidunt ut labore et dolore\n    vereo eos et accusam et justo duo do' == 'function add(a, b) {\n    return a + b;\n}\nconsole.log(add(3, 5));'
  
  - function add(a, b) {
  -     return a + b;
  - }
  - console.log(add(3, 5));
  + volumuta.  At vero eos et accusam :
  +     kasd gubergren, no sea takimata saam
  +     ipsum dolor sit amet, consetetur
  +     temper invidunt ut labore et dolore
  +     vereo eos et accusam et justo duo do
FAILED tests/test_document_picture_classifier.py::test_picture_classifier - AssertionError: The prediction is wrong for the bar chart image.
assert 'map' == 'bar_chart'
  
  - bar_chart
  + map

Also, I would still like to understand whether we need internal alignment in places other than the PDF pipelines. Several backends that support pagination (e.g. for PowerPoint and some XML dialects) should be checked in this regard. The enrichment pipelines for pictures, code, etc. may also suffer from indexing problems now.
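
One way to audit other backends and pipelines for this would be a small invariant check on the page numbers they produce. The helper below is a hypothetical sketch, not an existing Docling utility; it only asserts that page numbers run contiguously from 1.

```python
def check_page_numbering(page_nos: list[int]) -> None:
    """Raise AssertionError unless page numbers are exactly 1..n in order."""
    expected = list(range(1, len(page_nos) + 1))
    if page_nos != expected:
        raise AssertionError(f"expected {expected}, got {page_nos}")


check_page_numbering([1, 2, 3])  # passes silently

try:
    check_page_numbering([0, 1, 2])  # a 0-based producer
except AssertionError as err:
    print(f"numbering problem: {err}")
```

A check like this could be run against each paginated backend's output in a test, assuming the page numbers can be collected into a list.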

Copilot

Pull request overview

This pull request standardizes Page.page_no to use 1-based indexing throughout the Docling pipeline, resolving inconsistencies where page numbers were sometimes 0-indexed and sometimes 1-indexed. This ensures that Page.page_no aligns with DoclingDocument.pages keys and user-facing page references.

Changes:

- Updated page initialization across pipelines to create Page objects with 1-based page_no values
- Removed redundant +1 and -1 adjustments in reading order model and document assembly stages
- Updated backend page loading calls to subtract 1 from the now-1-based page_no when calling 0-based backend APIs

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
|---|---|
| docling/pipeline/standard_pdf_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() calls to subtract 1; removed page_no adjustments in error reporting and image generation |
| docling/pipeline/legacy_standard_pdf_pipeline.py | Updated backend.load_page() to subtract 1; removed page_no adjustments in image generation and page lookup |
| docling/pipeline/extraction_vlm_pipeline.py | Changed page processing loop to 1-based indexing; updated backend.load_page() calls and log messages |
| docling/pipeline/base_pipeline.py | Changed page initialization loop in _build_document to use 1-based indexing |
| docling/models/readingorder_model.py | Removed all +1 adjustments when accessing doc.pages or creating ProvenanceItem objects with page_no |
| docling/experimental/pipeline/threaded_layout_vlm_pipeline.py | Changed page initialization loop to 1-based indexing; updated backend.load_page() call to subtract 1 |


cau-git · 2026-01-16T13:30:59Z

@ryyhan I analyzed the problem more deeply and found that the only remaining fix is to remove the -1 here: https://github.com/ryyhan/docling/blob/fix/issue-2654-page-indexing/docling/models/base_model.py#L216

Could you please apply that to your PR and also rebase to main? Then I think this is ready!
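
For readers following along, the sketch below shows in general terms how a leftover offset can pull content from the neighboring page once `page_no` is already 1-based. The dict contents and the `lookup` helper are hypothetical; this is not the code at the linked line.

```python
# Hypothetical per-page content keyed by 1-based page number.
content_by_page = {1: "page one text", 2: "page two text"}


def lookup(page_no: int, leftover_offset: int = 0) -> str:
    # leftover_offset models a stale adjustment from the 0-based era.
    return content_by_page[page_no + leftover_offset]


print(lookup(2))      # page two text
print(lookup(2, -1))  # page one text
```

A mismatch of this kind would produce plausible but wrong content rather than an exception, which is consistent with assertions comparing against text or images from a different page.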

ryyhan · 2026-01-16T13:52:40Z

Thanks for the review @cau-git! I've removed the -1 offset from docling/models/base_model.py as requested and rebased the branch onto the latest main. I also squashed the commits into a single signed-off commit. It should be ready for another look.

Addressed.

docling/pipeline/standard_pdf_pipeline.py

cau-git

LGTM 🏆

ryyhan force-pushed the fix/issue-2654-page-indexing branch from f848670 to 4858d71 Compare January 5, 2026 19:10

ryyhan changed the title ~~Fix/issue 2654 page indexing~~ fix: standardize page_no to 1-based indexing Jan 5, 2026

Coderxrohan approved these changes Jan 5, 2026

dolfim-ibm requested review from cau-git and dolfim-ibm January 6, 2026 08:01

PeterStaar-IBM previously requested changes Jan 9, 2026

docling/pipeline/base_pipeline.py

ryyhan force-pushed the fix/issue-2654-page-indexing branch from 7e223c7 to 937b876 Compare January 9, 2026 13:47

ryyhan requested a review from PeterStaar-IBM January 9, 2026 13:50

cau-git requested a review from Copilot January 15, 2026 19:15

Copilot started reviewing on behalf of cau-git January 15, 2026 19:15 View session

Copilot AI reviewed Jan 15, 2026

ryyhan force-pushed the fix/issue-2654-page-indexing branch from 937b876 to b80aaa4 Compare January 16, 2026 13:50

cau-git self-assigned this Jan 16, 2026

cau-git reviewed Jan 16, 2026

docling/pipeline/standard_pdf_pipeline.py (outdated)

ryyhan force-pushed the fix/issue-2654-page-indexing branch 2 times, most recently from 59191a2 to 923cb9c Compare January 16, 2026 17:56

fix: standardization of page_no to 1-based indexing (docling-project#…

1bcbe7b

…2654) Signed-off-by: ryyhan <dayel.rehan@gmail.com>

ryyhan force-pushed the fix/issue-2654-page-indexing branch from 923cb9c to 1bcbe7b Compare January 16, 2026 18:10

PeterStaar-IBM approved these changes Jan 17, 2026

cau-git approved these changes Jan 17, 2026

cau-git merged commit 1b4d82d into docling-project:main Jan 19, 2026
25 of 26 checks passed

dosubot bot mentioned this pull request Jan 21, 2026

Off-by-one bug in PyPdfiumPageBackend: 1-based page_no used with 0-based pypdfium2 array indexing #2901

Closed
