ENH: Add page-range support to extract-images, extract-annotated-pages, and extract-text #215
base: main
…s, and extract-text
This enhancement introduces consistent page-range filtering across three CLI subcommands: extract-images, extract-annotated-pages, and extract-text. Each now supports two new optional arguments, --from-page and --to-page, enabling selective processing of only a portion of the PDF rather than the entire document.
Pull request overview
This PR introduces page-range filtering functionality to three CLI subcommands (extract-images, extract-annotated-pages, and extract-text) by adding optional --from and --end parameters. These parameters enable users to selectively process specific portions of a PDF document rather than the entire file, using 0-based inclusive indexing.
Key Changes:
- Added `start` and `end` optional parameters to all three extraction commands
- Implemented filtering logic in each command's main function
- Added comprehensive test coverage for the new range functionality
- Updated documentation with usage examples for the new parameters
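For readers who want a concrete picture of the 0-based inclusive semantics described above, here is a minimal sketch. It is not the code from this PR: the helper name `select_range` and its clamping and error behaviour are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_range(
    items: Sequence[T],
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[T]:
    """Return items[start..end] using 0-based, inclusive bounds.

    Hypothetical helper: a missing bound means "from the first item"
    or "to the last item"; an `end` past the last item is clamped.
    """
    first = 0 if start is None else start
    if first < 0:
        raise ValueError("start must be >= 0")
    if end is not None and end < first:
        raise ValueError("end must be >= start")
    last = len(items) - 1 if end is None else min(end, len(items) - 1)
    # Python slices exclude the upper bound, so add 1 to make `end` inclusive.
    return list(items[first : last + 1])


pages = ["p0", "p1", "p2", "p3", "p4"]
selected = select_range(pages, start=1, end=3)
```

With these assumptions, `start=1, end=3` selects three pages (indices 1, 2, and 3), which is the off-by-one detail most worth checking in the tests.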
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| pdfly/extract_images.py | Added range filtering for image extraction using global image index |
| pdfly/extract_annotated_pages.py | Added range filtering for page extraction based on page index |
| pdfly/cli.py | Added --from and --end options to extract-images, extract-annotated-pages, and extract-text commands |
| tests/test_extract_images.py | Added tests for single-image and multi-image range extraction |
| tests/test_extract_annotated_pages.py | Added tests for page range filtering with annotations |
| tests/test_cli.py | Added tests for extract-text command with range parameters |
| docs/user/subcommand-extract-text.md | Updated documentation with range parameter usage examples |
| docs/user/subcommand-extract-images.md | Updated documentation with range parameter usage examples |
| docs/user/subcommand-extract-annotated-pages.md | Updated documentation with range parameter usage examples |
| resources/file-with-invalid-offsets.pdf | File appears modified but change seems unrelated to this PR |
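The table says image extraction filters on a "global image index". One plausible reading, assumed here, is that images are counted across the whole document rather than per page. The sketch below illustrates that reading only; `iter_images_in_range` and the list-of-lists input are hypothetical stand-ins, not pdfly's actual data structures.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_images_in_range(
    pages: Iterable[List[str]],
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> Iterator[Tuple[int, int, str]]:
    """Yield (global_index, page_number, image) for images in [start, end].

    Assumption: the index counts images across all pages, 0-based,
    and both bounds are inclusive.
    """
    index = 0
    for page_number, images in enumerate(pages):
        for image in images:
            if end is not None and index > end:
                return  # past the range, no need to scan further pages
            if start is None or index >= start:
                yield index, page_number, image
            index += 1


# Four pages holding 2, 0, 1 and 2 images respectively.
document = [["a.png", "b.png"], [], ["c.png"], ["d.png", "e.png"]]
picked = list(iter_images_in_range(document, start=1, end=3))
```

Under this reading the range spans page boundaries, so `start=1, end=3` picks images from three different pages. If the PR instead intends page indices here, as the PR title suggests, the filter would apply to `page_number` rather than `index`.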
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Summary
This enhancement introduces consistent page-range filtering across three CLI subcommands: extract-images, extract-annotated-pages, and extract-text. Each now supports two new optional arguments, --from and --end, enabling selective processing of only a portion of the PDF rather than the entire document.
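As a rough illustration of how two optional range flags can be parsed, here is a sketch using the standard library's `argparse`. This is an assumption for demonstration: the PR's actual option definitions in `pdfly/cli.py` are not shown on this page and may use a different CLI framework. The program name and the `start` destination are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser with optional --from / --end flags."""
    parser = argparse.ArgumentParser(prog="extract-text-sketch")
    parser.add_argument("pdf", help="path to the input PDF")
    # `from` is a reserved word in Python, so store the value as `start`.
    parser.add_argument("--from", dest="start", type=int, default=None,
                        help="first index to process (0-based, inclusive)")
    parser.add_argument("--end", dest="end", type=int, default=None,
                        help="last index to process (0-based, inclusive)")
    return parser


# Parse an explicit argument list instead of sys.argv for the example.
args = build_parser().parse_args(["input.pdf", "--from", "2", "--end", "4"])
defaults = build_parser().parse_args(["input.pdf"])
```

Leaving both defaults as `None` lets the command distinguish "no range given" (process the whole document) from an explicit range that starts at 0.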
The code was generated by GitHub Copilot (GPT5).
e.g. Closes #194
Checklist:
- A unit test is covering the code added / modified by this PR
- In case of a new feature, docstrings have been added, with also some documentation in the `docs/` folder
- A mention of the change is present in `CHANGELOG.md`
- This PR is ready to be merged
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause license.