Contributor

@jordan-homan jordan-homan commented Nov 21, 2024

Notes

Improves client handling of very long PDF pages: when a page is too large to process, its x/y coordinates are trimmed down to a reasonable size (hi-res only). Note: text output is not affected, since the reader can still process the entire page for text.
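For readers unfamiliar with the idea, the trimming described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `MAX_PAGE_LENGTH`, `Box`, and `trim_box` are hypothetical names, the limit value is made up, and keeping the top of the page (rather than the bottom) is an assumption.

```python
from typing import NamedTuple

# Hypothetical limit in PDF points; the PR does not state the real value.
MAX_PAGE_LENGTH = 4000.0


class Box(NamedTuple):
    """Page rectangle in PDF points: lower-left (x0, y0) to upper-right (x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float


def trim_box(box: Box, max_length: float = MAX_PAGE_LENGTH) -> Box:
    """Return a box that is at most max_length wide and max_length tall.

    PDF coordinates start at the bottom-left corner, so the top edge (y1)
    is kept and the bottom edge is raised. Keeping the top of the page is
    an assumption made for this sketch.
    """
    width = box.x1 - box.x0
    height = box.y1 - box.y0
    x1 = box.x0 + min(width, max_length)   # clamp the width from the left edge
    y0 = box.y1 - min(height, max_length)  # clamp the height from the top edge
    return Box(box.x0, y0, x1, box.y1)
```

A page that already fits within the limit comes back unchanged, so the function could be applied to every page without a separate size check.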

Testing

Manually tested the changes on a large file. Added an integration test verifying that large pages now process successfully.

@jordan-homan jordan-homan force-pushed the add_page_split_logic_pdf branch 3 times, most recently from 3608682 to 3e15249 November 21, 2024 16:20
@jordan-homan jordan-homan marked this pull request as ready for review November 21, 2024 17:04
@jordan-homan jordan-homan force-pushed the add_page_split_logic_pdf branch from 3e15249 to 241da00 November 22, 2024 14:53
@jordan-homan jordan-homan changed the title from "adding logic to split pages that are too large to process" to "adding logic to trim pages that are too large to process" Nov 22, 2024
Contributor

@Klaijan Klaijan left a comment

Installed the checked-out PR locally (pip install -v -e .) and ran the test.

INFO: HTTP Request: GET https://api.unstructuredapp.io/general/docs "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
Runtime type is 'ModelMetaclass'
{'type': 'UncategorizedText', 'element_id': '0607d9a606c4a0d5355c730cea79e38a', 'text': '🔥', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'super_long_pages.pdf'}}

@jordan-homan jordan-homan merged commit 2082d4f into main Nov 22, 2024
13 checks passed
@jordan-homan jordan-homan deleted the add_page_split_logic_pdf branch November 22, 2024 19:28