chore Use pdf library to check file without extension #184

yuming-long · 2024-09-25T21:50:45Z

Summary

Instead of checking whether the filename ends in .pdf and returning is_pdf = false when it does not, use the existing PDF library to read the file content and decide whether it is a valid PDF.
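A minimal sketch of the content-based check described in the summary, not the code merged in this PR. The function names `parses_with_pypdf` and `is_pdf`, the `parses` parameter, and the use of `strict=True` are assumptions for illustration; only `PdfReader` and `PdfReadError` are real pypdf names.

```python
import io
from typing import BinaryIO, Callable, Union


def parses_with_pypdf(stream: BinaryIO) -> bool:
    """Return True if pypdf can open the stream as a PDF (hypothetical helper)."""
    # Imported lazily so this module still loads where pypdf is not installed.
    from pypdf import PdfReader
    from pypdf.errors import PdfReadError

    try:
        # Assumption: strict=True, so a bad header raises instead of warning.
        PdfReader(stream, strict=True)
    except (PdfReadError, UnicodeDecodeError):
        # Not a PDF. Nothing is logged here: a non-PDF is an expected input.
        return False
    return True


def is_pdf(
    content: Union[bytes, BinaryIO],
    parses: Callable[[BinaryIO], bool] = parses_with_pypdf,
) -> bool:
    """Decide from the file content, not the filename, whether this is a PDF."""
    stream = io.BytesIO(content) if isinstance(content, bytes) else content
    try:
        return parses(stream)
    finally:
        # Rewind so the caller can still read or upload the file afterwards
        # (assumes a seekable stream).
        stream.seek(0)
```

The `parses` argument is there only so the decision logic can be exercised without pypdf, for example with a stand-in that checks for the `%PDF-` header.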

awalker4 · 2024-09-26T19:19:26Z

LGTM! I'd just suggest removing the logging when a file can't be loaded in pypdf. For a .doc, I see this:

INFO: Preparing to split document for partition.
ERROR: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
WARNING: The file does not appear to be a valid PDF.
INFO: Partitioning without split.

If non-PDF files are expected to hit this path, we can remove the scary ERROR and WARNING.

…e_extend

awalker4

LGTM pending the last comment!

src/unstructured_client/_hooks/custom/pdf_utils.py

…e_extend

without forcing file extension

a487104

yuming-long requested review from Klaijan, awalker4 and micmarty-deepsense September 25, 2024 22:33

awalker4 mentioned this pull request Sep 26, 2024

chore Use pdf library to check file without extension Unstructured-IO/unstructured-js-client#115

Merged

yuming-long changed the title ~~chore [RUDOLPH-98] Use pdf library to check file without extension~~ chore Use pdf library to check file without extension Sep 26, 2024

yuming-long and others added 3 commits September 26, 2024 15:07

remove unnecessary log

0a7370a

Merge branch 'main' into yuming/RUDOLPH-98_pdf_split_not_relay_on_fil…

9439add

…e_extend

move log to info

44c7a0e

awalker4 approved these changes Sep 26, 2024

src/unstructured_client/_hooks/custom/pdf_utils.py Outdated

remove log at all

f4a990c

yuming-long enabled auto-merge (squash) September 26, 2024 23:19

yuming-long and others added 2 commits October 2, 2024 08:45

Merge branch 'main' into yuming/RUDOLPH-98_pdf_split_not_relay_on_fil…

b008cb5

…e_extend

remove caplog check in integration test

f5a188c

yuming-long merged commit 0c95c76 into main Oct 2, 2024
7 checks passed

yuming-long deleted the yuming/RUDOLPH-98_pdf_split_not_relay_on_file_extend branch October 2, 2024 22:31
