@pprados
Contributor

@pprados pprados commented Feb 13, 2025

When this package is installed alongside LangChain or other libraries, pinning an exact version of pdfminer.six makes it impossible to combine the different modules in one environment.

@pprados pprados marked this pull request as ready for review February 13, 2025 14:45
@pprados
Contributor Author

pprados commented Feb 13, 2025

@Coniferish can you review this tiny PR?

Coniferish
Coniferish previously approved these changes Feb 13, 2025
Contributor

@Coniferish Coniferish left a comment

LGTM

@Coniferish
Contributor

@pprados, I may have to clone your branch and make the PR myself to get the aws-region credential that's making the tests fail right now. I'm unsure why it worked with your earlier PR, but will work on this for you.

  scipy
  pypdfium2
- pdfminer-six==20240706
+ pdfminer-six>=20240706
Contributor

Doesn't this always resolve to the same package, which is currently the latest one? Why not just remove the pin altogether?

Contributor Author

Yes, but in the future, newer versions will be required.

LangChain tracks new pdfminer.six releases for PDFMinerLoader, so with an exact pin it will not be possible to combine it with unstructured.
The final objective of my series of pull requests for LangChain is to be able to choose the parser for each case, with PDFRouterLoader. This means being able to have several parsers installed at the same time. Pinning an exact version prevents this.

No problem if you do it yourself.
Take this opportunity to publish a new version, and adjust extra-pdf-image.in in unstructured to use the new version.
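
To make the argument above concrete, here is a minimal sketch using only the standard library. It treats pdfminer.six's date-based versions as plain integers, which is a simplification of real version resolution. The `satisfies` and `compatible` helpers, the candidate version `20250101`, and the downstream requirement `>=20250101` are all hypothetical and are not taken from LangChain or unstructured.

```python
import operator

# Comparison operators supported by this simplified sketch (not full PEP 440).
OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}


def satisfies(version: int, spec: str) -> bool:
    """Return True if a date-based version such as 20240706 meets a spec like '>=20240706'."""
    op, bound = spec[:2], int(spec[2:])
    return OPS[op](version, bound)


def compatible(candidates, specs):
    """Return the candidate versions accepted by every spec."""
    return [v for v in candidates if all(satisfies(v, s) for s in specs)]


# Hypothetical set of published versions; 20250101 stands in for a future release.
candidates = [20231228, 20240706, 20250101]

# Exact pin combined with a hypothetical downstream requirement for a newer release.
pinned = compatible(candidates, ["==20240706", ">=20250101"])

# Lower bound combined with the same hypothetical downstream requirement.
bounded = compatible(candidates, [">=20240706", ">=20250101"])

print(pinned)
print(bounded)
```

With the exact pin no candidate satisfies both requirements, while the lower bound leaves the newer release available. Until a newer release exists, both forms select the same version, which is the point raised in the review question.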

Contributor

Just an fyi, @cragwolfe, but the initial pin for pdfminer-six was added just a few weeks ago when removing pdfplumber (here) to maintain required packages used by scripts to pass CI. It sounds like it's a workaround we might want to fix. For easy reference, though, here's extra-pdf-image.in

@pprados, I'm confused about what you're saying needs to be adjusted in extra-pdf-image.in. The pdfminer.six version isn't pinned there, so it should install the latest, as @cragwolfe mentioned, right?

Contributor

@cragwolfe, if this PR looks good to you, can you approve my duplicate PR here? The CI failure in this one is due to a secret that's needed in CI itself and not just particular tests, so I'm unsure how to fix that for contributor PRs at the moment.

@Coniferish Coniferish self-requested a review February 14, 2025 21:26
@Coniferish Coniferish dismissed their stale review February 14, 2025 21:26

pending discussion with @cragwolfe

@Coniferish Coniferish force-pushed the pprados/fix_pdfminer_dep branch from 46b67eb to 64ecdc0 Compare February 14, 2025 22:04
@Coniferish Coniferish mentioned this pull request Feb 17, 2025
@Coniferish Coniferish merged commit 5d6e50b into Unstructured-IO:main Feb 20, 2025
48 of 56 checks passed
@dhdaines

dhdaines commented Mar 2, 2025

Hi, this is a bad idea (as opposed to your previous version pin, which was a really bad idea). The minimum 20240706 version of pdfminer.six has a lot of showstopper bugs, which is why pdfplumber depends on an older version, and there aren't any new versions on the horizon. So, unstructured has gone from (vacuously, it seems) depending on pdfplumber to now making it impossible to install pdfplumber and unstructured at the same time.

Is there any particular reason we can't set a lower bound on an older, less broken version of pdfminer.six?
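
The incompatibility described above can be sketched as a check between a lower bound and another package's exact pin. This is an illustration under stated assumptions: the pin `pdfminer.six==20231228` is assumed for the example and should be checked against the other package's real metadata, and `normalize`, `parse`, and `can_coexist` are hypothetical helpers that handle only `==` and `>=` with integer versions.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 name normalization: 'pdfminer.six' and 'pdfminer-six' name the same project."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse(req: str):
    """Split a simple requirement like 'pdfminer.six==20231228' into (name, op, version)."""
    m = re.fullmatch(r"([A-Za-z0-9._-]+?)(==|>=)(\d+)", req.strip())
    if m is None:
        raise ValueError(f"unsupported requirement: {req!r}")
    return normalize(m.group(1)), m.group(2), int(m.group(3))


def can_coexist(lower_bound_req: str, exact_pin_req: str) -> bool:
    """True if the version pinned by exact_pin_req also satisfies lower_bound_req."""
    name_a, op_a, bound = parse(lower_bound_req)
    name_b, op_b, pinned = parse(exact_pin_req)
    if name_a != name_b:
        return True  # different projects never conflict in this sketch
    if op_a != ">=" or op_b != "==":
        raise ValueError("expected a '>=' requirement and an '==' requirement")
    return pinned >= bound


# Assumed exact pin of the other package (illustrative only).
other = "pdfminer.six==20231228"

with_current_bound = can_coexist("pdfminer-six>=20240706", other)
with_older_bound = can_coexist("pdfminer-six>=20231228", other)
print(with_current_bound, with_older_bound)
```

Under these assumptions the current lower bound excludes the older pin, while a lower bound set at or below the pinned version would let both packages install together.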

badGarnet added a commit that referenced this pull request Mar 5, 2025
Duplicate of #410 because of CI issues with secrets from contributor-initiated PR

---------

Co-authored-by: Philippe Prados <[email protected]>
Co-authored-by: Yao You <[email protected]>