Conversation

Contributor

@pprados pprados commented Mar 4, 2025

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the PDFPlumber parser.

For more details, see #28970.

@eyurtsev
To speed up the integration of the different layers, I suggest you work in batch mode, with two PRs in parallel.


@pprados pprados force-pushed the pprados/07-zeroxpdf branch from 7167308 to 2474cd8 Compare March 5, 2025 11:27
@pprados pprados changed the title community[minor]: 06 - Refactoring ZeroxPDFLoader community[minor]: 07 - Refactoring ZeroxPDFLoader Mar 5, 2025
@pprados pprados marked this pull request as ready for review March 5, 2025 13:57
@dosubot dosubot bot added the size:XXL label Mar 5, 2025
Contributor Author

pprados commented Mar 5, 2025

@eyurtsev
To speed up the integration of the different layers, I suggest you work in batch mode, with two PRs in parallel.
This one and this one.

Collaborator

@eyurtsev eyurtsev left a comment

@pprados The original implementation is 30 lines of code. The new implementation is significantly more complex, and there is no clear value added for users or maintainers.

Have you considered publishing a PyPI package that implements the PDF parsers in the way you'd like to handle them?

Contributor Author

pprados commented Mar 13, 2025

@eyurtsev
This parser is probably the best of all. All the standardization work is intended to help users choose a parser dynamically, according to the characteristics of their files. Leaving this parser uncorrected, even though it is already present in langchain-community, prevents us from unifying the different parsers. Together with unstructured, it is the last of the series: all the others are now consistent, with similar parameters and identical properties.
I think it would be a real shame to leave this non-compliant parser in langchain-community when the code is already there and works consistently with all the others.

I understand that new loaders/parsers can be added via a dedicated module, as with langchain-unstructured (which I'm currently modifying for this purpose), or as the IBM Docling integration could be.

But for those already present, I don't think this makes it any easier to migrate to better-designed models.

For nearly 3 months now, I've been trying to add the modifications for the PDF parsers, little by little, as agreed with my AXA customer. I'd be very disappointed if I couldn't finish the job because of this last parser.

@eyurtsev
Copy link
Collaborator

This parser is probably the best of all. All standardization work is intended to help you choose a parser dynamically, according to the characteristics of your files

I do not want to merge any changes to this parser or other parsers for the purpose of choosing a parser dynamically.

The reasons are the following:

  • Dynamically choosing a parser is not a feature that the vast majority of users will use.
  • For users who would consider using it, its value has not been proven: there are no benchmarking results against actual datasets.
  • If the goal is to get good PDF extraction results, dynamically swapping between parsers isn't the blocker, and it is not going to lead to significant overall improvements in extraction quality.

Bigger picture, I don't want to modify any other parsers at all, because I don't believe there's much gain there considering the amount of effort this is taking and the downsides of changing code.

For nearly 3 months now, I've been trying to add the modifications for the PDF parsers, little by little, as indicated to my AXA customer. I'd be very disappointed if I couldn't finish the job because of the last parser.

This can be done in a separate package if it's important for you and AXA. Given that this work still needs to be validated, I don't see why it needs to be done in langchain-community or in the existing integrations.

@pprados pprados marked this pull request as ready for review March 14, 2025 14:32
@dosubot dosubot bot added the 🤖:docs label Mar 14, 2025
Contributor Author

pprados commented Mar 15, 2025

Hi @eyurtsev

From the analysis we've made of all the parsers out there, there isn't one that's right for every situation. Some are very efficient for documents coming from Word, as they rely on the PDF structure; others are better for PowerPoint documents, as they rely on image analysis of the pages and are able to recover the reading order, understand diagrams, etc. It doesn't seem possible to have one solution that works well in all situations. Cost can also be an important variable.

In my experience, every project tells me that it works well for one type of document but not for another, regardless of the parser used.
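The "right parser per document type" idea can be sketched without any langchain dependency. All names below (`HANDLERS`, `choose_parser`, the handler keys) are hypothetical illustrations of a MimeTypeBasedParser-style dispatch, not real APIs:

```python
import mimetypes

# Hypothetical mapping for illustration only: each MIME type points to the
# key of the parser best suited to that document family.
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".docx",
)
HANDLERS = {
    "application/pdf": "pdf_parser",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        "msword_parser",
}

def choose_parser(path: str, fallback: str = "text_parser") -> str:
    """Pick a handler key from the file's guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    return HANDLERS.get(mime, fallback)

print(choose_parser("report.pdf"))   # -> pdf_parser
print(choose_parser("notes.docx"))   # -> msword_parser
print(choose_parser("readme.txt"))   # -> text_parser
```

A real router could also inspect the file content (first page text, presence of images), which is what an advanced strategy like a PDFRouterParser would do.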

That's why I wanted to propose a simple, GenericLoader-compatible approach, with a split between the BlobLoader and the Parser.

To be able to write in 20 lines what is usually written in 2,000.

from langchain.indexes import index
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFPlumberParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser

vector_store = ...
record_manager = ...
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": PDFPlumberParser(),  # `ZeroxPDFParser` not found
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)

At present, it is not possible to use GenericLoader with ZeroxPDFLoader because there is no ZeroxPDFParser.
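A loader-only integration can still be plugged into a blob-parser pipeline through an adapter. This is a minimal sketch of that pattern with hypothetical names (`LoaderBackedParser`, `DummyLoader`), where a dummy loader stands in for the real Zerox-based one:

```python
import os
import tempfile
from typing import Iterator

# Hypothetical adapter: turn a path-based loader into a parser that accepts
# raw bytes, by spilling the bytes to a temporary file and delegating to
# the loader. A real blob parser would take a Blob and yield Documents.
class LoaderBackedParser:
    def __init__(self, loader_factory):
        # loader_factory: callable taking a file path and returning an
        # object with a .load() method (the usual loader shape).
        self.loader_factory = loader_factory

    def lazy_parse(self, data: bytes, suffix: str = ".pdf") -> Iterator[str]:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
            f.write(data)
            path = f.name
        try:
            yield from self.loader_factory(path).load()
        finally:
            os.unlink(path)  # clean up the temporary file

# Dummy loader standing in for a real one (which would call pyzerox):
class DummyLoader:
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, "rb") as f:
            return [f.read().decode("latin-1")]

parser = LoaderBackedParser(DummyLoader)
print(list(parser.lazy_parse(b"%PDF-1.4 hello")))  # -> ['%PDF-1.4 hello']
```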

Making an external component is possible, for PDFRouterParser or other advanced strategies, of course. The visibility of this approach will not be the same as if the solution were directly integrated into langchain-community. I'm afraid that, for several years to come, we'll see wobbly projects that implement solutions outside langchain, out of ignorance or fear of complexity. None of the projects I come across use Loaders correctly. It's a real shame, and I wanted to improve this situation.

My aim is really to simplify things: to get the maximum value out of langchain without running away from it when you can't figure out how to do something. I'll be speaking at a conference next Tuesday on this very subject (sorry, it's in French). I'm going to explain that, with a little study of the langchain code, you can come up with much more efficient solutions than writing code alongside LangChain.

langchain-unstructured

For example, there is a new langchain-unstructured project. From what I understand, this project is intended to take over all the loaders of type UnstructuredXXXLoader. The current implementations don't respect the Loader/Parser split, don't accept a Path as a filename, and don't all have unit tests. These loaders are not compatible with GenericLoader.

I'm currently working on migrating all these UnstructuredXXXLoader classes into the langchain-unstructured project, improving the Loader/Parser breakdown, adding all the tests, etc. (Draft PR here)

Am I on the right track? I don't know.

I probably should have discussed it before embarking on this work. It seems relevant to me to be able to use GenericLoader, which isn't possible at the moment.

I don't think I've managed to convince you of the relevance of a clear separation between Loaders and Parsers, which would simplify a lot of code.

In the end, it's up to you and your team. Should I continue down this path, or stop here?

@eyurtsev
Collaborator

Hi @pprados,

Making an external component is possible, for PDFRouterParser or other advanced strategies, of course. The visibility of this approach will not be the same as if the solution were directly integrated into langchain-community.

We've been migrating many of our integrations out of the langchain-community package, while hosting the documentation in langchain. This solves discovery, maintenance, and release-cycle issues.

In general, I'm very happy with supporting parsing for different file types. However, that goal is different from adding multiple implementations for PDFs.

Having a lot of different implementations without guidance in the form of hard benchmarks is likely hurting our users more than it's helping them.

I don't think I've managed to convince you of the relevance of a clear separation between Loaders and Parsers, which would simplify a lot of code.

I created the BlobParser abstraction for this very purpose.

The issue here isn't that I don't want a BlobParser, but that we're changing both the interface (making it more complex) and the implementation (also making it more complex) without any hard benchmarking numbers.

What will be very valuable is a single implementation of a PDF parser that's benchmarked properly and has a great interface.
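The "benchmarked properly" requirement can be made concrete with a small harness. A sketch with hypothetical names (`score`, `benchmark`), scoring each candidate extractor against a known ground truth; difflib's ratio is a crude quality proxy, and a real benchmark would use better metrics and actual datasets:

```python
import difflib
import time

def score(extracted: str, ground_truth: str) -> float:
    """Similarity in [0, 1] via difflib's ratio (a crude quality proxy)."""
    return difflib.SequenceMatcher(None, extracted, ground_truth).ratio()

def benchmark(parsers: dict, document: str, ground_truth: str) -> dict:
    """Time each parser on the document and score its output."""
    results = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        text = parse(document)
        results[name] = {
            "quality": round(score(text, ground_truth), 3),
            "seconds": time.perf_counter() - start,
        }
    return results

# Stand-in "parsers" (real candidates would be PDFPlumber, Zerox, etc.):
parsers = {
    "identity": lambda doc: doc,
    "lossy": lambda doc: doc[: len(doc) // 2],
}
print(benchmark(parsers, "page one text", "page one text"))
```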

Contributor Author

pprados commented Mar 31, 2025

@eyurtsev
Can I create a langchain-pdf project? Who would host it? langchain-ai, like langchain-unstructured?

I can indeed do it. It makes sense to combine the various PDF parsers into a single project, although this would only group some of them together; langchain-unstructured or docling, for example, would remain separate.

What do you think?

Contributor Author

pprados commented Apr 1, 2025

@eyurtsev

If you can create a skeleton of a langchain-rag project, I could fork it and propose an integration of the various PDF parsers. The advantage of this approach is that there's no need to maintain compatibility.
I think this new project should be available at https://github.com/langchain-ai/langchain-pdf.

What do you think?

Collaborator

eyurtsev commented Apr 1, 2025

@pprados guidelines are here: https://python.langchain.com/docs/contributing/how_to/integrations/

It's a repository that you'll need to manage either under your own user name or under your own org.

I'll check in w/ @ccurme that we're OK adding blob parsers to the list of accepted integrations (don't see why not) -- it's just not on the list right now.

Collaborator

eyurtsev commented Apr 4, 2025

@pprados

  1. we're OK with either BlobParser or DocumentLoaders
  2. here's an example of a community maintained document loader (https://python.langchain.com/docs/integrations/document_loaders/docling/)

Contributor Author

pprados commented Apr 29, 2025

@ccurme or @baskaryan, can you review this PR?

@tylermaran, what do you think? The aim is to make the integration compatible with other PDF parsers (see here).

As I'm not a pyzerox contributor, I don't intend to propose a dedicated langchain-zerox to improve the code. I suggest making the changes in langchain-community.

Collaborator

@ccurme ccurme left a comment

Closing as langchain-community has been moved to a standalone repo: https://github.com/langchain-ai/langchain-community

@ccurme ccurme closed this Apr 29, 2025
