Conversation

Contributor

@pprados pprados commented Mar 4, 2025

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the PDFPlumber parser.

For more details, see #28970.

@eyurtsev
To speed up the integration of the different layers, I suggest you work in batch mode, with two PRs in parallel.


@pprados pprados force-pushed the pprados/07-zeroxpdf branch from 7167308 to 2474cd8 Compare March 5, 2025 11:27
@pprados pprados changed the title community[minor]: 06 - Refactoring ZeroxPDFLoader community[minor]: 07 - Refactoring ZeroxPDFLoader Mar 5, 2025
@pprados pprados marked this pull request as ready for review March 5, 2025 13:57
@dosubot dosubot bot added the size:XXL label Mar 5, 2025
Contributor Author

pprados commented Mar 5, 2025

@eyurtsev
To speed up the integration of the different layers, I suggest you work in batch mode, with two PRs in parallel.
This one and this one.

Collaborator

@eyurtsev eyurtsev left a comment

@pprados The original implementation is 30 lines of code. The new implementation is significantly more complex, and there is no clear value added for users or maintainers.

Have you considered publishing a PyPI package that implements the PDF parsers in the way you'd like to handle them?

Contributor Author

pprados commented Mar 13, 2025

@eyurtsev
This parser is probably the best of all. All the standardization work is intended to help users choose a parser dynamically, according to the characteristics of their files. Leaving this parser uncorrected, even though it is already present in langchain-community, prevents us from unifying the different parsers. Together with unstructured, it is the last of the series: all the others are now consistent, with similar parameters and identical properties.
I think it would be a real shame to leave this non-compliant parser in langchain-community when the code is already there and works consistently with all the others.

I understand that new loaders/parsers can be added via a dedicated module, as with langchain-unstructured (which I'm currently modifying for this purpose), or as the IBM Docling integration could be.

But for those already present, I don't think this makes it any easier to migrate to better-designed models.

For nearly 3 months now, I've been trying to add the modifications for the PDF parsers, little by little, as agreed with my AXA customer. I'd be very disappointed if I couldn't finish the job because of this last parser.

@eyurtsev
Copy link
Collaborator

This parser is probably the best of all. All standardization work is intended to help you choose a parser dynamically, according to the characteristics of your files

I do not want to merge any changes to this parser or other parsers for the purpose of choosing a parser dynamically.

The reasons are the following:

  • Dynamically choosing a parser is not a feature that the vast majority of users will use.
  • For users who would consider using it, its value has not been proven: there are no benchmarking results against actual datasets.
  • If the goal is to get good PDF extraction results, dynamically swapping between parsers isn't the blocker, and it is not going to lead to significant overall improvements in extraction quality.

Bigger picture, I don't want to modify any other parsers at all, because I don't believe there's much gain there considering the amount of effort this is taking and the downsides of changing code.

For nearly 3 months now, I've been trying to add the modifications for the PDF parsers, little by little, as indicated to my AXA customer. I'd be very disappointed if I couldn't finish the job because of the last parser.

This can be done in a separate package if it's important for you and AXA. Given that this work still needs to be validated, I don't see why it needs to be done in langchain-community or in the existing integrations.

@pprados pprados marked this pull request as ready for review March 14, 2025 14:32
@dosubot dosubot bot added the 🤖:docs label Mar 14, 2025
Contributor Author

pprados commented Mar 15, 2025

Hi @eyurtsev

From the analysis we've made of all the parsers out there, there isn't one that's right for every situation. Some are very efficient for documents coming from Word, as they rely on the PDF structure; others are better for PowerPoint documents, as they rely on image analysis of the pages and are able to recover the reading order, understand diagrams, etc. It doesn't seem possible to have one solution that works well in all situations. Cost can also be an important variable.

In my experience, every project tells me that it works well for one type of document but not for another, regardless of the parser used.
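The "right parser per document type" idea can be sketched without any langchain dependency. All names below (`HANDLERS`, `choose_parser`, the handler keys) are hypothetical illustrations of a MimeTypeBasedParser-style dispatch, not real APIs:

```python
import mimetypes

# Hypothetical mapping for illustration only: each MIME type points to the
# key of the parser best suited to that document family.
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".docx",
)
HANDLERS = {
    "application/pdf": "pdf_parser",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        "msword_parser",
}

def choose_parser(path: str, fallback: str = "text_parser") -> str:
    """Pick a handler key from the file's guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    return HANDLERS.get(mime, fallback)

print(choose_parser("report.pdf"))   # -> pdf_parser
print(choose_parser("notes.docx"))   # -> msword_parser
print(choose_parser("readme.txt"))   # -> text_parser
```

A real router could also inspect the file content (first page text, presence of images), which is what an advanced strategy like a PDFRouterParser would do.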

That's why I wanted to propose a simple, GenericLoader-compatible approach, with a split between the BlobLoader and the Parser.

To be able to write in 20 lines what is usually written in 2,000.

from langchain.indexes import index
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFPlumberParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser

vector_store = ...
record_manager = ...
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": PDFPlumberParser(),  # `ZeroxPDFParser` not found
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)

At present, it is not possible to use GenericLoader with ZeroxPDFLoader because there is no ZeroxPDFParser.
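A loader-only integration can still be plugged into a blob-parser pipeline through an adapter. This is a minimal sketch of that pattern with hypothetical names (`LoaderBackedParser`, `DummyLoader`), where a dummy loader stands in for the real Zerox-based one:

```python
import os
import tempfile
from typing import Iterator

# Hypothetical adapter: turn a path-based loader into a parser that accepts
# raw bytes, by spilling the bytes to a temporary file and delegating to
# the loader. A real blob parser would take a Blob and yield Documents.
class LoaderBackedParser:
    def __init__(self, loader_factory):
        # loader_factory: callable taking a file path and returning an
        # object with a .load() method (the usual loader shape).
        self.loader_factory = loader_factory

    def lazy_parse(self, data: bytes, suffix: str = ".pdf") -> Iterator[str]:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
            f.write(data)
            path = f.name
        try:
            yield from self.loader_factory(path).load()
        finally:
            os.unlink(path)  # clean up the temporary file

# Dummy loader standing in for a real one (which would call pyzerox):
class DummyLoader:
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, "rb") as f:
            return [f.read().decode("latin-1")]

parser = LoaderBackedParser(DummyLoader)
print(list(parser.lazy_parse(b"%PDF-1.4 hello")))  # -> ['%PDF-1.4 hello']
```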

Making an external component is possible, for PDFRouterParser or other advanced strategies, of course. The visibility of this approach will not be the same as if the solution were directly integrated into langchain-community. I'm afraid that, for several years to come, we'll see wobbly projects that implement solutions outside langchain, out of ignorance or fear of complexity. None of the projects I come across use Loaders correctly. It's a real shame, and I wanted to improve this situation.

My aim is really to simplify things: to get the maximum value out of langchain without running away from it when you can't figure out how to do something. I'll be speaking at a conference next Tuesday on this very subject (sorry, it's in French). I'm going to explain that, with a little study of the langchain code, you can come up with much more efficient solutions than writing code alongside LangChain.

langchain-unstructured

For example, there is a new langchain-unstructured project. From what I understand, this project is intended to take over all the loaders of type UnstructuredXXXLoader. The current implementations don't respect the Loader/Parser split, don't accept a Path as a filename, and don't all have unit tests. These loaders are not compatible with GenericLoader.

I'm currently working on migrating all these UnstructuredXXXLoader classes into the langchain-unstructured project, improving the Loader/Parser breakdown, adding all the tests, etc. (Draft PR here)

Am I on the right track? I don't know.

I probably should have discussed it before embarking on this work. It seems relevant to me to be able to use GenericLoader, which isn't possible at the moment.

I don't think I've managed to convince you of the relevance of a clear separation between Loaders and Parsers, which would simplify a lot of code.

In the end, it's up to you and your team. Should I continue down this path, or stop here?

@eyurtsev
Collaborator

Hi @pprados,

Making an external component is possible, for PDFRouterParser or other advanced strategies, of course. The visibility of this approach will not be the same as if the solution were directly integrated into langchain-community.

We've been migrating many of our integrations out of the langchain-community package, while hosting the documentation in langchain. This solves discovery, maintenance, and release-cycle issues.

In general, I'm very happy with supporting parsing for different file types. However, that goal is different from adding multiple implementations for PDFs.

Having a lot of different implementations without guidance in the form of hard benchmarks is likely hurting our users more than it's helping them.

I don't think I've managed to convince you of the relevance of a clear separation between Loaders and Parsers, which would simplify a lot of code.

I created the BlobParser abstraction for this very purpose.

The issue here isn't that I don't want a BlobParser, but that we're changing both the interface (making it more complex) and the implementation (also making it more complex) without any hard benchmarking numbers.

What will be very valuable is a single implementation of a PDF parser that's benchmarked properly and has a great interface.
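The "benchmarked properly" requirement can be made concrete with a small harness. A sketch with hypothetical names (`score`, `benchmark`), scoring each candidate extractor against a known ground truth; difflib's ratio is a crude quality proxy, and a real benchmark would use better metrics and actual datasets:

```python
import difflib
import time

def score(extracted: str, ground_truth: str) -> float:
    """Similarity in [0, 1] via difflib's ratio (a crude quality proxy)."""
    return difflib.SequenceMatcher(None, extracted, ground_truth).ratio()

def benchmark(parsers: dict, document: str, ground_truth: str) -> dict:
    """Time each parser on the document and score its output."""
    results = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        text = parse(document)
        results[name] = {
            "quality": round(score(text, ground_truth), 3),
            "seconds": time.perf_counter() - start,
        }
    return results

# Stand-in "parsers" (real candidates would be PDFPlumber, Zerox, etc.):
parsers = {
    "identity": lambda doc: doc,
    "lossy": lambda doc: doc[: len(doc) // 2],
}
print(benchmark(parsers, "page one text", "page one text"))
```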

Contributor Author

pprados commented Mar 31, 2025

@eyurtsev
Can I create a langchain-pdf project? Who would host it? langchain-ai, like langchain-unstructured?

I can indeed do it. It makes sense to combine the various PDF parsers into a single project, although this would only group some of them together; langchain-unstructured or docling, for example, would remain separate.

What do you think?

Contributor Author

pprados commented Apr 1, 2025

@eyurtsev

If you can create a skeleton of a langchain-rag project, I could fork it and propose an integration of the various PDF parsers. The advantage of this approach is that there's no need to maintain compatibility.
I think this new project should be available at https://github.com/langchain-ai/langchain-pdf.

What do you think?

Collaborator

eyurtsev commented Apr 1, 2025

@pprados guidelines are here: https://python.langchain.com/docs/contributing/how_to/integrations/

It's a repository that you'll need to manage either under your own user name or under your own org.

I'll check in w/ @ccurme that we're OK adding blob parsers to the list of accepted integrations (don't see why not) -- it's just not on the list right now.

Collaborator

eyurtsev commented Apr 4, 2025

@pprados

  1. we're OK with either BlobParser or DocumentLoaders
  2. here's an example of a community maintained document loader (https://python.langchain.com/docs/integrations/document_loaders/docling/)

Contributor Author

pprados commented Apr 29, 2025

@ccurme or @baskaryan, can you review this PR?

@tylermaran, what do you think? The aim is to make the integration compatible with other PDF parsers (see here).

As I'm not a pyzerox contributor, I don't intend to propose a dedicated langchain-zerox to improve the code. I suggest making the changes in langchain-community.

Collaborator

@ccurme ccurme left a comment

Closing as langchain-community has been moved to a standalone repo: https://github.com/langchain-ai/langchain-community

@ccurme ccurme closed this Apr 29, 2025
