community[minor]: 07 - Refactoring ZeroxPDFLoader #30094
eyurtsev left a comment:
@pprados The original implementation is 30 lines of code. The new implementation is significantly more complex, and there is no clear value add for users or maintainers.

Have you considered publishing a PyPI package that implements the PDF parsers in the way you'd like to handle them?
@eyurtsev I understand that the addition of new loaders/parsers can be done via a dedicated module, as for langchain-unstructured (which I'm currently modifying for this), or as could be done for the IBM Docling integration. But for the parsers already present, I don't think this will make it easier to migrate to more virtuous models. For nearly 3 months now, I've been trying to add the modifications for the PDF parsers, little by little, as indicated to my AXA customer. I'd be very disappointed if I couldn't finish the job because of the last parser.
I do not want to merge any changes to this parser or other parsers for the purpose of choosing a parser dynamically, for the following reasons:

- Bigger picture, I don't think I want to modify any other parsers at all, because I do not believe there is much of a gain there considering the amount of effort this is taking and the downsides to changing code.
- This can be done in a separate package if it's important for you and AXA. Given that this work needs to be validated, I don't see why it needs to be done in langchain-community or in existing integrations.
Hi @eyurtsev

From the analysis we've made of all the parsers out there, there isn't one that's right for every situation. Some are very efficient for documents coming from Word, as they rely on the PDF format; others are better for PowerPoint documents, as they rely on an image analysis of the pages and are able to find the reading order, understand schematics, etc. It doesn't seem possible to have one solution that works well in all situations. Cost impacts can also be an important variable. In my experience, every project tells me that it works well for one type of document but not for another, regardless of the parser used. That's why I wanted to propose a simple …, to be able to write in 20 lines what is usually written in 2,000 lines:

```python
vector_store = ...
record_manager = ...
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": PDFPlumberParser(),  # `ZeroxPDFParser` not found
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)
```

At present, it is not possible to use … . Making an external component is possible, for … . My aim is really to simplify it, and to be able to get the maximum value out of langchain-unstructured. For example, there is a new … . I'm currently working on migrating all these … .

Am I on the right track? I don't know. I probably should have talked about it before embarking on this work. It seems relevant to me to be able to use … . I don't think I've managed to convince you of the relevance of having a clear separation between … .

In the end, it's up to you or your team. Should I continue down this path, or do I stop here?
Hi @pprados,

We've been migrating many of our integrations out of the langchain-community package, while hosting the documentation in langchain. It solves discovery, maintenance, and release-cycle issues. In general, I'm very happy with parsing different file types. However, this goal is different from adding multiple implementations for PDFs. Having a lot of different implementations without guidance in the form of hard benchmarks is likely hurting our users more than it's helping them.

I created the BlobParser abstraction for this very purpose. The issue here isn't that I don't want a BlobParser, but that we're changing both the interface (making it more complex) and the implementation (creating a more complex implementation) without any hard benchmarking numbers. What would be very valuable is a single implementation of a PDF parser that's benchmarked properly and has a great interface.
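The BlobParser contract referred to here is essentially "lazily yield one Document per parsed unit". A rough, self-contained sketch of that shape (toy `Document` and parser classes, not langchain-core's actual types):

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """Toy stand-in for a parsed document: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class FormFeedPagedParser:
    """Toy blob parser: treat form-feed characters as page breaks and
    lazily yield one Document per page, tagged with its page number."""

    def lazy_parse(self, data: bytes) -> Iterator[Document]:
        for page_number, page in enumerate(data.decode("utf-8").split("\f")):
            yield Document(page_content=page, metadata={"page": page_number})


docs: List[Document] = list(FormFeedPagedParser().lazy_parse(b"page one\fpage two"))
```

Because `lazy_parse` is a generator, a caller can index or embed pages one at a time without materializing the whole file's worth of Documents in memory.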
@eyurtsev I can indeed do it. It makes sense to combine the various PDF parsers into a single project, although this will only group some of them together (for example, langchain-unstructured or docling). What do you think?
If you can create a skeleton of a langchain-rag project, I could make a fork of it and propose an integration of the various PDF parsers. The advantage of this approach is that there's no need to maintain compatibility. What do you think?
@pprados guidelines are here: https://python.langchain.com/docs/contributing/how_to/integrations/ It's a repository that you'll need to manage either under your own user name or under your own org. I'll check in with @ccurme that we're OK adding blob parsers to the list of accepted integrations (I don't see why not) -- it's just not on the list right now.
@ccurme or @baskaryan, can you review this PR? @tylermaran, what do you think? The aim is to make the integration compatible with other PDF parsers (see here). As I'm not a …
ccurme left a comment:
Closing as langchain-community has been moved to a standalone repo: https://github.com/langchain-ai/langchain-community
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the PDFPlumber parser.
For more details, see #28970.
@eyurtsev To speed up the integration of the different layers, I suggest working in batch mode, with 2 PRs in parallel.