unstructured[minor]: 08 - Refactoring 17 unstructured loaders #17
Conversation
Thanks @pprados! Unsure when I or someone else will get to this, but wanted to let you know we're aware.

@Coniferish I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look?

Hey @pprados, I no longer work at unstructured, so I'm not sure I can help out.

Hey @Coniferish

I'm unsure, sorry

Hey @efriis I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look? langchain-ai/langchain#29709 Issue langchain-ai/langchain#30454

@ccurme, can you help out this contributor? I'm no longer at unstructured and am unsure if I'm able to continue working on this.

@baskaryan can you review this PR or assign it to someone?

@badGarnet, What do you think?

@ccurme

@ccurme can you approve the workflow?
In this PR, we propose a migration of the various `Unstructured*Loader` implementations to the `langchain-unstructured` package.

Improvements
We’ve made several key improvements:
- […] `langchain-community`)
- […] `langchain-community` (see `test_migration.py`)
- […] `Loader` is split into a `Loader`/`Parser` to allow usage with `GenericLoader`
- […] `#` prefixes) and tables in either Markdown or HTML format. It’s possible to revert to the original behavior by changing a few parameters.
- […] `keep_header_footer=False`)
- […] `Path` objects or strings
- […] `web_url`
- […] `IO` object
- […] `auto`, `fast`, `hi_res`, and `ocr_only`)
- […] `lazy_load()`
- `UnstructuredLoader` additionally supports a list of paths in `file_path`. While we don’t consider this very clean (why only this loader? Why no plural? The user could just loop), we replicate the behavior from `langchain-community`.
- `langchain-unstructured` dependencies offer the same extras as `unstructured` (csv, pdf, docx, etc.). This allows specifying a dependency on `langchain-unstructured` limited to certain file types (`langchain-unstructured[pdf]`). The previous behavior pulled in all possible formats, resulting in a package too large for environments like AWS Lambda.

With this PR, it will be possible to mark 17 loaders as "deprecated". Five dependencies on `unstructured` will remain in `langchain-community`.

Once this version is released, we plan to propose a PR to `langchain-community` to mark all `Unstructured*Loader` classes as `@deprecated`. Any changes to default parameter values will be explained in the comments.

Other dependencies on Unstructured in `langchain-community`

These are not part of unstructured […]

- […] `Unstructured`, even though `UnstructuredCHMLoader` exists. The `langchain-community` version doesn’t work with the files we tested. We are leaving this loader as-is.
- `UnstructuredLakeFSLoader`
- `SeleniumURLLoader`
- `S3FileLoader`. Use `GenericLoader` + `CloudBlobLoader`
- `UnstructuredHtmlEvaluator`

pyproject.toml
`unstructured` is a framework that can pull in a large number of dependencies, depending on the file formats it needs to process. The framework offers various extras to include only the strictly necessary dependencies, for example: `unstructured[pdf,csv]`.

`langchain-unstructured` does not currently work this way. It pulls in all dependencies from `unstructured`, resulting in very large projects that are incompatible with environments that have size limitations, such as AWS Lambda.

The change to `pyproject.toml` replicates the different extras provided by `unstructured` and propagates them into `langchain-unstructured`.

PDF
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the `UnstructuredPDFParser` and `UnstructuredPDFLoader`.

For more details, see here
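
To make the Loader/Parser split mentioned in this description more concrete, here is a minimal, standard-library-only sketch of the pattern. It does not use the real LangChain or `langchain-unstructured` classes: `Blob`, `Document`, `ParagraphParser`, and `FileLoader` are hypothetical stand-ins, and the toy parser splits plain text into paragraphs instead of calling `unstructured`. The only thing it is meant to show is the shape of the design: the parser turns raw bytes into documents, and the loader only decides where the bytes come from before delegating to the parser through `lazy_load()`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterator, Protocol, Union


@dataclass
class Blob:
    """Raw bytes plus their origin (hypothetical stand-in, not the LangChain class)."""
    data: bytes
    source: str


@dataclass
class Document:
    """Parsed chunk of text with metadata (hypothetical stand-in)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class Parser(Protocol):
    """Anything that can turn a Blob into a stream of Documents."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        ...


class ParagraphParser:
    """Toy parser: yields one Document per blank-line-separated paragraph."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.data.decode("utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            yield Document(paragraph, {"source": blob.source, "index": index})


class FileLoader:
    """Loader: reads bytes from a path (str or Path) and delegates to a parser."""

    def __init__(self, file_path: Union[str, Path], parser: Parser) -> None:
        self.file_path = Path(file_path)
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        blob = Blob(data=self.file_path.read_bytes(), source=str(self.file_path))
        yield from self.parser.lazy_parse(blob)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("First paragraph.\n\nSecond paragraph.", encoding="utf-8")
        for doc in FileLoader(str(sample), ParagraphParser()).lazy_load():
            print(doc.metadata["index"], doc.page_content)
```

Because the parser only depends on a blob, the same parser could in principle be paired with a different byte source (a directory walker, a cloud bucket reader) without changing any parsing code, which is presumably the motivation for making the parsers usable with `GenericLoader`.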
Note
I will not split this PR into multiple smaller PRs, each covering a single loader. That approach would take too much time for zero benefit (I’ve had some bad experiences with it). Either this PR works for you, and I’ll make the requested changes, or you can close it and ignore it. It will then be up to another contributor to migrate the various `Unstructured*Loader` to this project.