How to use PyPDFLoader with a BytesIO object? #17408
-
🤖 Hey @nikhilmakan02! Great to see you diving into another interesting challenge with LangChain. How's everything going on your end?

Based on the current implementation of LangChain, PyPDFLoader is initialized with a file path:

    class PyPDFLoader(BasePDFLoader):
        """Load PDF using pypdf into list of documents.

        Loader chunks by page and stores page numbers in metadata.
        """

        def __init__(
            self,
            file_path: str,
            password: Optional[Union[str, bytes]] = None,
            headers: Optional[Dict] = None,
            extract_images: bool = False,
        ) -> None:
            """Initialize with a file path."""
            try:
                import pypdf  # noqa:F401
            except ImportError:
                raise ImportError(
                    "pypdf package not found, please install it with " "`pip install pypdf`"
                )
            super().__init__(file_path, headers=headers)
            self.parser = PyPDFParser(password=password, extract_images=extract_images)

        def lazy_load(
            self,
        ) -> Iterator[Document]:
            """Lazy load given path as pages."""
            if self.web_path:
                blob = Blob.from_data(open(self.file_path, "rb").read(), path=self.web_path)
            else:
                blob = Blob.from_path(self.file_path)
            yield from self.parser.parse(blob)

As you can see, the loader expects a file path rather than a stream. However, you can create a custom loader that inherits from BaseLoader and accepts a BytesIO object:

    from io import BytesIO
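    from typing import List, Optional, Union

    # Import paths for langchain-community 0.0.x; they may differ in other versions.
    from langchain_community.document_loaders.base import BaseLoader
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.pdf import PyPDFParser
    from langchain_core.documents import Document


    class CustomPDFLoader(BaseLoader):
        def __init__(self, stream: BytesIO, password: Optional[Union[str, bytes]] = None, extract_images: bool = False):
            self.stream = stream
            self.parser = PyPDFParser(password=password, extract_images=extract_images)

        def load(self) -> List[Document]:
            # Wrap the raw bytes in a Blob so the parser never needs a file path.
            blob = Blob.from_data(self.stream.getvalue())
            return list(self.parser.parse(blob))

In this example, CustomPDFLoader wraps the stream's bytes in a Blob and hands it to PyPDFParser, so no file path is needed. Please note that this is a rough example and might need adjustments based on your specific use case. Also, remember to handle exceptions and edge cases as necessary.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.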
-
Another solution is to write it to your filesystem first:

    with open("/tmp/myfile.pdf", "wb") as f:
        f.write(stream.getvalue())
    loader = PyPDFLoader("/tmp/myfile.pdf")

But you will need a job to clean the folder regularly, and beware that [...]

On Azure you can also use File Share instead of Storage Container, and mount it into your filesystem.

But I also agree with you that langchain loaders should have an interface like [...]
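The cleanup concern with the temporary-file approach can be reduced by letting the standard library manage the file's lifetime. This is only a minimal sketch, assuming `stream` is the BytesIO that holds the PDF; the helper name `stream_as_temp_file` is hypothetical and not part of LangChain:

```python
import os
import tempfile
from contextlib import contextmanager
from io import BytesIO
from typing import Iterator


@contextmanager
def stream_as_temp_file(stream: BytesIO, suffix: str = ".pdf") -> Iterator[str]:
    """Write the stream to a temporary file and delete it on exit.

    Hypothetical helper, not a LangChain API.
    """
    # delete=False so the file can be reopened by path on every platform;
    # removal is handled explicitly in the finally block.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(stream.getvalue())
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```

Used as `with stream_as_temp_file(stream) as path: docs = PyPDFLoader(path).load()`, the file is removed as soon as the block exits. Call `load()` inside the block, since `lazy_load` only reads the file when it is iterated.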
-
Description
I would like to use PyPDFLoader to read a PDF from a stream as opposed to a file path. I am downloading the PDF from Azure Blob Storage. There is a bit of logic for determining which file to read, hence I am not using the LangChain Azure Blob Storage Document Loader.
I would prefer not to download the document to temporary storage and then read from a path, if possible. I know pypdf can read a stream; it just seems the LangChain wrapper around it does not allow for this.
Any thoughts on how to work around this, perhaps using pypdf directly to read and split the document, then converting the result to LangChain Document objects?
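The workaround floated here, reading with pypdf directly and building LangChain documents by hand, could look roughly like the sketch below. The function name `load_pdf_from_stream` and the default `source` label are assumptions made for illustration; the `page` metadata key follows the loader's documented habit of storing page numbers in metadata.

```python
from io import BytesIO
from typing import List


def load_pdf_from_stream(stream: BytesIO, source: str = "in-memory.pdf") -> List["Document"]:
    """Read a PDF from memory with pypdf and return one Document per page.

    Hypothetical helper, not a LangChain API.
    """
    # Imported lazily, mirroring how PyPDFLoader imports pypdf.
    from langchain_core.documents import Document
    from pypdf import PdfReader

    reader = PdfReader(stream)  # PdfReader accepts a file-like object
    return [
        Document(
            # extract_text() may return nothing for image-only pages
            page_content=page.extract_text() or "",
            metadata={"source": source, "page": page_number},
        )
        for page_number, page in enumerate(reader.pages)
    ]
```

The returned list can then be handed to a text splitter's `split_documents` method for chunking, which keeps the per-page metadata on each chunk.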
System Info
langchain==0.1.3
langchain-community==0.0.16
langchain-core==0.1.17