How to use PyPDFLoader with a BytesIO object? #17408

nikhilmakan02 · 2024-02-12T12:32:49Z


Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

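import io

from langchain_community.document_loaders import PyPDFLoader

# container_client and blob_name are defined elsewhere
stream = io.BytesIO()
container_client.download_blob(blob_name).readinto(stream)

loader = PyPDFLoader(stream)  # does not work: PyPDFLoader expects a file path
pages = loader.load_and_split()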

Description

I would like to use PyPDFLoader to read a PDF from a stream rather than from a file path. I am downloading the PDF from Azure Blob Storage. There is some logic involved in determining which file to read, so I am not using the LangChain Azure Blob Storage document loader.

I would prefer not to download the document to temporary storage and then read it from a path, if possible. I know pypdf can read a stream; it just seems that the LangChain wrapper around it does not allow this.

Any thoughts on how to work around this? Perhaps by using pypdf directly to read and split the document, and then converting the result to LangChain Document objects?
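
A minimal sketch of the pypdf-direct idea raised above, assuming pypdf and langchain-core are installed. The helper names iter_pdf_pages and load_pdf_documents and the source argument are hypothetical, chosen only for illustration; the metadata keys mirror the "source" and "page" fields that page-based loaders typically set.

```python
import io
from typing import Any, Dict, Iterator, List, Tuple


def iter_pdf_pages(stream: io.BytesIO, source: str) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (text, metadata) for each page of a PDF held in memory."""
    # Assumption: pypdf is installed. It is imported here so the dependency
    # is only required when the function is actually called.
    from pypdf import PdfReader

    stream.seek(0)  # readinto() leaves the cursor at the end of the buffer
    reader = PdfReader(stream)  # PdfReader accepts a path or a binary stream
    for page_number, page in enumerate(reader.pages):
        # extract_text() may yield nothing for image-only pages
        yield page.extract_text() or "", {"source": source, "page": page_number}


def load_pdf_documents(stream: io.BytesIO, source: str) -> List[Any]:
    """Wrap each page in a LangChain Document (assumes langchain-core is installed)."""
    from langchain_core.documents import Document

    return [
        Document(page_content=text, metadata=metadata)
        for text, metadata in iter_pdf_pages(stream, source)
    ]
```

With the stream from the example code, this would be called as load_pdf_documents(stream, blob_name). The resulting list can then be handed to a text splitter's split_documents method to get chunks comparable to what load_and_split returns.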

System Info

langchain==0.1.3
langchain-community==0.0.16
langchain-core==0.1.17

@nikhilmakan02 · 2024-02-12T12:33:03Z

dosubot[bot]
bot Feb 12, 2024

🤖

Hey @nikhilmakan02! Great to see you diving into another interesting challenge with LangChain. How's everything going on your end?

Based on the current implementation of LangChain, the PyPDFLoader class does not support loading from a BytesIO object. It is designed to work with a file path, which it uses to open and read the file's contents. Here is the relevant code from the PyPDFLoader class:

class PyPDFLoader(BasePDFLoader):
    """Load PDF using pypdf into list of documents.

    Loader chunks by page and stores page numbers in metadata.
    """

    def __init__(
        self,
        file_path: str,
        password: Optional[Union[str, bytes]] = None,
        headers: Optional[Dict] = None,
        extract_images: bool = False,
    ) -> None:
        """Initialize with a file path."""
        try:
            import pypdf  # noqa:F401
        except ImportError:
            raise ImportError(
                "pypdf package not found, please install it with " "`pip install pypdf`"
            )
        super().__init__(file_path, headers=headers)
        self.parser = PyPDFParser(password=password, extract_images=extract_images)

    def lazy_load(
        self,
    ) -> Iterator[Document]:
        """Lazy load given path as pages."""
        if self.web_path:
            blob = Blob.from_data(open(self.file_path, "rb").read(), path=self.web_path)
        else:
            blob = Blob.from_path(self.file_path)
        yield from self.parser.parse(blob)

As you can see, the file_path is used to open the file and read its contents. There is no provision for a BytesIO object in the current implementation.

However, you can create a custom loader that inherits from BaseLoader and uses PyPDFParser directly to parse a BytesIO object. Here's a rough example of how you might do this:

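from io import BytesIO
from typing import List, Optional, Union

from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.pdf import PyPDFParser
from langchain_core.documents import Document


class CustomPDFLoader(BaseLoader):
    def __init__(self, stream: BytesIO, password: Optional[Union[str, bytes]] = None, extract_images: bool = False):
        self.stream = stream
        self.parser = PyPDFParser(password=password, extract_images=extract_images)

    def load(self) -> List[Document]:
        # Wrap the raw bytes in a Blob so the parser can read them without a file path.
        blob = Blob.from_data(self.stream.getvalue())
        return list(self.parser.parse(blob))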

In this example, CustomPDFLoader takes a BytesIO object as input and uses PyPDFParser to parse the data into a list of Document objects. This should allow you to load a PDF from a BytesIO object without having to write it to a temporary file first.

Please note that this is a rough example and might need adjustments based on your specific use case. Also, remember to handle exceptions and edge cases as necessary.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


1 reply

rajkong Oct 14, 2024

@dosu-bot This is a nice solution. You may also want to accept metadata in the constructor so that it can be passed to Blob.from_data as an argument.

mycaule · 2024-06-28T16:08:53Z

mycaule
Jun 28, 2024

Another solution is to write it to your filesystem under /tmp:

with open("/tmp/myfile.pdf", "wb") as f:
    f.write(stream.getvalue())

loader = PyPDFLoader("/tmp/myfile.pdf")

But you will need a job to clean the folder regularly, and beware that page.metadata['source'] will contain that file path.

On Azure you can also use File Share instead of Storage Container, and mount it into your filesystem

https://learn.microsoft.com/en-us/azure/app-service/configure-connect-to-azure-storage?tabs=basic%2Cportal&pivots=container-linux

But I also agree with you that LangChain loaders should have an interface like pandas.read_csv, which can read from either a buffer or a filesystem path.
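
To illustrate what such a dual interface could look like, here is a small sketch using only the standard library. The name read_pdf_bytes and the PdfSource alias are hypothetical and not part of LangChain; the function simply normalizes a path or a buffer to raw bytes, which could then be passed to something like Blob.from_data.

```python
import io
import os
from typing import BinaryIO, Union

# Hypothetical alias: anything we are willing to treat as a PDF source
PdfSource = Union[str, os.PathLike, BinaryIO]


def read_pdf_bytes(source: PdfSource) -> bytes:
    """Return raw PDF bytes from either a filesystem path or a binary buffer."""
    if isinstance(source, (str, os.PathLike)):
        # Path-like input: open and read the file from disk
        with open(source, "rb") as f:
            return f.read()
    if hasattr(source, "getvalue"):
        # io.BytesIO: take the whole buffer regardless of cursor position
        return source.getvalue()
    # Any other binary file-like object: read from its current position
    return source.read()
```

The type check on the argument is the same basic idea pandas uses to accept both kinds of input, so callers do not need two separate entry points.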

2 replies

DenysMoskalenko Jul 22, 2024

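As an addition, you can use Python's tempfile module to clean everything up.

import tempfile

with tempfile.NamedTemporaryFile() as fp:
    fp.write(stream.getvalue())
    fp.flush()  # make sure the bytes are on disk before PyPDFLoader opens the path

    loader = PyPDFLoader(fp.name)
    pages = loader.load()  # load inside the block; the file is deleted on exit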

mycaule Oct 15, 2024

But you will lose the original filename in the metadata of the resulting List[Document].
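
One possible way to combine the temporary-file approach with the original filename is sketched below, using only the standard library. Both helper names (temporary_pdf_path and restore_source) are hypothetical. The sketch assumes the loaded documents expose a mutable metadata dict, as LangChain Document objects do.

```python
import io
import os
import tempfile
from contextlib import contextmanager
from typing import Any, Iterable, Iterator, List


@contextmanager
def temporary_pdf_path(stream: io.BytesIO) -> Iterator[str]:
    """Write the stream to a temporary .pdf file and delete it on exit."""
    # delete=False plus manual removal lets the file be reopened by name,
    # which the default NamedTemporaryFile mode does not allow on Windows.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(stream.getvalue())
        tmp.close()  # flush to disk before another reader opens the path
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        os.remove(tmp.name)


def restore_source(documents: Iterable[Any], original_name: str) -> List[Any]:
    """Overwrite metadata['source'] (the temp path) with the original name."""
    documents = list(documents)
    for doc in documents:
        doc.metadata["source"] = original_name
    return documents
```

Usage would look like: with temporary_pdf_path(stream) as path: pages = restore_source(PyPDFLoader(path).load(), blob_name). The load call has to happen inside the with block, because the file no longer exists once the block exits.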
