Replies: 1 comment
To load embeddings for a 500-page PDF without making your laptop unresponsive, process the PDF in smaller batches. Only one batch of pages is extracted, split, and embedded at a time, which keeps memory usage bounded. Here is an updated version of your code that processes the PDF in batches:

    import pdfplumber
    from langchain.vectorstores import PGVector
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter


    def process_pdf_in_batches(file_path, batch_size=50):
        """Yield the extracted text of the PDF, batch_size pages at a time."""
        with pdfplumber.open(file_path) as pdf:
            n_pages = len(pdf.pages)
            for start_page in range(0, n_pages, batch_size):
                end_page = min(start_page + batch_size, n_pages)
                # Depending on the pdfplumber version, extract_text() may return
                # None for a page with no text, so fall back to an empty string.
                yield [pdf.pages[i].extract_text() or "" for i in range(start_page, end_page)]


    # Load PDF
    file_path = "path/to/your/500_pages_document.pdf"

    # Load embedding function
    embedding = OpenAIEmbeddings()

    # Create text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=6000, chunk_overlap=100)

    # Process PDF in batches
    for batch_texts in process_pdf_in_batches(file_path):
        # split_text takes a plain string and returns a list of string chunks,
        # which is the input type that PGVector.from_texts expects.
        texts = [chunk for text in batch_texts for chunk in text_splitter.split_text(text)]
        if not texts:
            continue  # nothing to embed in this batch

        # Add embeddings
        PGVector.from_texts(
            embedding=embedding,
            texts=texts,
            collection_name="your_collection_name",
            connection_string="your_connection_string",
        )

    print("done")

This code processes the PDF in batches of 50 pages at a time, which should help prevent your laptop from becoming unresponsive. Adjust the batch_size argument as needed. Additionally, LangChain has built-in asynchronous methods and utilities for processing large documents. For instance, the …
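As a rough illustration of the asynchronous route mentioned above, here is a minimal sketch that uses only the Python standard library (it needs Python 3.9+ for `asyncio.to_thread`). The names `batched`, `store_in_background`, and `store_batch` are hypothetical helpers made up for this example and are not part of LangChain. `store_batch` stands in for any blocking call that embeds and stores a list of text chunks.

```python
import asyncio
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Lazily yield lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


async def store_in_background(
    chunks: Iterable[str],
    store_batch: Callable[[List[str]], None],  # hypothetical blocking callable
    batch_size: int = 32,
) -> int:
    """Pass chunks to store_batch one batch at a time and return the count.

    Each call runs in a worker thread, so the event loop is not blocked
    while the embedding request and database insert are in progress.
    """
    stored = 0
    for batch in batched(chunks, batch_size):
        await asyncio.to_thread(store_batch, batch)
        stored += len(batch)
    return stored


if __name__ == "__main__":
    received: List[List[str]] = []
    # received.append is a stand-in for the real embed-and-store call.
    total = asyncio.run(
        store_in_background((f"chunk {i}" for i in range(70)), received.append)
    )
    print(total, [len(b) for b in received])  # 70 [32, 32, 6]
```

In the setting of this thread, `store_batch` could be a small function that wraps the `PGVector.from_texts` call from the code above. Batches are awaited one after another on purpose: only one batch is in flight at a time, so memory stays bounded. Running the work asynchronously keeps the calling program responsive but does not reduce the total amount of embedding work.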
Example Code
Description
I am trying to upload a 500-page PDF, but while running the code above (during the embedding step), my laptop freezes and hangs.
Can this process be run asynchronously, or what changes do I need to make to my code?
System Info
langchain
linux