```python
load_dotenv()

# Establish connection to PostgreSQL database using environment variables
conn = psycopg2.connect(
    database=os.getenv("SCW_DB_NAME"),
    user=os.getenv("SCW_DB_USER"),
    password=os.getenv("SCW_DB_PASSWORD"),
    host=os.getenv("SCW_DB_HOST"),
    port=os.getenv("SCW_DB_PORT")
)

# Create a cursor to execute SQL commands
cur = conn.cursor()
```
### Set Up Document Loaders for Object Storage
In this section, we will use LangChain to load documents stored in your Scaleway Object Storage bucket. The document loader retrieves the contents of each document for further processing, such as vectorization or embedding generation.
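As an illustration of what a document loader does, the minimal sketch below loads a single object with LangChain's `S3FileLoader`, pointed at a Scaleway endpoint. The bucket name, object key, and environment variable names are placeholders for this example, and the loader relies on the `unstructured` package to parse file contents.

```python
import os
from langchain_community.document_loaders import S3FileLoader

# Hypothetical bucket, key, and environment variable names; adapt to your setup.
loader = S3FileLoader(
    bucket=os.getenv("SCW_BUCKET_NAME"),
    key="docs/example.txt",
    endpoint_url=os.getenv("SCW_BUCKET_ENDPOINT"),  # e.g. "https://s3.fr-par.scw.cloud"
    aws_access_key_id=os.getenv("SCW_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("SCW_SECRET_KEY"),
)

documents = loader.load()  # returns a list of LangChain Document objects
print(documents[0].page_content[:200])
```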
Use the S3FileLoader to load documents and split them into chunks. Then, embed and store them in your PostgreSQL database.
1. Load Metadata for Improved Efficiency: By loading the metadata for all objects in your bucket first, you can speed up the process significantly. This allows you to quickly check whether a document has already been embedded, without loading the entire document.

The key reason for this metadata-first approach is to avoid reprocessing documents that have already been embedded. In the context of Retrieval-Augmented Generation (RAG), reprocessing the same document multiple times is redundant and inefficient. Checking each object's key against the database tells us whether it still needs to be loaded and embedded.
In this code sample, we take the following steps (a short sketch follows the list):
- Set Up a Boto3 Session: We initialize a Boto3 session. Boto3 is the AWS SDK for Python and is fully compatible with Scaleway Object Storage; the session manages the configuration, including credentials and settings, that Boto3 uses for API requests.
- Create an S3 Client: We create an S3 client to interact with the Scaleway Object Storage service.
- Set Up Pagination for Listing Objects: We prepare a paginator to handle potentially large lists of objects efficiently.
- Iterate Through the Bucket: We start the pagination process to list all objects within the specified Scaleway Object Storage bucket.
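Here is a minimal sketch of these four steps. The endpoint and credential environment variable names (`SCW_BUCKET_ENDPOINT`, `SCW_ACCESS_KEY`, `SCW_SECRET_KEY`, `SCW_BUCKET_NAME`) are assumptions for illustration; adapt them to your own configuration.

```python
import os
import boto3

# Hypothetical environment variable names; replace them with your own settings.
session = boto3.session.Session()
client = session.client(
    service_name="s3",
    endpoint_url=os.getenv("SCW_BUCKET_ENDPOINT"),  # e.g. "https://s3.fr-par.scw.cloud"
    aws_access_key_id=os.getenv("SCW_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("SCW_SECRET_KEY"),
    region_name=os.getenv("SCW_DEFAULT_REGION", "fr-par"),
)

# A paginator ensures buckets with more than 1,000 objects are listed completely.
paginator = client.get_paginator("list_objects_v2")
page_iterator = paginator.paginate(Bucket=os.getenv("SCW_BUCKET_NAME"))

for page in page_iterator:
    for obj in page.get("Contents", []):
        # Only metadata (key, size, last modified) is retrieved here;
        # the object body itself is not downloaded.
        print(obj["Key"], obj["Size"])
```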
2. Iterate Through Metadata: Next, we will iterate through the metadata to determine if each object has already been embedded. If an object hasn’t been processed yet, we will embed it and load it into the database.
cur.execute("INSERT INTO object_loaded (object_key) VALUES (%s)",
241
+
(obj['Key'],))
242
+
exceptExceptionas e:
243
+
logger.error(f"An error occurred: {e}")
244
+
245
+
conn.commit()
226
246
```
- S3FileLoader: The S3FileLoader loads each file individually from your ***Scaleway Object Storage bucket*** using the file's object_key (extracted from the file's metadata). It ensures that only the specific file is loaded from the bucket, minimizing the amount of data retrieved at any given time.
- RecursiveCharacterTextSplitter: The RecursiveCharacterTextSplitter breaks each document into smaller chunks of text. This is crucial because embedding models, like those used in Retrieval-Augmented Generation (RAG), typically have a limited context window (the number of tokens they can process at once).
  - Chunk Size: Here, the chunk size is set to 480 characters, with an overlap of 20 characters. The choice of 480 characters is based on the context size supported by the embedding model. Models have a maximum number of tokens they can process in a single pass, often around 512 tokens or fewer, depending on the specific model you are using. Keeping chunks at 480 characters leaves a buffer below this limit, since different models tokenize characters into variable-length tokens.
  - Chunk Overlap: The 20-character overlap ensures continuity between chunks, which helps prevent loss of meaning or context between segments.
- Embedding the Chunks: For each document, the text is split into smaller chunks using the text splitter, and an embedding is generated for each chunk using the embeddings.embed_query(chunk) function. This function transforms each chunk into a vector representation that can later be used for similarity search.
- Embedding Storage: After generating the embeddings for each chunk, they are stored in a vector database (e.g., PostgreSQL with pgvector) using the vector_store.add_embeddings(embedding, chunk) method. Each embedding is stored alongside its corresponding text chunk, enabling retrieval during a query.
- Avoiding Redundant Processing: The script checks the object_loaded table in PostgreSQL to see if a document has already been processed (i.e., the object_key exists in the table). If it has, the file is skipped, avoiding redundant downloads, vectorization, and database inserts. This ensures that only new or modified documents are processed, reducing the system's computational load and saving both time and resources. A condensed sketch of this loop follows below.
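To make the flow above concrete, here is a condensed sketch of the loop. It assumes the `conn`, `cur`, `page_iterator`, `embeddings`, `vector_store`, and `logger` objects created in the earlier steps, uses hypothetical environment variable names for the bucket credentials, and passes texts and embeddings to `add_embeddings` as keyword arguments (matching the `langchain_community` PGVector signature).

```python
import os
from langchain_community.document_loaders import S3FileLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=480, chunk_overlap=20)

for page in page_iterator:
    for obj in page.get("Contents", []):
        # Skip objects that were already embedded in a previous run.
        cur.execute("SELECT 1 FROM object_loaded WHERE object_key = %s", (obj["Key"],))
        if cur.fetchone():
            continue
        try:
            loader = S3FileLoader(
                bucket=os.getenv("SCW_BUCKET_NAME"),           # hypothetical variable names
                key=obj["Key"],
                endpoint_url=os.getenv("SCW_BUCKET_ENDPOINT"),
                aws_access_key_id=os.getenv("SCW_ACCESS_KEY"),
                aws_secret_access_key=os.getenv("SCW_SECRET_KEY"),
            )
            for document in loader.load():
                chunks = text_splitter.split_text(document.page_content)
                embeds = [embeddings.embed_query(chunk) for chunk in chunks]
                vector_store.add_embeddings(texts=chunks, embeddings=embeds)
            # Record the object so it is not re-processed next time.
            cur.execute("INSERT INTO object_loaded (object_key) VALUES (%s)", (obj["Key"],))
        except Exception as e:
            logger.error(f"An error occurred: {e}")

conn.commit()
```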
#### Why 480 characters?

The chunk size of 480 characters is chosen to fit comfortably within the context size limits of typical embedding models, which often range between 512 and 1024 tokens. Since most models tokenize text into smaller units (tokens) based on words, punctuation, and subwords, the exact number of tokens produced by 480 characters varies with the language and the content. By keeping chunks small, we avoid exceeding the model’s context window, which could lead to truncated embeddings or poor performance during inference.
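As a quick sanity check of this reasoning, the hypothetical snippet below splits a local text file with the same settings and estimates the token count per chunk using the common rule of thumb of roughly four characters per token for English text.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=480, chunk_overlap=20)

# Hypothetical local file, used only to illustrate the chunking behaviour.
with open("sample_document.txt") as f:
    chunks = splitter.split_text(f.read())

longest = max(len(chunk) for chunk in chunks)
print(f"Longest chunk: {longest} characters")   # never exceeds 480
print(f"Estimated tokens: {longest / 4:.0f}")   # roughly 120 tokens, well below 512
```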
This approach ensures that only new or modified documents are loaded into memory and embedded, saving significant computational resources and reducing redundant work.