Dynamic Index Refresh for RAG Application on AWS Cloud with S3 Integration #13881

crman · 2024-06-01T05:58:58Z

Hello,

I am working on creating a Retrieval-Augmented Generation (RAG) application on the AWS Cloud, with a specific requirement to dynamically update the index and vector store upon the arrival of new or modified documents in S3 storage.

Use Case:

Initial Setup: The RAG application index and vector store are created using documents stored in an S3 bucket.
Dynamic Updates: When new documents are added to or existing documents are modified in the S3 bucket, the index and vector store should automatically refresh.
Expected Outcome: New queries should return answers related to the newly inserted or updated documents in near real-time.

Current Progress:

I have successfully implemented this functionality on my local machine using Python's observer and watcher libraries to monitor file changes and update the index and vector store accordingly.
The challenge is to achieve similar functionality in the AWS cloud environment.

Question:

Is there a built-in way in LlamaIndex to refresh the index dynamically based on changes in S3?
If not, what is the recommended approach to implement this on AWS using LlamaIndex?

Objective:
I am looking for guidance on:

How to monitor and handle S3 events to trigger index and vector store updates in an AWS environment.
Any best practices or recommended patterns for integrating LlamaIndex with AWS services for this purpose.

Thank you for your assistance!

@crman · 2024-06-01T06:04:35Z

dosubot[bot]
bot Jun 1, 2024

Hey @crman! 👋 I'm here to help you out with any bugs, questions, or contributions you have. Let's work together to solve this issue!

To achieve dynamic index refresh for your Retrieval-Augmented Generation (RAG) application on AWS with S3 integration using LlamaIndex, you can leverage AWS Lambda functions to handle S3 events and update the index and vector store accordingly. Here’s a step-by-step guide:

Step-by-Step Guide

Set up S3 Bucket and Event Notifications:
- Configure your S3 bucket to send event notifications (e.g., object creation, modification) to an AWS Lambda function.
Create an AWS Lambda Function:
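- This Lambda function is triggered by S3 events and updates the vector store. The example below uses RelytVectorStore; the same pattern applies to other LlamaIndex vector stores.
Lambda Function Code:
- Below is an example of a Lambda function that processes S3 events, embeds each document, and adds it to the RelytVectorStore.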

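import json
import os
from urllib.parse import unquote_plus

import boto3
from llama_index.core.schema import TextNode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.relyt import RelytVectorStore
from pgvecto_rs.sdk import PGVectoRs

# Build the database URL from environment variables (defaults are for local testing only)
URL = "postgresql+psycopg://{username}:{password}@{host}:{port}/{db_name}".format(
    port=os.getenv("RELYT_PORT", "5432"),
    host=os.getenv("RELYT_HOST", "localhost"),
    username=os.getenv("RELYT_USER", "postgres"),
    password=os.getenv("RELYT_PASS", "mysecretpassword"),
    db_name=os.getenv("RELYT_NAME", "postgres"),
)

# Initialize the Relyt client
client = PGVectoRs(
    db_url=URL,
    collection_name="example",
    dimension=1536,  # Using OpenAI's text-embedding-ada-002
)

vector_store = RelytVectorStore(client=client)

# Assumption: embeddings come from OpenAI (reads OPENAI_API_KEY from the environment).
# The model's output size must match `dimension` above.
embed_model = OpenAIEmbedding()

# Create the S3 client once so warm invocations can reuse it
s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Process each record in the S3 event
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])

        # Get the object from S3 (assumes UTF-8 text; other formats need a parser)
        response = s3.get_object(Bucket=bucket, Key=key)
        content = response['Body'].read().decode('utf-8')

        # Create a TextNode (BaseNode is abstract and cannot be instantiated directly)
        node = TextNode(
            id_=key,
            text=content,
            metadata={"ref_doc_id": key}
        )
        # Vector stores expect nodes that already carry an embedding
        node.embedding = embed_model.get_text_embedding(content)

        # Add the node to the vector store
        vector_store.add([node])

    return {
        'statusCode': 200,
        'body': json.dumps('Successfully processed S3 event and updated vector store')
    }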

Deploy the Lambda Function:
- Deploy the Lambda function and set the necessary environment variables for the database connection.
Configure S3 Bucket to Trigger the Lambda Function:
- Set up the S3 bucket to trigger the Lambda function on specific events (e.g., object creation).
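
The trigger configuration in the steps above can also be applied from code rather than the console. The following is a minimal sketch, assuming boto3 is available and the Lambda function already allows S3 to invoke it; the helper names `build_notification_config` and `attach_trigger`, the default event list, and the prefix/suffix filters are illustrative choices, not part of LlamaIndex. Overwriting an existing object is reported by S3 as an `ObjectCreated` event, so that event family covers both new and modified documents.

```python
from typing import Any, Dict, List, Optional


def build_notification_config(
    lambda_arn: str,
    events: Optional[List[str]] = None,
    prefix: str = "",
    suffix: str = "",
) -> Dict[str, Any]:
    """Build the NotificationConfiguration payload for a single Lambda target."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})

    target: Dict[str, Any] = {
        "LambdaFunctionArn": lambda_arn,
        # Default: react to uploads/overwrites and to deletions
        "Events": events or ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
    }
    if rules:
        target["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [target]}


def attach_trigger(bucket: str, lambda_arn: str, **kwargs: Any) -> None:
    """Apply the configuration. This replaces the bucket's existing notification config."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn, **kwargs),
    )
```

For example, `attach_trigger("my-docs-bucket", lambda_arn, suffix=".txt")` would limit the trigger to text files (the bucket name here is a placeholder). Because the call replaces the whole notification configuration, any existing targets on the bucket would need to be merged into the payload first.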

Best Practices

Error Handling and Retries: Implement error handling and retries in your Lambda function to handle transient errors and ensure robustness.
Monitoring and Logging: Use AWS CloudWatch to monitor and log the performance and errors of your Lambda function.
Scalability: Ensure your Lambda function can handle the expected load by configuring appropriate memory and timeout settings.
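
As a rough illustration of the error-handling point, here is a minimal sketch of per-record retries with exponential backoff using only the standard library. The names `with_retries` and `process_records`, the attempt count, and the delays are assumptions chosen for the example; `handle` stands for whatever function indexes a single S3 record.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed, retrying in %.1fs", attempt, delay, exc_info=True)
            sleep(delay)
    raise RuntimeError("attempts must be >= 1")


def process_records(
    records: Iterable[dict],
    handle: Callable[[dict], None],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[dict, Exception]]:
    """Handle every record, collecting failures instead of stopping at the first one."""
    failures: List[Tuple[dict, Exception]] = []
    for record in records:
        try:
            with_retries(lambda: handle(record), sleep=sleep)
        except Exception as exc:
            # logger.exception output ends up in CloudWatch Logs when run in Lambda
            logger.exception("giving up on record")
            failures.append((record, exc))
    return failures
```

One option is to raise from the handler when `process_records` returns any failures, so that Lambda's own retry behavior for asynchronous invocations (and a dead-letter queue or failure destination, if configured) can take over for errors that outlast the in-function retries.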

With this setup, each time a document is added to or overwritten in the S3 bucket, the Lambda function is triggered, reads the document, creates a node from its content, and adds it to the RelytVectorStore ^[1]. Depending on how the store handles repeated IDs, a modified document may need its previous entry deleted before the new one is added.
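
Since the use case covers modified as well as new documents, it may help to route events by type before touching the index. The sketch below is one possible way to do that, not a LlamaIndex feature: `iter_actions` and `dispatch` are hypothetical helpers, and `upsert` and `delete` are callbacks you would supply. It relies on two properties of S3 event records: `eventName` starts with `ObjectCreated:` or `ObjectRemoved:`, and object keys are URL-encoded.

```python
from typing import Callable, Dict, Iterator, Tuple
from urllib.parse import unquote_plus


def iter_actions(event: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (action, bucket, key) per S3 record; action is 'upsert' or 'delete'."""
    for record in event.get("Records", []):
        name = record.get("eventName", "")
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event payload (spaces arrive as '+')
        key = unquote_plus(record["s3"]["object"]["key"])
        if name.startswith("ObjectCreated:"):
            # Covers both new objects and overwrites of existing ones
            yield ("upsert", bucket, key)
        elif name.startswith("ObjectRemoved:"):
            yield ("delete", bucket, key)
        # Other event types are ignored


def dispatch(
    event: Dict,
    upsert: Callable[[str, str], None],
    delete: Callable[[str, str], None],
) -> int:
    """Run the matching callback for each record and return how many were handled."""
    handled = 0
    for action, bucket, key in iter_actions(event):
        (upsert if action == "upsert" else delete)(bucket, key)
        handled += 1
    return handled
```

On the LlamaIndex side, `index.refresh_ref_docs(...)` and `index.delete_ref_doc(...)` are candidate hooks for the two callbacks, though whether they fit depends on how the docstore and vector store are configured.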
