Commit cebe1de: [doc] sqs example (#1317)
---
title: Real-time data transformation pipeline with Amazon S3 bucket, SQS and CocoIndex
description: Build a real-time data transformation pipeline with S3 and CocoIndex.
sidebar_class_name: hidden
slug: /examples/s3_sqs_pipeline
canonicalUrl: '/examples/s3_sqs_pipeline'
sidebar_custom_props:
  image: /img/integrations/sqs/cover.png
  tags: [vector-index, s3, sqs, realtime, etl]
image: /img/integrations/sqs/cover.png
tags: [vector-index, s3, sqs, realtime, etl]
---
import { DocumentationButton } from '../../../src/components/GitHubButton';

![cover](/img/integrations/sqs/cover.png)

[CocoIndex](https://github.com/cocoindex-io/cocoindex) natively supports Amazon S3 as a source and integrates with AWS SQS for real-time, incremental S3 data processing.

## AWS SQS

[Amazon SQS](https://aws.amazon.com/sqs/) (Simple Queue Service) is a message queuing service that provides a reliable, highly scalable hosted queue for storing messages as they travel between applications or microservices. When S3 files change, S3 sends event messages to the SQS queue with details such as the event type, bucket, object key, and timestamp. Messages stay in the queue until they are processed, so no events are lost.
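CocoIndex consumes these messages for you, but it helps to know what they look like. Each message body is a JSON document in the S3 event notification format, with a `Records` array; a minimal sketch of pulling out the relevant fields (the `parse_s3_events` helper and the sample message below are illustrative, not part of CocoIndex):

```python
import json
from urllib.parse import unquote_plus


def parse_s3_events(message_body: str) -> list[tuple[str, str, str]]:
    """Extract (event type, bucket, object key) from an S3 notification message."""
    notification = json.loads(message_body)
    return [
        (
            record["eventName"],
            record["s3"]["bucket"]["name"],
            # Object keys are URL-encoded in the notification (spaces become '+').
            unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in notification.get("Records", [])
    ]


# A trimmed-down example message, shaped like what S3 delivers to the queue:
body = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "eventTime": "2024-01-01T00:00:00.000Z",
        "s3": {
            "bucket": {"name": "my-bucket"},
            "object": {"key": "docs/readme.md"},
        },
    }]
})

print(parse_s3_events(body))  # [('ObjectCreated:Put', 'my-bucket', 'docs/readme.md')]
```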
## Live update out of the box with SQS

CocoIndex provides two modes to run your pipeline: one-time update and live update, both of which leverage incremental processing. With AWS SQS in particular, you can use the live update mode, where CocoIndex continuously monitors and reacts to the events in SQS, updating the target data in real time. This is ideal for use cases where data freshness is critical.

<DocumentationButton url="https://cocoindex.io/docs/tutorials/live_updates" text="Live Update Tutorial" margin="0 0 16px 0" />
## How does it work?

Let's take a look at a simple example of how to build a real-time data transformation pipeline with S3 and CocoIndex. It builds a vector database of text embeddings from markdown files in S3.

### S3 bucket and SQS setup
Please follow the [documentation](https://cocoindex.io/docs/sources/amazons3) to set up the S3 bucket and SQS queue.

<DocumentationButton url="https://cocoindex.io/docs/sources/amazons3" text="Amazon S3 Source" margin="0 0 16px 0" />

#### S3 bucket
- Create an AWS account.
- Configure IAM permissions.
- Configure policies. You'll need at least the `AmazonS3ReadOnlyAccess` policy, and if you want to enable change notifications, you'll also need the `AmazonSQSFullAccess` policy.
![Permission Config](/img/integrations/sqs/permission.png)

#### SQS queue
For real-time change detection, you'll need to create an SQS queue and configure it to receive notifications from your S3 bucket.
Please follow the [documentation](https://cocoindex.io/docs/sources/amazons3#optional-setup-sqs-queue-for-event-notifications) to configure the S3 bucket to send event notifications to the SQS queue.
![SQS Queue](/img/integrations/sqs/sqs.png)

In particular, the SQS queue needs a specific access policy that allows S3 to send messages to it:

```json
{
  ...
  "Statement": [
    ...
    {
      "Sid": "__publish_statement",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Resource": "${SQS_QUEUE_ARN}",
      "Action": "SQS:SendMessage",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "${S3_BUCKET_ARN}"
        }
      }
    }
  ]
}
```
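If you script your infrastructure, the `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` placeholders can be filled in programmatically before attaching the policy to the queue. A minimal sketch, assuming illustrative ARNs (the `sqs_publish_policy` helper is not part of CocoIndex or AWS):

```python
import json


def sqs_publish_policy(sqs_queue_arn: str, s3_bucket_arn: str) -> str:
    """Render the S3-to-SQS publish statement for the given ARNs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "__publish_statement",
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Resource": sqs_queue_arn,
                "Action": "SQS:SendMessage",
                # Only accept messages that originate from our bucket.
                "Condition": {"ArnLike": {"aws:SourceArn": s3_bucket_arn}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(sqs_publish_policy(
    "arn:aws:sqs:us-west-2:123456789:S3ChangeNotifications",
    "arn:aws:s3:::my-bucket",
))
```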
Then you can upload your files to the S3 bucket.
![S3 Bucket](/img/integrations/sqs/s3.png)

## Define Indexing Flow

### Flow Design
![CocoIndex Flow for Text Embedding](/img/integrations/sqs/flow.png)

The flow diagram illustrates how we'll process the documents:
1. Read text files from the Amazon S3 bucket
2. Chunk each document
3. For each chunk, embed it with a text embedding model
4. Store the embeddings in a vector database for retrieval
### AWS File Ingestion

Define the AWS endpoint and the SQS queue name in the `.env` file:

```bash
# Database Configuration
DATABASE_URL=postgresql://localhost:5432/cocoindex

# Amazon S3 Configuration
AMAZON_S3_BUCKET_NAME=your-bucket-name
AMAZON_S3_SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/123456789/S3ChangeNotifications
```
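Tools like python-dotenv will load this file into the process environment for you; if you prefer to stay in the standard library, a minimal sketch of the same idea (the `parse_env` helper is illustrative and only handles simple `KEY=VALUE` lines):

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


env = parse_env("""
# Amazon S3 Configuration
AMAZON_S3_BUCKET_NAME=your-bucket-name
AMAZON_S3_SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/123456789/S3ChangeNotifications
""")
os.environ.update(env)  # make the values visible to the flow via os.environ
```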
Define the indexing flow and ingest from the Amazon S3 SQS queue:

```python
import os

import cocoindex


@cocoindex.flow_def(name="AmazonS3TextEmbedding")
def amazon_s3_text_embedding_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
    bucket_name = os.environ["AMAZON_S3_BUCKET_NAME"]
    prefix = os.environ.get("AMAZON_S3_PREFIX", None)
    sqs_queue_url = os.environ.get("AMAZON_S3_SQS_QUEUE_URL", None)

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AmazonS3(
            bucket_name=bucket_name,
            prefix=prefix,
            included_patterns=["*.md", "*.mdx", "*.txt", "*.docx"],
            binary=False,
            sqs_queue_url=sqs_queue_url,
        )
    )
```

This defines a flow that reads text files from the Amazon S3 bucket.

![AWS File Ingestion](/img/integrations/sqs/ingest.png)
### Rest of the flow
For the rest of the flow, we can follow the tutorial
[Simple Vector Index](https://cocoindex.io/docs/examples/simple_vector_index).
The entire project is available [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/amazon_s3_embedding).
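As a rough sketch of what that tutorial adds, the flow chunks each document, embeds each chunk, and exports the results to Postgres. The function names and parameters below follow the Simple Vector Index example, so treat this as an outline and double-check against the tutorial and the linked project:

```python
# Continues inside amazon_s3_text_embedding_flow, after add_source above.
doc_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # Split each document into chunks.
    doc["chunks"] = doc["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language="markdown", chunk_size=2000, chunk_overlap=500,
    )
    with doc["chunks"].row() as chunk:
        # Embed each chunk with a sentence-transformers model.
        chunk["embedding"] = chunk["text"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"
            )
        )
        doc_embeddings.collect(
            filename=doc["filename"], location=chunk["location"],
            text=chunk["text"], embedding=chunk["embedding"],
        )

# Export the collected rows to a Postgres vector index.
doc_embeddings.export(
    "doc_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename", "location"],
)
```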
## Run the flow with live update
```bash
cocoindex update main.py -L
```

The `-L` option enables live update; see the [documentation](https://cocoindex.io/docs/core/flow_methods#live-update) for more details.
You will then have a continuously running process that updates the vector database with any changes in the S3 bucket.

docs/sidebars.ts (11 additions & 0 deletions) registers the new example under an Integrations category:

```ts
{
  type: 'category',
  label: 'Integrations',
  collapsed: false,
  items: [
    {
      type: 'autogenerated',
      dirName: 'examples/integrations',
    },
  ],
},
```