
Commit c7bcba6

docs(s3): add source docs for S3 (#490)
1 parent 5a9a0c0 commit c7bcba6

File tree

docs/docs/ops/sources.md (1 file changed: 120 additions, 1 deletion)
@@ -1,5 +1,6 @@

---
title: Sources
toc_max_heading_level: 4
description: CocoIndex Built-in Sources
---

@@ -32,6 +33,124 @@ The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file

## AmazonS3

### Setup for Amazon S3

#### Set up AWS accounts

You need to set up an AWS account to own and access Amazon S3. In particular:

* Set up an AWS account at the [AWS homepage](https://aws.amazon.com/), or log in with an existing account.
* AWS recommends that all programmatic access to AWS be done using [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) instead of the root account. You can create an IAM user at the [AWS IAM Console](https://console.aws.amazon.com/iam/home).
* Make sure your IAM user has at least the following permissions in the IAM console:
  * Attach the permission policy `AmazonS3ReadOnlyAccess` for read-only access to Amazon S3.
  * (optional) Attach the permission policy `AmazonSQSFullAccess` to receive notifications from Amazon SQS, if you want to enable change event notifications.
    Note that `AmazonSQSReadOnlyAccess` is not enough, as we need to be able to delete messages from the queue after they're processed.

#### Set up Credentials for AWS SDK

The AWS SDK needs credentials to access Amazon S3.
The easiest way to set up credentials is to run:

```sh
aws configure
```

It will create a credentials file at `~/.aws/credentials` and a config file at `~/.aws/config`.
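
The credentials file is in plain INI format; a minimal sketch of what `aws configure` writes (key values elided):

```ini
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
```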

See the following documents if you need more control:

* [`aws configure`](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Globally configuring AWS SDKs and tools](https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html)

#### Create Amazon S3 buckets

You can create an Amazon S3 bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home) and upload your files to it.

You can also do this with the AWS CLI, using `aws s3 mb` (to create buckets) and `aws s3 cp` (to upload files).
When doing so, make sure your current user also has the permission policy `AmazonS3FullAccess`.
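
For example, a quick sketch with a placeholder bucket name (bucket names must be globally unique):

```sh
# Create a bucket, then upload a local directory into it.
aws s3 mb s3://my-cocoindex-bucket
aws s3 cp ./docs s3://my-cocoindex-bucket/docs --recursive
```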

#### (Optional) Set up an SQS queue for event notifications

You can set up an Amazon Simple Queue Service (Amazon SQS) queue to receive change event notifications from Amazon S3.
It provides a change capture mechanism for your AmazonS3 data source, triggering reprocessing of your Amazon S3 files on any creation, update, or deletion. Please use a dedicated SQS queue for each of your S3 data sources.

This is how to set it up:

* Create an SQS queue with a proper access policy.
  * In the [Amazon SQS Console](https://console.aws.amazon.com/sqs/home), create a queue.
  * Add access policy statements, to make sure Amazon S3 can send messages to the queue:

    ```json
    {
        ...
        "Statement": [
            ...
            {
                "Sid": "__publish_statement",
                "Effect": "Allow",
                "Principal": {
                    "Service": "s3.amazonaws.com"
                },
                "Action": "sqs:SendMessage",
                "Resource": "${SQS_QUEUE_ARN}",
                "Condition": {
                    "ArnLike": {
                        "aws:SourceArn": "${S3_BUCKET_ARN}"
                    }
                }
            }
        ]
    }
    ```

    Here, you need to replace `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` with the actual ARNs of your SQS queue and S3 bucket.
    You can find the ARN of your SQS queue in the existing policy statement (it starts with `arn:aws:sqs:`), and the ARN of your S3 bucket in the S3 console (it starts with `arn:aws:s3:`). A CLI alternative for both steps is sketched after this list.

* In the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home), open your S3 bucket. Under the *Properties* tab, click *Create event notification*.
  * Fill in an arbitrary event name, e.g. `S3ChangeNotifications`.
  * If you want your AmazonS3 data source to expose only a subset of files sharing a prefix, set the same prefix here. Otherwise, leave it empty.
  * Select the following event types: *All object create events*, *All object removal events*.
  * Select *SQS queue* as the destination, and specify the SQS queue you created above.
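
If you prefer the CLI, here is a sketch of the queue-ARN lookup and the notification wiring above, using placeholder names (`my-cocoindex-bucket` and `my-queue`; region and account ID are illustrative):

```sh
# Look up the ARN of an existing queue (it also appears in the queue's access policy).
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names QueueArn

# Route all object create and remove events in the bucket to the queue.
aws s3api put-bucket-notification-configuration \
  --bucket my-cocoindex-bucket \
  --notification-configuration '{
    "QueueConfigurations": [{
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-queue",
      "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }]
  }'
```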

AWS's [guide on configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html#step1-create-sqs-queue-for-notification) provides more details.

### Spec

The spec takes the following fields:

* `bucket_name` (type: `str`, required): Amazon S3 bucket name.
* `prefix` (type: `str`, optional): if provided, only files whose path starts with this prefix will be imported.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
  Any file or directory matching these patterns will be excluded even if it matches `included_patterns`.
  If not specified, no files will be excluded.

  :::info

  `included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for details.

  :::

* `sqs_queue_url` (type: `str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

  :::info

  We will delete messages from the queue after they're processed.
  If there are unrelated messages in the queue (e.g. test messages that SQS sends automatically on queue creation, or messages for a different bucket or for non-included files), we will also delete them upon receipt, to avoid receiving the same irrelevant messages again and again as they're redelivered.

  :::
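
Putting the spec together, here is a minimal sketch of wiring it into a flow. It assumes the source is exposed as `cocoindex.sources.AmazonS3` alongside the other built-in sources; the flow name, bucket, prefix, and queue URL are illustrative placeholders:

```python
import cocoindex

@cocoindex.flow_def(name="S3TextFiles")
def s3_text_files_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Import markdown files under the `docs/` prefix as text (placeholder names throughout).
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.AmazonS3(
            bucket_name="my-cocoindex-bucket",
            prefix="docs/",
            binary=False,
            included_patterns=["*.md"],
            sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
        )
    )
```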

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
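
Each row of this table can then be processed like any other KTable row. A hypothetical continuation of the flow sketched above (`SplitRecursively` and its arguments follow CocoIndex's usual chunking examples and are illustrative here):

```python
    # Continuing inside s3_text_files_flow: process each imported file.
    # `filename` is the row key; `content` holds the file text.
    with data_scope["files"].row() as file:
        file["chunks"] = file["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown",
            chunk_size=1000,
        )
```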

## GoogleDrive

The `GoogleDrive` source imports files from Google Drive.

@@ -59,7 +178,7 @@ The spec takes the following fields:

* `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format.
* `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
- * `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a *change capture mechanism* by polling Google Drive for recent modified files periodically.
+ * `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recently modified files periodically.

:::info