121 changes: 120 additions & 1 deletion docs/docs/ops/sources.md
@@ -1,5 +1,6 @@
---
title: Sources
toc_max_heading_level: 4
description: CocoIndex Built-in Sources
---

@@ -32,6 +33,124 @@ The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file

## AmazonS3

### Setup for Amazon S3

#### Set up AWS accounts

You need to set up an AWS account to own and access Amazon S3. In particular:

* Set up an AWS account from the [AWS homepage](https://aws.amazon.com/), or log in with an existing account.
* AWS recommends that all programmatic access to AWS be done using [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) instead of the root account. You can create an IAM user in the [AWS IAM Console](https://console.aws.amazon.com/iam/home).
* Make sure your IAM user has at least the following permissions (you can attach them in the IAM console):
  * Attach the permission policy `AmazonS3ReadOnlyAccess` for read-only access to Amazon S3.
  * (Optional) Attach the permission policy `AmazonSQSFullAccess` to receive notifications from Amazon SQS, if you want to enable change event notifications.
    Note that `AmazonSQSReadOnlyAccess` is not enough, as we need to be able to delete messages from the queue after they're processed.
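
If you prefer the AWS CLI over the console, the same policies can be attached with `aws iam attach-user-policy`. A minimal sketch, assuming an IAM user named `cocoindex-reader` already exists and your current credentials are allowed to administer IAM:

```sh
# Read-only access to Amazon S3 (required).
aws iam attach-user-policy \
  --user-name cocoindex-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Full access to Amazon SQS (only needed for change event notifications).
aws iam attach-user-policy \
  --user-name cocoindex-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonSQSFullAccess
```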


#### Set up credentials for the AWS SDK

The AWS SDK needs credentials to access Amazon S3.
The easiest way to set them up is to run:

```sh
aws configure
```

This creates a credentials file at `~/.aws/credentials` and a config file at `~/.aws/config`.
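
If you'd rather not use the interactive command, you can also write the two files directly. A minimal sketch with placeholder values (substitute your own access key, secret, and region):

```sh
mkdir -p ~/.aws

# Credentials file (placeholder values).
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF

# Config file with the default region.
cat > ~/.aws/config <<'EOF'
[default]
region = us-east-1
EOF
```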

See the following documents if you need more control:

* [`aws configure`](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Globally configuring AWS SDKs and tools](https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html)


#### Create Amazon S3 buckets

You can create an Amazon S3 bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home) and upload your files to it.

You can also do this with the AWS CLI, using `aws s3 mb` (to create buckets) and `aws s3 cp` (to upload files), as sketched below.
When doing so, make sure your current user also has the `AmazonS3FullAccess` permission policy.
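
A minimal sketch, using a hypothetical bucket name `cocoindex-demo-bucket` and a local `docs/` directory (bucket names are globally unique, so pick your own):

```sh
# Create the bucket.
aws s3 mb s3://cocoindex-demo-bucket --region us-east-1

# Upload a local directory into the bucket.
aws s3 cp ./docs s3://cocoindex-demo-bucket/docs --recursive
```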

#### (Optional) Set up an SQS queue for event notifications

You can set up an Amazon Simple Queue Service (Amazon SQS) queue to receive change event notifications from Amazon S3.
This provides a change capture mechanism for your AmazonS3 data source, triggering reprocessing of your Amazon S3 files on any creation, update, or deletion. Please use a dedicated SQS queue for each of your S3 data sources.

Here's how to set it up:

* Create an SQS queue with a proper access policy.
  * In the [Amazon SQS Console](https://console.aws.amazon.com/sqs/home), create a queue.
  * Add access policy statements to make sure Amazon S3 can send messages to the queue:
    ```json
    {
      ...
      "Statement": [
        ...
        {
          "Sid": "__publish_statement",
          "Effect": "Allow",
          "Principal": {
            "Service": "s3.amazonaws.com"
          },
          "Action": "SQS:SendMessage",
          "Resource": "${SQS_QUEUE_ARN}",
          "Condition": {
            "ArnLike": {
              "aws:SourceArn": "${S3_BUCKET_ARN}"
            }
          }
        }
      ]
    }
    ```

    Here, you need to replace `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` with the actual ARNs of your SQS queue and S3 bucket.
    You can find the ARN of your SQS queue in the existing policy statement (it starts with `arn:aws:sqs:`), and the ARN of your S3 bucket in the S3 console (it starts with `arn:aws:s3:`).

* In the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home), open your S3 bucket. Under the *Properties* tab, click *Create event notification*.
  * Fill in an arbitrary event name, e.g. `S3ChangeNotifications`.
  * If you want your AmazonS3 data source to expose only a subset of files sharing a prefix, set the same prefix here. Otherwise, leave it empty.
  * Select the following event types: *All object create events* and *All object removal events*.
  * Select *SQS queue* as the destination, and specify the SQS queue you created above.

AWS's [guide to configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html#step1-create-sqs-queue-for-notification) provides more details.
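
If you prefer to configure the notification from the command line rather than the console, `aws s3api put-bucket-notification-configuration` achieves the same result. A sketch, reusing the hypothetical bucket name from above and a hypothetical queue ARN (substitute your own values):

```sh
# Describe which events go to which queue (placeholder ARN).
cat > notification.json <<'EOF'
{
  "QueueConfigurations": [
    {
      "Id": "S3ChangeNotifications",
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:cocoindex-s3-events",
      "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }
  ]
}
EOF

# Apply the notification configuration to the bucket.
aws s3api put-bucket-notification-configuration \
  --bucket cocoindex-demo-bucket \
  --notification-configuration file://notification.json
```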

### Spec

The spec takes the following fields:
* `bucket_name` (type: `str`, required): Amazon S3 bucket name.
* `prefix` (type: `str`, optional): if provided, only files with path starting with this prefix will be imported.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
If not specified, all files will be included.
* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
If not specified, no files will be excluded.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for the details.

:::

* `sqs_queue_url` (type: `str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

:::info

We delete messages from the queue after they're processed.
If there are unrelated messages in the queue (e.g. the test message sent automatically when event notifications are first set up, messages for a different bucket, or messages for files not matching the include/exclude patterns), we also delete them upon receipt, to avoid repeatedly receiving the same irrelevant messages after they're redelivered.

:::

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
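
Putting the spec and schema together, below is a minimal sketch of a flow that reads Markdown files from an S3 bucket, following the flow definition pattern used elsewhere in the CocoIndex docs. The bucket name, prefix, and SQS queue URL are placeholders; adjust them to your own setup:

```python
import cocoindex


@cocoindex.flow_def(name="AmazonS3TextFiles")
def amazon_s3_text_files(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
    # Import Markdown files under the "docs/" prefix as text, with change
    # notifications delivered through the SQS queue configured above.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AmazonS3(
            bucket_name="cocoindex-demo-bucket",
            prefix="docs/",
            binary=False,
            included_patterns=["*.md"],
            sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/cocoindex-s3-events",
        )
    )

    # Each row of the resulting KTable exposes the `filename` key and
    # `content` value described in the schema above.
    with data_scope["documents"].row() as doc:
        ...  # transform `doc["content"]`, collect results, etc.
```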


## GoogleDrive

The `GoogleDrive` source imports files from Google Drive.
@@ -59,7 +178,7 @@ The spec takes the following fields:
* `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format.
* `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a *change capture mechanism* by polling Google Drive for recent modified files periodically.
* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.

:::info
