121 changes: 120 additions & 1 deletion docs/docs/ops/sources.md
@@ -1,5 +1,6 @@
---
title: Sources
toc_max_heading_level: 4
description: CocoIndex Built-in Sources
---

@@ -32,6 +33,124 @@ The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file

## AmazonS3

### Setup for Amazon S3

#### Set up AWS accounts

You need to set up an AWS account to own and access Amazon S3. In particular:

* Set up an AWS account from the [AWS homepage](https://aws.amazon.com/), or log in with an existing account.
* AWS recommends that all programmatic access to AWS be done using [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) instead of the root account. You can create an IAM user in the [AWS IAM Console](https://console.aws.amazon.com/iam/home).
* Make sure your IAM user has at least the following permissions (you can attach them in the IAM console):
  * Attach the permission policy `AmazonS3ReadOnlyAccess` for read-only access to Amazon S3.
  * (Optional) Attach the permission policy `AmazonSQSFullAccess` to receive notifications from Amazon SQS, if you want to enable change event notifications.
    Note that `AmazonSQSReadOnlyAccess` is not enough, as we need to be able to delete messages from the queue after they're processed.
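
If you prefer the AWS CLI over the console, the same policies can be attached with `aws iam attach-user-policy`. A minimal sketch, assuming an IAM user named `cocoindex-reader` already exists and your current credentials are allowed to administer IAM:

```sh
# Read-only access to Amazon S3 (required).
aws iam attach-user-policy \
  --user-name cocoindex-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Full access to Amazon SQS (only needed for change event notifications).
aws iam attach-user-policy \
  --user-name cocoindex-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonSQSFullAccess
```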


#### Set up credentials for the AWS SDK

The AWS SDK needs credentials to access Amazon S3.
The easiest way to set them up is to run:

```sh
aws configure
```

This creates a credentials file at `~/.aws/credentials` and a config file at `~/.aws/config`.
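
If you'd rather not use the interactive command, you can also write the two files directly. A minimal sketch with placeholder values (substitute your own access key, secret, and region):

```sh
mkdir -p ~/.aws

# Credentials file (placeholder values).
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF

# Config file with the default region.
cat > ~/.aws/config <<'EOF'
[default]
region = us-east-1
EOF
```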

See the following documents if you need more control:

* [`aws configure`](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Globally configuring AWS SDKs and tools](https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html)


#### Create Amazon S3 buckets

You can create an Amazon S3 bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home) and upload your files to it.

You can also do this with the AWS CLI, using `aws s3 mb` (to create buckets) and `aws s3 cp` (to upload files), as sketched below.
When doing so, make sure your current user also has the `AmazonS3FullAccess` permission policy.
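
A minimal sketch, using a hypothetical bucket name `cocoindex-demo-bucket` and a local `docs/` directory (bucket names are globally unique, so pick your own):

```sh
# Create the bucket.
aws s3 mb s3://cocoindex-demo-bucket --region us-east-1

# Upload a local directory into the bucket.
aws s3 cp ./docs s3://cocoindex-demo-bucket/docs --recursive
```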

#### (Optional) Set up an SQS queue for event notifications

You can set up an Amazon Simple Queue Service (Amazon SQS) queue to receive change event notifications from Amazon S3.
This provides a change capture mechanism for your AmazonS3 data source, triggering reprocessing of your Amazon S3 files on any creation, update, or deletion. Please use a dedicated SQS queue for each of your S3 data sources.

Here's how to set it up:

* Create an SQS queue with a proper access policy.
  * In the [Amazon SQS Console](https://console.aws.amazon.com/sqs/home), create a queue.
  * Add access policy statements to make sure Amazon S3 can send messages to the queue:
    ```json
    {
      ...
      "Statement": [
        ...
        {
          "Sid": "__publish_statement",
          "Effect": "Allow",
          "Principal": {
            "Service": "s3.amazonaws.com"
          },
          "Action": "SQS:SendMessage",
          "Resource": "${SQS_QUEUE_ARN}",
          "Condition": {
            "ArnLike": {
              "aws:SourceArn": "${S3_BUCKET_ARN}"
            }
          }
        }
      ]
    }
    ```

    Here, you need to replace `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` with the actual ARNs of your SQS queue and S3 bucket.
    You can find the ARN of your SQS queue in the existing policy statement (it starts with `arn:aws:sqs:`), and the ARN of your S3 bucket in the S3 console (it starts with `arn:aws:s3:`).

* In the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home), open your S3 bucket. Under the *Properties* tab, click *Create event notification*.
  * Fill in an arbitrary event name, e.g. `S3ChangeNotifications`.
  * If you want your AmazonS3 data source to expose only a subset of files sharing a prefix, set the same prefix here. Otherwise, leave it empty.
  * Select the following event types: *All object create events* and *All object removal events*.
  * Select *SQS queue* as the destination, and specify the SQS queue you created above.

AWS's [guide to configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html#step1-create-sqs-queue-for-notification) provides more details.
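
If you prefer to configure the notification from the command line rather than the console, `aws s3api put-bucket-notification-configuration` achieves the same result. A sketch, reusing the hypothetical bucket name from above and a hypothetical queue ARN (substitute your own values):

```sh
# Describe which events go to which queue (placeholder ARN).
cat > notification.json <<'EOF'
{
  "QueueConfigurations": [
    {
      "Id": "S3ChangeNotifications",
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:cocoindex-s3-events",
      "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }
  ]
}
EOF

# Apply the notification configuration to the bucket.
aws s3api put-bucket-notification-configuration \
  --bucket cocoindex-demo-bucket \
  --notification-configuration file://notification.json
```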

### Spec

The spec takes the following fields:
* `bucket_name` (type: `str`, required): Amazon S3 bucket name.
* `prefix` (type: `str`, optional): if provided, only files with path starting with this prefix will be imported.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
If not specified, all files will be included.
* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
If not specified, no files will be excluded.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for the details.

:::

* `sqs_queue_url` (type: `str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

:::info

We delete messages from the queue after they're processed.
If there are unrelated messages in the queue (e.g. the test message sent automatically when event notifications are first set up, messages for a different bucket, or messages for files not matching the include/exclude patterns), we also delete them upon receipt, to avoid repeatedly receiving the same irrelevant messages after they're redelivered.

:::

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
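
Putting the spec and schema together, below is a minimal sketch of a flow that reads Markdown files from an S3 bucket, following the flow definition pattern used elsewhere in the CocoIndex docs. The bucket name, prefix, and SQS queue URL are placeholders; adjust them to your own setup:

```python
import cocoindex


@cocoindex.flow_def(name="AmazonS3TextFiles")
def amazon_s3_text_files(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
    # Import Markdown files under the "docs/" prefix as text, with change
    # notifications delivered through the SQS queue configured above.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AmazonS3(
            bucket_name="cocoindex-demo-bucket",
            prefix="docs/",
            binary=False,
            included_patterns=["*.md"],
            sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/cocoindex-s3-events",
        )
    )

    # Each row of the resulting KTable exposes the `filename` key and
    # `content` value described in the schema above.
    with data_scope["documents"].row() as doc:
        ...  # transform `doc["content"]`, collect results, etc.
```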


## GoogleDrive

The `GoogleDrive` source imports files from Google Drive.
@@ -59,7 +178,7 @@ The spec takes the following fields:
* `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format.
* `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from.
* `binary` (type: `bool`, optional): whether to read files as binary (instead of text).
* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a *change capture mechanism* by polling Google Drive for recent modified files periodically.
* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.

:::info
