
Commit 6466ee1 (merge of parents 86269ba and fa268d9)

Feature: Enable programmatically passing in api_key besides reading from env

34 files changed: +1025 −61 lines

README.md

Lines changed: 14 additions & 7 deletions

````diff
@@ -22,7 +22,6 @@
 <a href="https://trendshift.io/repositories/13939" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13939" alt="cocoindex-io%2Fcocoindex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </div>
 
-
 Ultra performant data transformation framework for AI, with core engine written in Rust. Support incremental processing and data lineage out-of-box. Exceptional developer velocity. Production-ready at day 0.
 
 ⭐ Drop a star to help us grow!
@@ -60,9 +59,8 @@ CocoIndex makes it effortless to transform data with AI, and keep source data an
 
 </br>
 
-
-
 ## Exceptional velocity
+
 Just declare transformation in dataflow with ~100 lines of python
 
 ```python
@@ -86,25 +84,30 @@ CocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_
 **Particularly**, developers don't explicitly mutate data by creating, updating and deleting. They just need to define transformation/formula for a set of source data.
 
 ## Plug-and-Play Building Blocks
+
 Native builtins for different source, targets and transformations. Standardize interface, make it 1-line code switch between different components - as easy as assembling building blocks.
 
 <p align="center">
 <img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features">
 </p>
 
 ## Data Freshness
+
 CocoIndex keep source data and target in sync effortlessly.
 
 <p align="center">
 <img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700">
 </p>
 
 It has out-of-box support for incremental indexing:
+
 - minimal recomputation on source or logic change.
 - (re-)processing necessary portions; reuse cache when possible
 
-## Quick Start:
+## Quick Start
+
 If you're new to CocoIndex, we recommend checking out
+
 - 📖 [Documentation](https://cocoindex.io/docs)
 - [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)
 - 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
@@ -119,7 +122,6 @@ pip install -U cocoindex
 
 2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.
 
-
 ## Define data flow
 
 Follow [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:
@@ -175,6 +177,7 @@ It defines an index flow like this:
 | [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
 | [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
 | [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
+| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
 | [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
 | [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
 | [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
@@ -191,16 +194,18 @@ It defines an index flow like this:
 | [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 | [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
 
-
 More coming and stay tuned 👀!
 
 ## 📖 Documentation
+
 For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
 
 ## 🤝 Contributing
+
 We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
 
 ## 👥 Community
+
 Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds - whether it's code improvements, documentation updates, issue reports, feature requests, and discussions in our Discord.
 
 Join our community here:
@@ -210,8 +215,10 @@ Join our community here:
 - ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
 - 📜 [Read our blog posts](https://cocoindex.io/blogs/)
 
-## Support us:
+## Support us
+
 We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at GitHub repo [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
 
 ## License
+
 CocoIndex is Apache 2.0 licensed.
````

docs/docs/ai/llm.mdx

Lines changed: 26 additions & 0 deletions

````diff
@@ -28,6 +28,7 @@ We support the following types of LLM APIs:
 | [LiteLLM](#litellm) | `LlmApiType.LITE_LLM` |||
 | [OpenRouter](#openrouter) | `LlmApiType.OPEN_ROUTER` |||
 | [vLLM](#vllm) | `LlmApiType.VLLM` |||
+| [Bedrock](#bedrock) | `LlmApiType.BEDROCK` |||
 
 ## LLM Tasks
 
@@ -440,3 +441,28 @@ cocoindex.LlmSpec(
 
 </TabItem>
 </Tabs>
+
+### Bedrock
+
+To use the Bedrock API, you need to set up AWS credentials. You can do this by setting the following environment variables:
+
+- `AWS_ACCESS_KEY_ID`
+- `AWS_SECRET_ACCESS_KEY`
+- `AWS_SESSION_TOKEN` (optional)
+
+A spec for Bedrock looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.LlmSpec(
+    api_type=cocoindex.LlmApiType.BEDROCK,
+    model="us.anthropic.claude-3-5-haiku-20241022-v1:0",
+)
+```
+
+</TabItem>
+</Tabs>
+
+You can find the full list of models supported by Bedrock [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html).
````

docs/docs/core/flow_methods.mdx

Lines changed: 1 addition & 1 deletion

```diff
@@ -210,7 +210,7 @@ A data source may enable one or multiple *change capture mechanisms*:
 * Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources.
 
 * Specific data sources also provide their specific change capture mechanisms.
-  For example, [`Postgres` source](../sources/#postgres) listens to PostgreSQL's change notifications, [`AmazonS3` source](../sources/#amazons3) watches S3 bucket's change events, and [`GoogleDrive` source](../sources#googledrive) allows polling recent modified files.
+  For example, [`Postgres` source](../sources/postgres) listens to PostgreSQL's change notifications, [`AmazonS3` source](../sources/amazons3) watches S3 bucket's change events, and [`GoogleDrive` source](../sources/googledrive) allows polling recent modified files.
   See documentations for specific data sources.
 
 Change capture mechanisms enable CocoIndex to continuously capture changes from the source data and update the target data accordingly, under live update mode.
```

docs/docs/examples/examples/image_search.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -66,7 +66,7 @@ def image_object_embedding_flow(flow_builder, data_scope):
 
 The `add_source` function sets up a table with fields like `filename` and `content`. Images are automatically re-scanned every minute.
 
-<DocumentationButton url="https://cocoindex.io/docs/ops/sources#localfile" text="LocalFile" />
+<DocumentationButton url="https://cocoindex.io/docs/ops/sources/localfile" text="LocalFile" />
 
 
 ## Process Each Image and Collect the Embedding
```
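As an illustrative aside (not part of this commit's diff): the minute-level re-scan mentioned in that example is what CocoIndex's refresh-interval mechanism provides. A hedged sketch, assuming the flow-definition context from the example (`flow_builder`, `data_scope`, an `img` directory) and a `refresh_interval` parameter on `add_source`:

```python
import datetime

# Inside the flow definition: re-scan the image directory every minute.
data_scope["images"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="img", binary=True),
    refresh_interval=datetime.timedelta(minutes=1),
)
```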

docs/docs/examples/examples/multi_format_index.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -52,7 +52,7 @@ data_scope["documents"] = flow_builder.add_source(
     cocoindex.sources.LocalFile(path="source_files", binary=True)
 )
 ```
-<DocumentationButton url="https://cocoindex.io/docs/ops/sources#localfile" text="LocalFile" margin="0 0 16px 0" />
+<DocumentationButton url="https://cocoindex.io/docs/ops/sources/localfile" text="LocalFile" margin="0 0 16px 0" />
 
 
 ## Convert Files to Pages
````

docs/docs/examples/examples/photo_search.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -65,8 +65,8 @@ def face_recognition_flow(flow_builder, data_scope):
 This creates a table with `filename` and `content` fields. 📂
 
 
-You can connect it to your [S3 Buckets](https://cocoindex.io/docs/ops/sources#amazons3) (with SQS integration, [example](https://cocoindex.io/blogs/s3-incremental-etl))
-or [Azure Blob store](https://cocoindex.io/docs/ops/sources#azureblob).
+You can connect it to your [S3 Buckets](https://cocoindex.io/docs/ops/sources/amazons3) (with SQS integration, [example](https://cocoindex.io/blogs/s3-incremental-etl))
+or [Azure Blob store](https://cocoindex.io/docs/ops/sources/azureblob).
 
 ## Detect and Extract Faces
 
```
docs/docs/examples/examples/postgres_source.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ CocoIndex incrementally sync data from Postgres. When new or updated rows are fo
5959
- `notification` enables change capture based on Postgres LISTEN/NOTIFY. Each change triggers an incremental processing on the specific row immediately.
6060
- Regardless if `notification` is provided or not, CocoIndex still needs to scan the full table to detect changes in some scenarios (e.g. between two `update` invocation), and the `ordinal_column` provides a field that CocoIndex can use to quickly detect which row has changed without reading value columns.
6161

62-
Check [Postgres source](https://cocoindex.io/docs/ops/sources#postgres) for more details.
62+
Check [Postgres source](https://cocoindex.io/docs/ops/sources/postgres) for more details.
6363

6464
If you use the Postgres database hosted by Supabase, please click Connect on your project dashboard and find the URL there. Check [DatabaseConnectionSpec](https://cocoindex.io/docs/core/settings#databaseconnectionspec)
6565
for more details.
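As an illustrative aside (not part of this commit): the `notification` and `ordinal_column` options discussed in that diff might be wired up roughly as below. The spec class `cocoindex.sources.Postgres` follows the naming of the linked Postgres source doc, but the exact field shapes, and especially the notification spec, are assumptions for illustration only.

```python
# Hypothetical sketch; field shapes are assumed, not confirmed by the source.
data_scope["rows"] = flow_builder.add_source(
    cocoindex.sources.Postgres(
        table_name="source_messages",  # placeholder table name
        # Lets CocoIndex detect changed rows without reading value columns:
        ordinal_column="updated_at",
        # LISTEN/NOTIFY-based change capture (assumed spec shape):
        notification=cocoindex.sources.PostgresNotification(),
    )
)
```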

docs/docs/sources/amazons3.md (new file)

Lines changed: 121 additions & 0 deletions

---
title: AmazonS3
toc_max_heading_level: 4
description: CocoIndex AmazonS3 Built-in Sources
---

The `AmazonS3` source imports files from Amazon S3.

### Setup for Amazon S3

#### Setup AWS accounts

You need to set up AWS accounts to own and access Amazon S3. In particular,

* Set up an AWS account from the [AWS homepage](https://aws.amazon.com/) or log in with an existing account.
* AWS recommends that all programmatic access to AWS be done using [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) instead of the root account. You can create an IAM user at the [AWS IAM Console](https://console.aws.amazon.com/iam/home).
* Make sure your IAM user has at least the following permissions in the IAM console:
  * Attach permission policy `AmazonS3ReadOnlyAccess` for read-only access to Amazon S3.
  * (optional) Attach permission policy `AmazonSQSFullAccess` to receive notifications from Amazon SQS, if you want to enable change event notifications.
    Note that `AmazonSQSReadOnlyAccess` is not enough, as we need to be able to delete messages from the queue after they're processed.

#### Setup Credentials for AWS SDK

The AWS SDK needs credentials to access Amazon S3.
The easiest way to set up credentials is to run:

```sh
aws configure
```

It will create a credentials file at `~/.aws/credentials` and config at `~/.aws/config`.

See the following documents if you need more control:

* [`aws configure`](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Globally configuring AWS SDKs and tools](https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html)

#### Create Amazon S3 buckets

You can create an Amazon S3 bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home) and upload your files to it.

It's also doable using the AWS CLI: `aws s3 mb` (to create buckets) and `aws s3 cp` (to upload files).
When doing so, make sure your current user also has the permission policy `AmazonS3FullAccess`.

#### (Optional) Setup SQS queue for event notifications

You can set up an Amazon Simple Queue Service (Amazon SQS) queue to receive change event notifications from Amazon S3.
It provides a change capture mechanism for your AmazonS3 data source, to trigger reprocessing of your AWS S3 files on any creation, update or deletion. Please use a dedicated SQS queue for each of your S3 data sources.

Here is how to set it up:

* Create an SQS queue with a proper access policy.
  * In the [Amazon SQS Console](https://console.aws.amazon.com/sqs/home), create a queue.
  * Add access policy statements, to make sure Amazon S3 can send messages to the queue.

    ```json
    {
        ...
        "Statement": [
            ...
            {
                "Sid": "__publish_statement",
                "Effect": "Allow",
                "Principal": {
                    "Service": "s3.amazonaws.com"
                },
                "Resource": "${SQS_QUEUE_ARN}",
                "Action": "SQS:SendMessage",
                "Condition": {
                    "ArnLike": {
                        "aws:SourceArn": "${S3_BUCKET_ARN}"
                    }
                }
            }
        ]
    }
    ```

    Here, you need to replace `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` with the actual ARNs of your SQS queue and S3 bucket.
    You can find the ARN of your SQS queue in the existing policy statement (it starts with `arn:aws:sqs:`), and the ARN of your S3 bucket in the S3 console (it starts with `arn:aws:s3:`).

* In the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home), open your S3 bucket. Under the *Properties* tab, click *Create event notification*.
  * Fill in an arbitrary event name, e.g. `S3ChangeNotifications`.
  * If you want your AmazonS3 data source to expose a subset of files sharing a prefix, set the same prefix here. Otherwise, leave it empty.
  * Select the following event types: *All object create events*, *All object removal events*.
  * Select *SQS queue* as the destination, and specify the SQS queue you created above.

AWS's [Guide of Configuring a Bucket for Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html#step1-create-sqs-queue-for-notification) provides more details.

### Spec

The spec takes the following fields:

* `bucket_name` (`str`): Amazon S3 bucket name.
* `prefix` (`str`, optional): if provided, only files whose path starts with this prefix will be imported.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
  Any file or directory matching these patterns will be excluded even if it matches `included_patterns`.
  If not specified, no files will be excluded.

  :::info

  `included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for the details.

  :::

* `sqs_queue_url` (`str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

  :::info

  We will delete messages from the queue after they're processed.
  If there are unrelated messages in the queue (e.g. test messages that SQS sends automatically on queue creation, messages for a different bucket, or for non-included files), we will delete the message upon receiving it, to avoid repeatedly receiving irrelevant messages after they're redelivered.

  :::
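As an illustrative aside (not part of the committed file), the include/exclude precedence described above can be approximated in plain Python. This is only a rough model: the actual implementation uses Rust's globset, whose syntax differs from `fnmatch` in corner cases such as `**`.

```python
from fnmatch import fnmatch

def should_include(path: str, included=None, excluded=None) -> bool:
    """Rough model of the source's file filter: excluded patterns take
    precedence, and omitting included_patterns means include everything."""
    if excluded and any(fnmatch(path, p) for p in excluded):
        return False  # excluded even if it also matches an included pattern
    if included:
        return any(fnmatch(path, p) for p in included)
    return True  # no included_patterns: include all files

# Examples mirroring the patterns above:
assert should_include("docs/guide.md", included=["*.md"])
assert not should_include("build/tmp.log", included=["*.md", "*.log"],
                          excluded=["**/*.log"])
assert should_include("anything.bin")  # no filters at all
```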
### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.
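As an illustrative aside (not part of the committed file): wiring this source into a flow might look like the sketch below. It assumes the flow-definition context from the Quick Start (`flow_builder`, `data_scope`) and the spec class name `cocoindex.sources.AmazonS3`; all concrete values are placeholders.

```python
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.AmazonS3(
        bucket_name="my-bucket",          # placeholder bucket name
        prefix="docs/",                   # optional: only import files under docs/
        binary=False,                     # read file content as text
        included_patterns=["*.md", "*.txt"],
        # Optional change capture via the dedicated SQS queue set up above:
        sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
    )
)
```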

docs/docs/sources/azureblob.md (new file)

Lines changed: 80 additions & 0 deletions

---
title: AzureBlob
toc_max_heading_level: 4
description: CocoIndex AzureBlob Built-in Sources
---

The `AzureBlob` source imports files from Azure Blob Storage.

### Setup for Azure Blob Storage

#### Get Started

If you don't have experience with Azure Blob Storage, you can refer to the [quickstart](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal).
These are the actions you need to take:

* Create a storage account in the [Azure Portal](https://portal.azure.com/).
* Create a container in the storage account.
* Upload your files to the container.
* Grant the user / identity / service principal (depending on your authentication method, see below) access to the storage account. At minimum, a **Storage Blob Data Reader** role is needed. See [this doc](https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-data-operations-portal) for reference.

#### Authentication

We support the following authentication methods:

* Shared access signature (SAS) tokens.
  You can generate one from the Azure Portal in the settings for a specific container.
  You need to provide at least *List* and *Read* permissions when generating the SAS token.
  It's a query string in the form of
  `sp=rl&st=2025-07-20T09:33:00Z&se=2025-07-19T09:48:53Z&sv=2024-11-04&sr=c&sig=i3FDjsadfklj3%23adsfkk`.

* Storage account access key. You can find it in the Azure Portal in the settings for a specific storage account.

* Default credential. When none of the above is provided, it will use the default credential.

  This allows you to connect to Azure services without putting any secrets in the code or flow spec.
  It automatically chooses the best authentication method based on your environment:

  * On your local machine: uses your Azure CLI login (`az login`) or environment variables.

    ```sh
    az login
    # Optional: Set a default subscription if you have more than one
    az account set --subscription "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
    ```

  * In Azure (VM, App Service, AKS, etc.): uses the resource's Managed Identity.
  * In automated environments: supports Service Principals via environment variables
    * `AZURE_CLIENT_ID`
    * `AZURE_TENANT_ID`
    * `AZURE_CLIENT_SECRET`

  You can refer to [this doc](https://learn.microsoft.com/en-us/azure/developer/python/sdk/authentication/overview) for more details.

### Spec

The spec takes the following fields:

* `account_name` (`str`): the name of the storage account.
* `container_name` (`str`): the name of the container.
* `prefix` (`str`, optional): if provided, only files whose path starts with this prefix will be imported.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
  Any file or directory matching these patterns will be excluded even if it matches `included_patterns`.
  If not specified, no files will be excluded.
* `sas_token` (`cocoindex.TransientAuthEntryReference[str]`, optional): a SAS token for authentication.
* `account_access_key` (`cocoindex.TransientAuthEntryReference[str]`, optional): an account access key for authentication.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for the details.

:::

### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.
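As an illustrative aside (not part of the committed file): a flow using this source might look like the sketch below, again assuming the Quick Start flow-definition context and the spec class name `cocoindex.sources.AzureBlob`; the account and container names are placeholders.

```python
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.AzureBlob(
        account_name="mystorageaccount",  # placeholder storage account
        container_name="docs",            # placeholder container
        binary=False,                     # read blobs as text
        excluded_patterns=["*.tmp"],
        # Neither sas_token nor account_access_key given: the default
        # credential path (az login / Managed Identity / Service Principal)
        # described above applies.
    )
)
```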
