support s3 as native source #475
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # Postgres database address for cocoindex | ||
| COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex | ||
|
|
||
| # S3 bucket name for CocoIndex source | ||
| S3_BUCKET_NAME=your-s3-bucket-name | ||
|
|
||
| # (Optional) S3 prefix to restrict to a subfolder | ||
| # S3_PREFIX=optional/subfolder/ |
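
The variables above are read by the example at startup. As a minimal sketch of how such values might be validated before use, the snippet below treats `S3_BUCKET_NAME` as required and `S3_PREFIX` as optional. The `S3SourceConfig` type and `read_s3_config` helper are hypothetical names introduced for illustration; they are not part of this PR or of CocoIndex.

```python
import os
from typing import Mapping, NamedTuple, Optional


class S3SourceConfig(NamedTuple):
    """Hypothetical container for the values read from `.env`."""
    bucket_name: str
    prefix: Optional[str]


def read_s3_config(env: Mapping[str, str] = os.environ) -> S3SourceConfig:
    """Read the S3 settings from an environment-like mapping."""
    # S3_BUCKET_NAME is required; fail early with a readable message.
    bucket_name = env.get("S3_BUCKET_NAME", "").strip()
    if not bucket_name:
        raise ValueError("S3_BUCKET_NAME must be set")
    # S3_PREFIX is optional; treat an empty value the same as unset.
    prefix = env.get("S3_PREFIX", "").strip() or None
    return S3SourceConfig(bucket_name=bucket_name, prefix=prefix)


if __name__ == "__main__":
    # Pass an explicit mapping so the demo does not depend on the real environment.
    print(read_s3_config({"S3_BUCKET_NAME": "your-s3-bucket-name"}))
```

Accepting a mapping instead of reading `os.environ` directly keeps the helper easy to test without touching the real environment.
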
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| .env |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| This example builds an embedding index from files stored in an Amazon S3 bucket. | ||
| It continuously updates the index as files are added, updated, or deleted in the source bucket, | ||
| so the index stays in sync with the bucket. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| Before running the example, you need to: | ||
|
|
||
| 1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have it installed. | ||
|
|
||
| 2. Prepare for Amazon S3: | ||
|
|
||
| - **Create an S3 bucket:** | ||
| - Go to the [AWS S3 Console](https://s3.console.aws.amazon.com/s3/home) and click **Create bucket**. Give it a unique name and choose a region. | ||
| - Or, use the AWS CLI: | ||
| ```sh | ||
| aws s3 mb s3://your-s3-bucket-name | ||
| ``` | ||
|
|
||
| - **Upload your files to the bucket:** | ||
| - In the AWS Console, click your bucket, then click **Upload** and add your `.md`, `.txt`, `.docx`, or other files. | ||
| - Or, use the AWS CLI: | ||
| ```sh | ||
| aws s3 cp localfile.txt s3://your-s3-bucket-name/ | ||
| aws s3 cp your-folder/ s3://your-s3-bucket-name/ --recursive | ||
| ``` | ||
|
|
||
| - **Set up AWS credentials:** | ||
| - The easiest way is to run: | ||
| ```sh | ||
| aws configure | ||
| ``` | ||
| Enter your AWS Access Key ID, Secret Access Key, region (e.g., `us-east-1`), and output format (`json`). | ||
| - This creates a credentials file at `~/.aws/credentials` and config at `~/.aws/config`. | ||
| - Alternatively, you can set environment variables: | ||
| ```sh | ||
| export AWS_ACCESS_KEY_ID=your-access-key-id | ||
| export AWS_SECRET_ACCESS_KEY=your-secret-access-key | ||
| export AWS_DEFAULT_REGION=us-east-1 | ||
| ``` | ||
| - If running on AWS EC2 or Lambda, you can use an [IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) with S3 read permissions. | ||
|
|
||
| - **(Optional) Specify a prefix** to restrict to a subfolder in the bucket by setting `S3_PREFIX` in your `.env`. | ||
|
|
||
| See [AWS S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) for more details. | ||
|
|
||
| 3. Create a `.env` file with your S3 bucket name and (optionally) prefix. | ||
| Start by copying `.env.example`, then edit the copy to fill in your bucket name and prefix. | ||
|
|
||
| ```bash | ||
| cp .env.example .env | ||
| $EDITOR .env | ||
| ``` | ||
|
|
||
| ## Run | ||
|
|
||
| Install dependencies: | ||
|
|
||
| ```sh | ||
| uv pip install -r requirements.txt | ||
| ``` | ||
|
|
||
| Setup: | ||
|
|
||
| ```sh | ||
| uv run main.py cocoindex setup | ||
| ``` | ||
|
|
||
| Run: | ||
|
|
||
| ```sh | ||
| uv run main.py | ||
| ``` | ||
|
|
||
| While running, it keeps watching the S3 bucket for changes and updates the index automatically. | ||
| At the same time, it accepts queries from the terminal and searches the up-to-date index. | ||
|
|
||
|
|
||
| ## CocoInsight | ||
| CocoInsight is currently in Early Access (free) 😊 You found us! Here is a quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). | ||
|
|
||
| Run CocoInsight to understand your RAG data pipeline: | ||
|
|
||
| ```sh | ||
| uv run main.py cocoindex server -ci | ||
| ``` | ||
|
|
||
| You can also add a `-L` flag to make the server keep updating the index to reflect source changes at the same time: | ||
|
|
||
| ```sh | ||
| uv run main.py cocoindex server -ci -L | ||
| ``` | ||
|
|
||
| Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). |
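
The README above says the index follows files as they are added, updated, or deleted in the bucket. One way to picture that, purely as an illustration and not a description of how CocoIndex implements its S3 source, is to compare two snapshots of the bucket listing taken at successive refreshes. The sketch assumes each snapshot maps an object key to some version marker (for example an ETag or a last-modified timestamp); `Changes` and `diff_listings` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Changes(NamedTuple):
    """Keys grouped by what happened to them between two snapshots."""
    added: List[str]
    updated: List[str]
    deleted: List[str]


def diff_listings(previous: Dict[str, str], current: Dict[str, str]) -> Changes:
    """Compare two {object key: version marker} snapshots of a bucket."""
    # Keys present only in the newer snapshot are new files.
    added = sorted(k for k in current if k not in previous)
    # Keys present only in the older snapshot were removed.
    deleted = sorted(k for k in previous if k not in current)
    # Keys in both snapshots whose marker changed were modified.
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return Changes(added=added, updated=updated, deleted=deleted)


if __name__ == "__main__":
    before = {"a.md": "v1", "b.txt": "v1"}
    after = {"a.md": "v2", "c.md": "v1"}
    print(diff_listings(before, after))
    # Changes(added=['c.md'], updated=['a.md'], deleted=['b.txt'])
```

Only the keys reported in `added` and `updated` would need re-embedding, and keys in `deleted` would have their rows removed from the index.
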
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| from dotenv import load_dotenv | ||
|
|
||
| import asyncio | ||
| import cocoindex | ||
| import datetime | ||
| import os | ||
|
|
||
| @cocoindex.flow_def(name="S3TextEmbedding") | ||
| def s3_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): | ||
| """ | ||
| Define an example flow that embeds text from S3 into a vector database. | ||
| """ | ||
| bucket_name = os.environ["S3_BUCKET_NAME"] | ||
| prefix = os.environ.get("S3_PREFIX", None) | ||
|
|
||
| data_scope["documents"] = flow_builder.add_source( | ||
| cocoindex.sources.S3( | ||
| bucket_name=bucket_name, | ||
| prefix=prefix, | ||
| included_patterns=["*.md", "*.txt", "*.docx"], | ||
| binary=False), | ||
| refresh_interval=datetime.timedelta(minutes=1)) | ||
|
|
||
| doc_embeddings = data_scope.add_collector() | ||
|
|
||
| with data_scope["documents"].row() as doc: | ||
| doc["chunks"] = doc["content"].transform( | ||
| cocoindex.functions.SplitRecursively(), | ||
| language="markdown", chunk_size=2000, chunk_overlap=500) | ||
|
|
||
| with doc["chunks"].row() as chunk: | ||
| chunk["embedding"] = chunk["text"].transform( | ||
| cocoindex.functions.SentenceTransformerEmbed( | ||
| model="sentence-transformers/all-MiniLM-L6-v2")) | ||
| doc_embeddings.collect(filename=doc["filename"], location=chunk["location"], | ||
| text=chunk["text"], embedding=chunk["embedding"]) | ||
|
|
||
| doc_embeddings.export( | ||
| "doc_embeddings", | ||
| cocoindex.storages.Postgres(), | ||
| primary_key_fields=["filename", "location"], | ||
| vector_indexes=[ | ||
| cocoindex.VectorIndexDef( | ||
| field_name="embedding", | ||
| metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)]) | ||
|
|
||
| query_handler = cocoindex.query.SimpleSemanticsQueryHandler( | ||
| name="SemanticsSearch", | ||
| flow=s3_text_embedding_flow, | ||
| target_name="doc_embeddings", | ||
| query_transform_flow=lambda text: text.transform( | ||
| cocoindex.functions.SentenceTransformerEmbed( | ||
| model="sentence-transformers/all-MiniLM-L6-v2")), | ||
| default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY) | ||
|
|
||
| @cocoindex.main_fn() | ||
| def _run(): | ||
| # Use a `FlowLiveUpdater` to keep the flow data updated. | ||
| with cocoindex.FlowLiveUpdater(s3_text_embedding_flow): | ||
| # Run queries in a loop to demonstrate the query capabilities. | ||
| while True: | ||
| try: | ||
| query = input("Enter search query (or Enter to quit): ") | ||
| if query == '': | ||
| break | ||
| results, _ = query_handler.search(query, 10) | ||
| print("\nSearch results:") | ||
| for result in results: | ||
| print(f"[{result.score:.3f}] {result.data['filename']}") | ||
| print(f" {result.data['text']}") | ||
| print("---") | ||
| print() | ||
| except KeyboardInterrupt: | ||
| break | ||
|
|
||
| if __name__ == "__main__": | ||
| load_dotenv(override=True) | ||
| _run() |
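
The flow above splits each document with `chunk_size=2000` and `chunk_overlap=500`, then ranks results by cosine similarity. The sketch below illustrates what those two parameters and that metric mean, using a plain sliding window over characters. It is a simplified stand-in under that assumption and not the algorithm used by `SplitRecursively`, which this PR does not show; `chunk_with_overlap` and `cosine_similarity` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
    print(len(chunk_with_overlap("x" * 5000, chunk_size=2000, chunk_overlap=500)))
    # 3
```

With the values used in the flow, consecutive windows start 1500 characters apart, so text near a chunk boundary appears in both neighbouring chunks and stays searchable from either one.
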
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| cocoindex | ||
| python-dotenv | ||
| boto3 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,3 @@ | ||
| pub mod google_drive; | ||
| pub mod local_file; | ||
| pub mod s3; |