
Commit 4ae14d1

add Azure Blob Storage source with full authentication support (#736)

- Implement Azure Blob Storage source following S3 pattern
- Support multiple authentication methods with priority:
  * Connection string (highest priority)
  * SAS token (with proper permissions validation)
  * Account key (full access)
  * Anonymous access (public containers)
- Include file pattern filtering (include/exclude glob patterns)

1 parent 306c45e commit 4ae14d1

File tree

10 files changed: +1029 / -12 lines changed


Cargo.lock

Lines changed: 420 additions & 12 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -114,6 +114,10 @@ json5 = "0.4.1"
 aws-config = "1.6.2"
 aws-sdk-s3 = "1.85.0"
 aws-sdk-sqs = "1.67.0"
+azure_core = "0.21.0"
+azure_storage = "0.21.0"
+azure_storage_blobs = "0.21.0"
+time = { version = "0.3", features = ["macros", "serde"] }
 numpy = "0.25.0"
 infer = "0.19.0"
 serde_with = { version = "3.13.0", features = ["base64"] }

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Database Configuration
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Azure Blob Storage Configuration (Public test container - ready to use!)
AZURE_STORAGE_ACCOUNT_NAME=testnamecocoindex1
AZURE_BLOB_CONTAINER_NAME=testpublic1
AZURE_BLOB_PREFIX=

# Authentication Options (choose ONE - in priority order):

# Option 1: Connection String (HIGHEST PRIORITY - recommended for development)
# NOTE: Use ACCOUNT KEY connection string, NOT SAS connection string!
# AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=key1-goes-here;EndpointSuffix=core.windows.net

# Option 2: SAS Token (SECOND PRIORITY - recommended for production)
# AZURE_BLOB_SAS_TOKEN=sp=r&st=2024-01-01T00:00:00Z&se=2025-12-31T23:59:59Z&spr=https&sv=2022-11-02&sr=c&sig=...

# Option 3: Account Key (THIRD PRIORITY)
# AZURE_BLOB_ACCOUNT_KEY=key1-goes-here

# Option 4: Anonymous access (FALLBACK - for public containers only)
# Leave all auth options commented out - testpublic1 container supports this!

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
This example builds an embedding index based on files stored in an Azure Blob Storage container.
It continuously updates the index as files are added / updated / deleted in the source container:
it keeps the index in sync with the Azure Blob Storage container effortlessly.

## Quick Start (Public Test Container)

🚀 **Try it immediately!** We provide a public test container with sample documents:
- **Account:** `testnamecocoindex1`
- **Container:** `testpublic1` (public access)
- **No authentication required!**

Just copy `.env.example` to `.env` and run - it works out of the box with anonymous access.

## Prerequisite

Before running the example, you need to:

1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.

2. Prepare for Azure Blob Storage.
   You'll need an Azure Storage account and container. Supported authentication methods:
   - **Connection String** (recommended for development)
   - **SAS Token** (recommended for production)
   - **Account Key** (full access)
   - **Anonymous access** (for public containers only)

3. Create a `.env` file with your Azure Blob Storage configuration.
   Start from copying the `.env.example`, and then edit it to fill in your credentials.

   ```bash
   cp .env.example .env
   $EDITOR .env
   ```

   Example `.env` file with connection string:

   ```
   # Database Configuration
   COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

   # Azure Blob Storage Configuration
   AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
   AZURE_BLOB_CONTAINER_NAME=mydocuments
   AZURE_BLOB_PREFIX=

   # Authentication (choose one)
   AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=mykey123;EndpointSuffix=core.windows.net
   ```

## Run

Install dependencies:

```sh
pip install -e .
```

Run:

```sh
python main.py
```

While it runs, it keeps observing changes in the Azure Blob Storage container and updates the index automatically.
At the same time, it accepts queries from the terminal and performs search on top of the up-to-date index.

## CocoInsight

CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).

Run CocoInsight to understand your RAG data pipeline:

```sh
cocoindex server -ci main.py
```

You can also add a `-L` flag to make the server keep updating the index to reflect source changes at the same time:

```sh
cocoindex server -ci -L main.py
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).

## Authentication Methods & Troubleshooting

### Connection String (Recommended for Development)

```bash
AZURE_BLOB_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=your-key;EndpointSuffix=core.windows.net"
```

- **Pros:** Easiest to set up, contains all necessary information
- **Cons:** Contains the account key (full access)
- **⚠️ Important:** Use an **Account Key** connection string, NOT a SAS connection string!

### SAS Token (Recommended for Production)

```bash
AZURE_BLOB_SAS_TOKEN="sp=r&st=2024-01-01T00:00:00Z&se=2025-12-31T23:59:59Z&spr=https&sv=2022-11-02&sr=c&sig=..."
```

- **Pros:** Fine-grained permissions, time-limited
- **Cons:** More complex to generate and manage

**SAS Token Requirements** (one way to generate such a token is sketched below):
- `sp=r` - Read permission (required)
- `sp=rl` - Read + List permissions (recommended)
- `sr=c` - Container scope (to access all blobs)
- Valid time range (`st` and `se` in UTC)

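Below is a minimal sketch of generating a container-scoped SAS token that satisfies these requirements with the `azure-storage-blob` Python package. That package is not a dependency of this example, and the account, container, and key values are placeholders; generating the token in the Azure Portal works just as well.

```python
# Minimal sketch (assumes `pip install azure-storage-blob`); all values are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas_token = generate_container_sas(
    account_name="testnamecocoindex1",  # should match AZURE_STORAGE_ACCOUNT_NAME
    container_name="testpublic1",       # should match AZURE_BLOB_CONTAINER_NAME
    account_key="your-account-key",     # used only to sign the token
    permission=ContainerSasPermissions(read=True, list=True),  # i.e. sp=rl
    expiry=datetime.now(timezone.utc) + timedelta(days=7),     # se, in UTC
)
print(sas_token)  # paste the output into AZURE_BLOB_SAS_TOKEN
```
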
### Account Key

```bash
AZURE_BLOB_ACCOUNT_KEY="your-account-key-here"
```

- **Pros:** Simple to use
- **Cons:** Full account access, security risk

### Anonymous Access

Leave all authentication options empty - only works with public containers.

## Common Issues

### 401 Authentication Error

```
Error: server returned error status which will not be retried: 401
Error Code: NoAuthenticationInformation
```

**Solutions** (a quick credential check is sketched below):
1. **Check authentication priority:** Connection String > SAS Token > Account Key > Anonymous
2. **Verify SAS token permissions:** Must include `r` (read) and `l` (list) permissions
3. **Check SAS token expiry:** Ensure `se` (expiry time) is in the future
4. **Verify container scope:** Use `sr=c` for container-level access

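To tell credential problems apart from flow problems, you can list the container directly with the `azure-storage-blob` Python package. This is a minimal sketch, assuming `pip install azure-storage-blob` (the package is not a dependency of this example); it reads the same environment variables the example uses:

```python
# Sanity check: list the container with the same credentials you plan to give CocoIndex.
import os

from azure.storage.blob import ContainerClient

account = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]
container = os.environ["AZURE_BLOB_CONTAINER_NAME"]
# SAS token or account key; None means anonymous access (public containers only).
credential = os.environ.get("AZURE_BLOB_SAS_TOKEN") or os.environ.get("AZURE_BLOB_ACCOUNT_KEY")

client = ContainerClient(
    account_url=f"https://{account}.blob.core.windows.net",
    container_name=container,
    credential=credential,
)
for blob in client.list_blobs():
    print(blob.name)  # if this raises 401/403, fix the credentials before touching the flow
```
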
### Connection String Issues

**⚠️ CRITICAL: Use an Account Key connection string, NOT a SAS connection string!**

**✅ Correct (Account Key connection string):**
```
DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=your-key;EndpointSuffix=core.windows.net
```

**❌ Wrong (SAS connection string - will not work):**
```
BlobEndpoint=https://testnamecocoindex1.blob.core.windows.net/;SharedAccessSignature=sp=r&st=...
```

**Other tips** (a small checker for these is sketched below):
- Don't include quotes in the actual connection string value
- Account name in the connection string should match `AZURE_STORAGE_ACCOUNT_NAME`
- Connection string must contain an `AccountKey=` parameter

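If you want to sanity-check the value before handing it to the flow, here is a small hypothetical helper (not part of this commit) that applies the tips above to the environment variables used by the example:

```python
# Hypothetical helper for the tips above; not part of the example code.
import os


def check_connection_string(conn_str: str, expected_account: str) -> list[str]:
    """Return a list of problems found in an Azure account-key connection string."""
    problems = []
    if conn_str.strip().startswith(('"', "'")):
        problems.append("remove the surrounding quotes from the value")
    # Connection strings are semicolon-separated Key=Value pairs.
    fields = dict(part.split("=", 1) for part in conn_str.split(";") if "=" in part)
    if "AccountKey" not in fields:
        problems.append("missing AccountKey= (this looks like a SAS connection string)")
    if fields.get("AccountName") != expected_account:
        problems.append(
            f"AccountName={fields.get('AccountName')!r} does not match "
            f"AZURE_STORAGE_ACCOUNT_NAME={expected_account!r}"
        )
    return problems


print(check_connection_string(
    os.environ.get("AZURE_BLOB_CONNECTION_STRING", ""),
    os.environ.get("AZURE_STORAGE_ACCOUNT_NAME", ""),
))
```
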
### Container Access Issues

- Verify the container exists and the account has access
- Check `AZURE_BLOB_CONTAINER_NAME` spelling
- For anonymous access, the container must be public

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
from dotenv import load_dotenv
from psycopg_pool import ConnectionPool
import cocoindex
import os
from typing import Any


@cocoindex.transform_flow()
def text_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    """
    Embed the text using a SentenceTransformer model.
    This is a shared logic between indexing and querying, so extract it as a function.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )


@cocoindex.flow_def(name="AzureBlobTextEmbedding")
def azure_blob_text_embedding_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define an example flow that embeds text from Azure Blob Storage into a vector database.
    """
    account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]
    container_name = os.environ["AZURE_BLOB_CONTAINER_NAME"]
    prefix = os.environ.get("AZURE_BLOB_PREFIX", None)

    # Authentication options (in priority order)
    connection_string = os.environ.get("AZURE_BLOB_CONNECTION_STRING", None)
    account_key = os.environ.get("AZURE_BLOB_ACCOUNT_KEY", None)
    sas_token = os.environ.get("AZURE_BLOB_SAS_TOKEN", None)

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AzureBlob(
            account_name=account_name,
            container_name=container_name,
            prefix=prefix,
            included_patterns=["*.md", "*.mdx", "*.txt", "*.docx"],
            binary=False,
            connection_string=connection_string,
            account_key=account_key,
            sas_token=sas_token,
        )
    )

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown",
            chunk_size=2000,
            chunk_overlap=500,
        )

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = text_to_embedding(chunk["text"])
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )


def search(pool: ConnectionPool, query: str, top_k: int = 5) -> list[dict[str, Any]]:
    # Get the table name, for the export target in the azure_blob_text_embedding_flow above.
    table_name = cocoindex.utils.get_target_default_name(
        azure_blob_text_embedding_flow, "doc_embeddings"
    )
    # Evaluate the transform flow defined above with the input query, to get the embedding.
    query_vector = text_to_embedding.eval(query)
    # Run the query and get the results.
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
                """,
                (query_vector, top_k),
            )
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]


def _main() -> None:
    # Initialize the database connection pool.
    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))

    azure_blob_text_embedding_flow.setup()
    with cocoindex.FlowLiveUpdater(azure_blob_text_embedding_flow):
        # Run queries in a loop to demonstrate the query capabilities.
        while True:
            query = input("Enter search query (or Enter to quit): ")
            if query == "":
                break
            # Run the query function with the database connection pool and the query.
            results = search(pool, query)
            print("\nSearch results:")
            for result in results:
                print(f"[{result['score']:.3f}] {result['filename']}")
                print(f"    {result['text']}")
                print("---")
            print()


if __name__ == "__main__":
    load_dotenv()
    cocoindex.init()
    _main()

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
[project]
name = "azure-blob-text-embedding"
version = "0.1.0"
description = "Simple example for cocoindex: build embedding index based on Azure Blob Storage files."
requires-python = ">=3.11"
dependencies = ["cocoindex[embeddings]>=0.1.63", "python-dotenv>=1.0.1"]

[tool.setuptools]
packages = []

python/cocoindex/sources.py

Lines changed: 23 additions & 0 deletions
@@ -43,3 +43,26 @@ class AmazonS3(op.SourceSpec):
     included_patterns: list[str] | None = None
     excluded_patterns: list[str] | None = None
     sqs_queue_url: str | None = None
+
+
+class AzureBlob(op.SourceSpec):
+    """Import data from an Azure Blob Storage container. Supports optional prefix and file filtering by glob patterns.
+
+    Authentication options (in priority order):
+    1. connection_string - Full connection string with credentials
+    2. sas_token - Shared Access Signature token
+    3. account_key - Storage account access key
+    4. None - Anonymous access (for public containers)
+    """
+
+    _op_category = op.OpCategory.SOURCE
+
+    account_name: str
+    container_name: str
+    prefix: str | None = None
+    binary: bool = False
+    included_patterns: list[str] | None = None
+    excluded_patterns: list[str] | None = None
+    account_key: str | None = None
+    sas_token: str | None = None
+    connection_string: str | None = None
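
For illustration (not part of this commit's diff), here is a minimal flow using the new spec with anonymous access against the public test container from the bundled example. All credential fields are left unset, so the source falls back to anonymous access; the flow name is arbitrary.

```python
# Minimal sketch mirroring the bundled example; anonymous access to the public test container.
import cocoindex


@cocoindex.flow_def(name="AzureBlobMinimal")
def azure_blob_minimal_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AzureBlob(
            account_name="testnamecocoindex1",
            container_name="testpublic1",
            included_patterns=["*.md", "*.txt"],
            # No connection_string / sas_token / account_key: falls back to anonymous access.
        )
    )
```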

src/ops/registration.rs

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ fn register_executor_factories(registry: &mut ExecutorFactoryRegistry) -> Result
     sources::local_file::Factory.register(registry)?;
     sources::google_drive::Factory.register(registry)?;
     sources::amazon_s3::Factory.register(registry)?;
+    sources::azure_blob::Factory.register(registry)?;
 
     functions::parse_json::Factory.register(registry)?;
     functions::split_recursively::register(registry)?;
