Commit da6cf16

docs: revise the docs for Azure Blob source
1 parent fbc1e3c commit da6cf16

4 files changed: +89 -124 lines changed

docs/docs/ops/sources.md

Lines changed: 66 additions & 0 deletions
@@ -148,6 +148,72 @@ The spec takes the following fields:
 ### Schema
 
 The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:
+
+* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
+* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.
+
+
+## AzureBlob
+
+The `AzureBlob` source imports files from Azure Blob Storage.
+
+### Setup for Azure Blob Storage
+
+#### Get Started
+
+If you are new to Azure Blob Storage, see the [quickstart](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal).
+You need to take the following actions:
+
+* Create a storage account in the [Azure Portal](https://portal.azure.com/).
+* Create a container in the storage account.
+* Upload your files to the container.
+* Grant the user, managed identity, or service principal you authenticate as (which one depends on your authentication method, see below) access to the container. At minimum, the **Storage Blob Data Reader** role is needed. See [this doc](https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-data-operations-portal) for reference.
+
+#### Authentication
+
+We use Azure’s **Default Credential** system (`DefaultAzureCredential`) for secure and flexible authentication.
+This allows you to connect to Azure services without putting any secrets in the code or flow spec.
+It automatically selects an authentication method based on your environment:
+
+* On your local machine: uses your Azure CLI login (`az login`) or environment variables.
+
+  ```sh
+  az login
+  # Optional: Set a default subscription if you have more than one
+  az account set --subscription "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
+  ```
+* In Azure (VM, App Service, AKS, etc.): uses the resource’s Managed Identity.
+* In automated environments: supports Service Principals via the following environment variables:
+  * `AZURE_CLIENT_ID`
+  * `AZURE_TENANT_ID`
+  * `AZURE_CLIENT_SECRET`
+
+You can refer to [this doc](https://learn.microsoft.com/en-us/azure/developer/python/sdk/authentication/overview) for more details.
+
+### Spec
+
+The spec takes the following fields:
+
+* `account_name` (`str`): the name of the storage account.
+* `container_name` (`str`): the name of the container.
+* `prefix` (`str`, optional): if provided, only files with path starting with this prefix will be imported.
+* `binary` (`bool`, optional): whether to read files as binary (instead of text).
+* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
+  If not specified, all files will be included.
+* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
+  Any file or directory matching these patterns will be excluded even if it matches `included_patterns`.
+  If not specified, no files will be excluded.
+
+:::info
+
+`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for details.
+
+:::
+
+### Schema
+
+The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:
+
 * `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
 * `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.
 
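
As an illustration of the `prefix`, `included_patterns` and `excluded_patterns` fields documented in this file, the filtering they describe could be sketched roughly as follows. This is a simplified model and not CocoIndex's implementation: the helper `is_imported` is hypothetical, it uses Python's `fnmatch` rather than globset (so `**` and other globset-specific syntax behave differently), it ignores directory-level matches, and it assumes patterns are matched against the full filename.

```python
from fnmatch import fnmatchcase


def is_imported(path, prefix=None, included_patterns=None, excluded_patterns=None):
    """Hypothetical helper: would `path` pass the prefix and pattern filters?"""
    # `prefix`: only paths starting with it are considered at all.
    if prefix and not path.startswith(prefix):
        return False
    # `excluded_patterns` win even when the path also matches `included_patterns`.
    if excluded_patterns and any(fnmatchcase(path, p) for p in excluded_patterns):
        return False
    # Without `included_patterns`, every remaining path is included.
    if included_patterns is None:
        return True
    return any(fnmatchcase(path, p) for p in included_patterns)


paths = ["docs/guide.md", "docs/build.log", "notes.txt", "scratch.tmp"]
kept = [
    p
    for p in paths
    if is_imported(
        p,
        included_patterns=["*.md", "*.txt"],
        excluded_patterns=["*.log", "*.tmp"],
    )
]
print(kept)  # prints ['docs/guide.md', 'notes.txt']
```

The exclusion check runs before the inclusion check so that an excluded file is dropped even when it also matches an inclusion pattern, which mirrors the precedence stated in the spec.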

examples/azure_blob_embedding/.env.example

Lines changed: 0 additions & 15 deletions
@@ -5,18 +5,3 @@ COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
 AZURE_STORAGE_ACCOUNT_NAME=testnamecocoindex1
 AZURE_BLOB_CONTAINER_NAME=testpublic1
 AZURE_BLOB_PREFIX=
-
-# Authentication Options (choose ONE - in priority order):
-
-# Option 1: Connection String (HIGHEST PRIORITY - recommended for development)
-# NOTE: Use ACCOUNT KEY connection string, NOT SAS connection string!
-# AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=key1-goes-here;EndpointSuffix=core.windows.net
-
-# Option 2: SAS Token (SECOND PRIORITY - recommended for production)
-# AZURE_BLOB_SAS_TOKEN=sp=r&st=2024-01-01T00:00:00Z&se=2025-12-31T23:59:59Z&spr=https&sv=2022-11-02&sr=c&sig=...
-
-# Option 3: Account Key (THIRD PRIORITY)
-# AZURE_BLOB_ACCOUNT_KEY=key1-goes-here
-
-# Option 4: Anonymous access (FALLBACK - for public containers only)
-# Leave all auth options commented out - testpublic1 container supports this!
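
With the secret-based options removed from `.env.example`, service principal authentication relies on the three environment variables named in `docs/docs/ops/sources.md`. A minimal sketch of a pre-flight check follows; the helper `missing_service_principal_vars` is hypothetical and not part of the example code, and it only inspects the environment without contacting Azure.

```python
import os

# The three variables the documentation lists for service principal auth.
SERVICE_PRINCIPAL_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env=os.environ):
    """Hypothetical helper: names of the variables that are unset or empty."""
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Service principal variables not set:", ", ".join(missing))
    else:
        print("All service principal variables are set.")
```

A missing variable is not necessarily an error: on a local machine or inside Azure, the documentation describes `az login` and Managed Identity as alternatives.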

examples/azure_blob_embedding/README.md

Lines changed: 7 additions & 95 deletions
@@ -2,48 +2,31 @@ This example builds an embedding index based on files stored in an Azure Blob St
 It continuously updates the index as files are added / updated / deleted in the source container:
 it keeps the index in sync with the Azure Blob Storage container effortlessly.
 
-## Quick Start (Public Test Container)
-
-🚀 **Try it immediately!** We provide a public test container with sample documents:
-- **Account:** `testnamecocoindex1`
-- **Container:** `testpublic1` (public access)
-- **No authentication required!**
-
-Just copy `.env.example` to `.env` and run - it works out of the box with anonymous access.
-
 ## Prerequisite
 
 Before running the example, you need to:
 
 1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
 
 2. Prepare for Azure Blob Storage.
-   You'll need an Azure Storage account and container. Supported authentication methods:
-   - **Connection String** (recommended for development)
-   - **SAS Token** (recommended for production)
-   - **Account Key** (full access)
-   - **Anonymous access** (for public containers only)
+   See [Setup for Azure Blob Storage](https://cocoindex.io/docs/ops/sources#setup-for-azure-blob-storage) for more details.
 
-3. Create a `.env` file with your Azure Blob Storage configuration.
-   Start from copying the `.env.example`, and then edit it to fill in your credentials.
+3. Create a `.env` file with your Azure Blob Storage container name and (optionally) prefix.
+   Start by copying `.env.example`, then edit it to fill in your account name, container name and prefix.
 
 ```bash
 cp .env.example .env
 $EDITOR .env
 ```
 
-Example `.env` file with connection string:
+Example `.env` file:
 ```
 # Database Configuration
 DATABASE_URL=postgresql://localhost:5432/cocoindex
 
 # Azure Blob Storage Configuration
-AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
-AZURE_BLOB_CONTAINER_NAME=mydocuments
-AZURE_BLOB_PREFIX=
-
-# Authentication (choose one)
-AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=mykey123;EndpointSuffix=core.windows.net
+AZURE_BLOB_STORAGE_ACCOUNT_NAME=your-account-name
+AZURE_BLOB_STORAGE_CONTAINER_NAME=your-container-name
 ```
 
 ## Run
@@ -60,7 +43,7 @@ Run:
 python main.py
 ```
 
-During running, it will keep observing changes in the Azure Blob Storage container and update the index automatically.
+While running, it keeps observing changes in the Azure Blob Storage container and updates the index automatically.
 At the same time, it accepts queries from the terminal, and performs search on top of the up-to-date index.
 
 
@@ -80,74 +63,3 @@ cocoindex server -ci -L main.py
 ```
 
 Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
-
-## Authentication Methods & Troubleshooting
-
-### Connection String (Recommended for Development)
-```bash
-AZURE_BLOB_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=your-key;EndpointSuffix=core.windows.net"
-```
-- **Pros:** Easiest to set up, contains all necessary information
-- **Cons:** Contains account key (full access)
-- **⚠️ Important:** Use **Account Key** connection string, NOT SAS connection string!
-
-### SAS Token (Recommended for Production)
-```bash
-AZURE_BLOB_SAS_TOKEN="sp=r&st=2024-01-01T00:00:00Z&se=2025-12-31T23:59:59Z&spr=https&sv=2022-11-02&sr=c&sig=..."
-```
-- **Pros:** Fine-grained permissions, time-limited
-- **Cons:** More complex to generate and manage
-
-**SAS Token Requirements:**
-- `sp=r` - Read permission (required)
-- `sp=rl` - Read + List permissions (recommended)
-- `sr=c` - Container scope (to access all blobs)
-- Valid time range (`st` and `se` in UTC)
-
-### Account Key
-```bash
-AZURE_BLOB_ACCOUNT_KEY="your-account-key-here"
-```
-- **Pros:** Simple to use
-- **Cons:** Full account access, security risk
-
-### Anonymous Access
-Leave all authentication options empty - only works with public containers.
-
-## Common Issues
-
-### 401 Authentication Error
-```
-Error: server returned error status which will not be retried: 401
-Error Code: NoAuthenticationInformation
-```
-
-**Solutions:**
-1. **Check authentication priority:** Connection String > SAS Token > Account Key > Anonymous
-2. **Verify SAS token permissions:** Must include `r` (read) and `l` (list) permissions
-3. **Check SAS token expiry:** Ensure `se` (expiry time) is in the future
-4. **Verify container scope:** Use `sr=c` for container-level access
-
-### Connection String Issues
-
-**⚠️ CRITICAL: Use Account Key Connection String, NOT SAS Connection String!**
-
-**✅ Correct (Account Key Connection String):**
-```
-DefaultEndpointsProtocol=https;AccountName=testnamecocoindex1;AccountKey=your-key;EndpointSuffix=core.windows.net
-```
-
-**❌ Wrong (SAS Connection String - will not work):**
-```
-BlobEndpoint=https://testnamecocoindex1.blob.core.windows.net/;SharedAccessSignature=sp=r&st=...
-```
-
-**Other tips:**
-- Don't include quotes in the actual connection string value
-- Account name in connection string should match `AZURE_STORAGE_ACCOUNT_NAME`
-- Connection string must contain `AccountKey=` parameter
-
-### Container Access Issues
-- Verify container exists and account has access
-- Check `AZURE_BLOB_CONTAINER_NAME` spelling
-- For anonymous access, container must be public
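
The README's `.env` step can be pictured with a small sketch of how `KEY=VALUE` lines turn into configuration values. The parser `parse_env` below is hypothetical and deliberately minimal (no quoting or variable expansion); the variable names are taken from `.env.example` in this commit, and which names `main.py` actually reads is not shown in this diff.

```python
def parse_env(text):
    """Hypothetical minimal parser for KEY=VALUE lines; skips blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        values[key.strip()] = value.strip()
    return values


sample = """\
# Azure Blob Storage Configuration
AZURE_STORAGE_ACCOUNT_NAME=testnamecocoindex1
AZURE_BLOB_CONTAINER_NAME=testpublic1
AZURE_BLOB_PREFIX=
"""

config = parse_env(sample)
# An empty prefix is treated as "no prefix".
prefix = config.get("AZURE_BLOB_PREFIX") or None
print(config["AZURE_BLOB_CONTAINER_NAME"], prefix)  # prints testpublic1 None
```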

examples/azure_blob_embedding/main.py

Lines changed: 16 additions & 14 deletions
@@ -101,20 +101,22 @@ def _main() -> None:
     pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
 
     azure_blob_text_embedding_flow.setup()
-    with cocoindex.FlowLiveUpdater(azure_blob_text_embedding_flow):
-        # Run queries in a loop to demonstrate the query capabilities.
-        while True:
-            query = input("Enter search query (or Enter to quit): ")
-            if query == "":
-                break
-            # Run the query function with the database connection pool and the query.
-            results = search(pool, query)
-            print("\nSearch results:")
-            for result in results:
-                print(f"[{result['score']:.3f}] {result['filename']}")
-                print(f" {result['text']}")
-                print("---")
-            print()
+    update_stats = azure_blob_text_embedding_flow.update()
+    print(update_stats)
+
+    # Run queries in a loop to demonstrate the query capabilities.
+    while True:
+        query = input("Enter search query (or Enter to quit): ")
+        if query == "":
+            break
+        # Run the query function with the database connection pool and the query.
+        results = search(pool, query)
+        print("\nSearch results:")
+        for result in results:
+            print(f"[{result['score']:.3f}] {result['filename']}")
+            print(f" {result['text']}")
+            print("---")
+        print()
 
 
 if __name__ == "__main__":
