Skip to content

Commit 8c4fc53

Browse files
authored
feat(azure-auth): add more authentication for azure blob storage (#774)
1 parent bd77c9f commit 8c4fc53

File tree

6 files changed

+123
-35
lines changed

6 files changed

+123
-35
lines changed

docs/docs/core/flow_def.mdx

Lines changed: 39 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -416,22 +416,28 @@ flow_builder.declare(
416416
### Auth Registry
417417

418418
CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.
419+
It's usually used for targets, where key stability is important for backend cleanup.
419420

420-
Operation spec is the default way to configure a persistent backend. But it has the following limitations:
421+
Operation spec is the default way to configure sources, functions and targets. But it has the following limitations:
421422

422423
* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
423-
* Once an operation is removed after flow definition code change, the spec is also gone.
424-
But we still need to be able to drop the backend (e.g. a table) when [setup / drop flow](/docs/core/flow_methods#setup--drop-flow).
424+
* For targets, once an operation is removed after flow definition code change, the spec is also gone.
425+
But we still need to be able to drop the persistent backend (e.g. a table) when [setup / drop flow](/docs/core/flow_methods#setup--drop-flow).
425426

426-
Auth registry is introduced to solve the problems above. It works as follows:
427+
Auth registry is introduced to solve the problems above.
427428

428-
* You can create new **auth entry** by a key and a value.
429-
* You can references the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4j` takes `connection` field in the form of auth entry reference.
429+
430+
#### Auth Entry
431+
432+
An auth entry is an entry in the auth registry with an explicit key.
433+
434+
* You can create new *auth entry* by a key and a value.
435+
* You can reference the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4j` takes `connection` field in the form of auth entry reference.
430436

431437
<Tabs>
432438
<TabItem value="python" label="Python" default>
433439

434-
You can add an auth entry by `cocoindex.add_auth_entry()` function, which returns a `cocoindex.AuthEntryReference`:
440+
You can add an auth entry by `cocoindex.add_auth_entry()` function, which returns a `cocoindex.AuthEntryReference[T]`:
435441

436442
```python
437443
my_graph_conn = cocoindex.add_auth_entry(
@@ -445,7 +451,7 @@ my_graph_conn = cocoindex.add_auth_entry(
445451

446452
Then reference it when building a spec that takes an auth entry:
447453

448-
* You can either reference by the `AuthEntryReference` object directly:
454+
* You can either reference by the `AuthEntryReference[T]` object directly:
449455

450456
```python
451457
demo_collector.export(
@@ -472,3 +478,28 @@ Note that CocoIndex backends use the key of an auth entry to identify the backen
472478

473479
* If a key is no longer referenced in any operation spec, keep it until the next flow setup / drop action,
474480
so that CocoIndex will be able to clean up the backends.
481+
482+
#### Transient Auth Entry
483+
484+
A transient auth entry is an entry in the auth registry with an automatically generated key.
485+
It's usually used for sources and functions, where key stability is not important.
486+
487+
<Tabs>
488+
<TabItem value="python" label="Python" default>
489+
490+
You can create a new *transient auth entry* by `cocoindex.add_transient_auth_entry()` function, which returns a `cocoindex.TransientAuthEntryReference[T]`, and pass it to a source or function spec that takes it, e.g.
491+
492+
```python
493+
flow_builder.add_source(
494+
cocoindex.sources.AzureBlob(
495+
...
496+
sas_token=cocoindex.add_transient_auth_entry("...")
497+
)
498+
)
499+
```
500+
501+
502+
</TabItem>
503+
</Tabs>
504+
505+
Whenever a `TransientAuthEntryReference[T]` is expected, you can also pass a `AuthEntryReference[T]` instead, as `AuthEntryReference[T]` is a subtype of `TransientAuthEntryReference[T]`.

docs/docs/ops/sources.md

Lines changed: 29 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -170,22 +170,33 @@ These are actions you need to take:
170170

171171
#### Authentication
172172

173-
We use Azure’s **Default Credential** system (DefaultAzureCredential) for secure and flexible authentication.
174-
This allows you to connect to Azure services without putting any secrets in the code or flow spec.
175-
It automatically chooses the best authentication method based on your environment:
176-
177-
* On your local machine: uses your Azure CLI login (`az login`) or environment variables.
178-
179-
```sh
180-
az login
181-
# Optional: Set a default subscription if you have more than one
182-
az account set --subscription "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
183-
```
184-
* In Azure (VM, App Service, AKS, etc.): uses the resource’s Managed Identity.
185-
* In automated environments: supports Service Principals via environment variables
186-
* `AZURE_CLIENT_ID`
187-
* `AZURE_TENANT_ID`
188-
* `AZURE_CLIENT_SECRET`
173+
We support the following authentication methods:
174+
175+
* Shared access signature (SAS) tokens.
176+
You can generate it from the Azure Portal in the settings for a specific container.
177+
You need to provide at least *List* and *Read* permissions when generating the SAS token.
178+
It's a query string in the form of
179+
`sp=rl&st=2025-07-20T09:33:00Z&se=2025-07-19T09:48:53Z&sv=2024-11-04&sr=c&sig=i3FDjsadfklj3%23adsfkk`.
180+
181+
* Storage account access key. You can find it in the Azure Portal in the settings for a specific storage account.
182+
183+
* Default credential. When none of the above is provided, it will use the default credential.
184+
185+
This allows you to connect to Azure services without putting any secrets in the code or flow spec.
186+
It automatically chooses the best authentication method based on your environment:
187+
188+
* On your local machine: uses your Azure CLI login (`az login`) or environment variables.
189+
190+
```sh
191+
az login
192+
# Optional: Set a default subscription if you have more than one
193+
az account set --subscription "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
194+
```
195+
* In Azure (VM, App Service, AKS, etc.): uses the resource’s Managed Identity.
196+
* In automated environments: supports Service Principals via environment variables
197+
* `AZURE_CLIENT_ID`
198+
* `AZURE_TENANT_ID`
199+
* `AZURE_CLIENT_SECRET`
189200

190201
You can refer to [this doc](https://learn.microsoft.com/en-us/azure/developer/python/sdk/authentication/overview) for more details.
191202

@@ -202,6 +213,8 @@ The spec takes the following fields:
202213
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
203214
Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
204215
If not specified, no files will be excluded.
216+
* `sas_token` (`cocoindex.TransientAuthEntryReference[str]`, optional): a SAS token for authentication.
217+
* `account_access_key` (`cocoindex.TransientAuthEntryReference[str]`, optional): an account access key for authentication.
205218

206219
:::info
207220

python/cocoindex/__init__.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@
66

77
from . import targets as storages # Deprecated: Use targets instead
88

9-
from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
9+
from .auth_registry import AuthEntryReference, add_auth_entry, add_transient_auth_entry
1010
from .flow import FlowBuilder, DataScope, DataSlice, Flow, transform_flow
1111
from .flow import flow_def
1212
from .flow import EvaluateAndDumpOptions, GeneratedField
@@ -42,6 +42,7 @@
4242
# Auth registry
4343
"AuthEntryReference",
4444
"add_auth_entry",
45+
"add_transient_auth_entry",
4546
"ref_auth_entry",
4647
# Flow
4748
"FlowBuilder",

python/cocoindex/auth_registry.py

Lines changed: 24 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,20 +4,42 @@
44

55
from dataclasses import dataclass
66
from typing import Generic, TypeVar
7+
import threading
78

89
from . import _engine # type: ignore
910
from .convert import dump_engine_object
1011

1112
T = TypeVar("T")
1213

14+
# Global atomic counter for generating unique auth entry keys
15+
_counter_lock = threading.Lock()
16+
_auth_key_counter = 0
17+
18+
19+
def _generate_auth_key() -> str:
20+
"""Generate a unique auth entry key using a global atomic counter."""
21+
global _auth_key_counter # pylint: disable=global-statement
22+
with _counter_lock:
23+
_auth_key_counter += 1
24+
return f"__auth_{_auth_key_counter}"
25+
1326

1427
@dataclass
15-
class AuthEntryReference(Generic[T]):
16-
"""Reference an auth entry by its key."""
28+
class TransientAuthEntryReference(Generic[T]):
29+
"""Reference an auth entry, may or may not have a stable key."""
1730

1831
key: str
1932

2033

34+
class AuthEntryReference(TransientAuthEntryReference[T]):
35+
"""Reference an auth entry, with a key stable across ."""
36+
37+
38+
def add_transient_auth_entry(value: T) -> TransientAuthEntryReference[T]:
39+
"""Add an auth entry to the registry. Returns its reference."""
40+
return add_auth_entry(_generate_auth_key(), value)
41+
42+
2143
def add_auth_entry(key: str, value: T) -> AuthEntryReference[T]:
2244
"""Add an auth entry to the registry. Returns its reference."""
2345
_engine.add_auth_entry(key, dump_engine_object(value))

python/cocoindex/sources.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
"""All builtin sources."""
22

33
from . import op
4+
from .auth_registry import TransientAuthEntryReference
45
import datetime
56

67

@@ -48,6 +49,11 @@ class AmazonS3(op.SourceSpec):
4849
class AzureBlob(op.SourceSpec):
4950
"""
5051
Import data from an Azure Blob Storage container. Supports optional prefix and file filtering by glob patterns.
52+
53+
Authentication mechanisms taken in the following order:
54+
- SAS token (if provided)
55+
- Account access key (if provided)
56+
- Default Azure credential
5157
"""
5258

5359
_op_category = op.OpCategory.SOURCE
@@ -58,3 +64,6 @@ class AzureBlob(op.SourceSpec):
5864
binary: bool = False
5965
included_patterns: list[str] | None = None
6066
excluded_patterns: list[str] | None = None
67+
68+
sas_token: TransientAuthEntryReference[str] | None = None
69+
account_access_key: TransientAuthEntryReference[str] | None = None

src/ops/sources/azure_blob.rs

Lines changed: 20 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,11 @@ pub struct Spec {
1919
binary: bool,
2020
included_patterns: Option<Vec<String>>,
2121
excluded_patterns: Option<Vec<String>>,
22+
23+
/// SAS token for authentication. Takes precedence over account_access_key.
24+
sas_token: Option<AuthEntryReference<String>>,
25+
/// Account access key for authentication. If not provided, will use default Azure credential.
26+
account_access_key: Option<AuthEntryReference<String>>,
2227
}
2328

2429
struct Executor {
@@ -209,15 +214,22 @@ impl SourceFactoryBase for Factory {
209214
async fn build_executor(
210215
self: Arc<Self>,
211216
spec: Spec,
212-
_context: Arc<FlowInstanceContext>,
217+
context: Arc<FlowInstanceContext>,
213218
) -> Result<Box<dyn SourceExecutor>> {
214-
let default_credential = Arc::new(DefaultAzureCredential::create(
215-
TokenCredentialOptions::default(),
216-
)?);
217-
let client = BlobServiceClient::new(
218-
&spec.account_name,
219-
StorageCredentials::token_credential(default_credential),
220-
);
219+
let credential = if let Some(sas_token) = spec.sas_token {
220+
let sas_token = context.auth_registry.get(&sas_token)?;
221+
StorageCredentials::sas_token(sas_token)?
222+
} else if let Some(account_access_key) = spec.account_access_key {
223+
let account_access_key = context.auth_registry.get(&account_access_key)?;
224+
StorageCredentials::access_key(spec.account_name.clone(), account_access_key)
225+
} else {
226+
let default_credential = Arc::new(DefaultAzureCredential::create(
227+
TokenCredentialOptions::default(),
228+
)?);
229+
StorageCredentials::token_credential(default_credential)
230+
};
231+
232+
let client = BlobServiceClient::new(&spec.account_name, credential);
221233
Ok(Box::new(Executor {
222234
client,
223235
container_name: spec.container_name,

0 commit comments

Comments
 (0)