Merged
37 changes: 37 additions & 0 deletions docs/integrations/object-storage/adls.md
@@ -73,3 +73,40 @@ storage_options = {

dt = DeltaTable(abfs_path, storage_options=storage_options)
```

## Configuration Reference

The following table lists the main configuration options that can be passed via the `storage_options` parameter when working with Azure Data Lake Storage. These options correspond to the `AzureConfigKey` enum from the `object_store` crate.

| Configuration Key | Environment Variable | Description |
|-------------------|---------------------|-------------|
| `account_name` | `AZURE_STORAGE_ACCOUNT_NAME` | Azure storage account name |
| `access_key` | `AZURE_STORAGE_ACCOUNT_KEY` | Azure storage account access key |
| `client_id` | `AZURE_STORAGE_CLIENT_ID` | Service principal client ID for Azure AD authentication |
| `client_secret` | `AZURE_STORAGE_CLIENT_SECRET` | Service principal client secret |
| `authority_id` / `tenant_id` | `AZURE_STORAGE_TENANT_ID` | Azure Active Directory tenant ID |
| `sas_key` | `AZURE_STORAGE_SAS_KEY` | Shared Access Signature (SAS) token (must be percent-encoded) |
| `bearer_token` | `AZURE_STORAGE_TOKEN` | Bearer token for authentication |
| `use_emulator` | `AZURE_STORAGE_USE_EMULATOR` | Use Azurite storage emulator (set to `true`) |
| `endpoint` | `AZURE_STORAGE_ENDPOINT` | Custom Azure endpoint URL |
| `use_azure_cli` | `AZURE_STORAGE_USE_AZURE_CLI` | Use credentials from Azure CLI (set to `true`) |
| `federated_token_file` | `AZURE_FEDERATED_TOKEN_FILE` | Path to federated token file for workload identity |
| `container_name` | `AZURE_STORAGE_CONTAINER_NAME` | Container name (alternative to specifying in URL) |
| `msi_endpoint` | `IDENTITY_ENDPOINT` or `AZURE_MSI_ENDPOINT` | Managed Service Identity (MSI) endpoint |
| `object_id` | `AZURE_STORAGE_OBJECT_ID` | Object ID for managed identity |
| `msi_resource_id` | `AZURE_STORAGE_MSI_RESOURCE_ID` | MSI resource ID |
| `skip_signature` | `AZURE_STORAGE_SKIP_SIGNATURE` | Skip request signature (set to `true` for anonymous access) |
| `use_fabric_endpoint` | `AZURE_STORAGE_USE_FABRIC_ENDPOINT` | Use Microsoft Fabric endpoint (set to `true`) |
| `disable_tagging` | `AZURE_STORAGE_DISABLE_TAGGING` | Disable blob tagging (set to `true` if not supported) |
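As a concrete illustration, a `storage_options` mapping for Azure AD service-principal authentication might look like the sketch below. All account, client, and tenant values are placeholders, not real credentials:

```python
# Sketch: storage_options for Azure AD service-principal authentication.
# Every value below is a placeholder -- substitute your own credentials.
storage_options = {
    "account_name": "mystorageaccount",
    "client_id": "00000000-0000-0000-0000-000000000000",
    "client_secret": "<service-principal-secret>",
    "tenant_id": "00000000-0000-0000-0000-000000000000",
}
```

This dict would then be passed as the `storage_options` argument to `DeltaTable` or `write_deltalake`, as in the example earlier in this page.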

### Supported URL Schemes

Delta Lake on Azure ADLS supports the following URL schemes:

- `abfss://container@account.dfs.core.windows.net/path/to/table` - Azure Blob File System Secure (ABFSS)
- `abfs://container@account.dfs.core.windows.net/path/to/table` - Azure Blob File System
- `az://container/path/to/table` - Short form Azure scheme
- `adl://container/path/to/table` - Azure Data Lake scheme

!!! note
For the complete and authoritative list of configuration options, refer to the [object_store AzureConfigKey documentation](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html).
30 changes: 30 additions & 0 deletions docs/integrations/object-storage/gcs.md
@@ -75,3 +75,33 @@ You will need the following permissions in your GCS account:
- `storage.objects.list` (only required if you plan on using the Google Cloud CLI)

For more information, see the [GCP documentation](https://cloud.google.com/storage/docs/uploading-objects)

## Configuration Reference

The following table lists the main configuration options that can be passed via the `storage_options` parameter when working with Google Cloud Storage. These options correspond to the `GoogleConfigKey` enum from the `object_store` crate.

| Configuration Key | Environment Variable | Description |
|-------------------|---------------------|-------------|
| `service_account` | `GOOGLE_SERVICE_ACCOUNT` | Path to service account JSON file for authentication |
| `service_account_key` | `GOOGLE_SERVICE_ACCOUNT_KEY` | Serialized service account key JSON string |
| `application_credentials` | `GOOGLE_APPLICATION_CREDENTIALS` | Path to Application Default Credentials (ADC) file |
| `bucket` / `bucket_name` | `GOOGLE_BUCKET` | GCS bucket name (alternative to specifying in URL) |
| `endpoint` | `GOOGLE_ENDPOINT` | Custom GCS endpoint URL (for testing or GCS-compatible services) |
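For example, a minimal `storage_options` mapping for service-account authentication might look like the following sketch; the file path is a placeholder:

```python
# Sketch: storage_options for GCS service-account authentication.
# The path below is a placeholder -- point it at your own key file.
storage_options = {
    "service_account": "/path/to/service-account.json",
}
```

This dict is passed as the `storage_options` argument to `DeltaTable` or `write_deltalake` for a `gs://` table URL.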

### Supported URL Schemes

Delta Lake on Google Cloud Storage supports the following URL scheme:

- `gs://bucket-name/path/to/table` - Google Cloud Storage URL

### Authentication Methods

GCS authentication can be configured in several ways (in order of precedence):

1. **Service Account Key** - Provide a service account JSON via `service_account_key` or `service_account` (file path)
2. **Application Default Credentials (ADC)** - Set `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to a credentials file
3. **GCloud CLI Credentials** - If authenticated via `gcloud auth application-default login`, credentials will be automatically discovered
4. **Workload Identity** - For applications running on GKE, credentials are automatically provided via workload identity

!!! note
For the complete and authoritative list of configuration options, refer to the [object_store GoogleConfigKey documentation](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html).
25 changes: 23 additions & 2 deletions docs/integrations/object-storage/lakefs.md
@@ -48,6 +48,27 @@ storage_options = {

A deltalake operation can fail midway, leaving behind a LakeFS transaction branch that was created but never deleted. These branches are hidden in the UI, but each of them starts with `delta-tx`.

With the LakeFS Python library you can list these branches and delete stale ones:

```python
import lakefs

# Initialize LakeFS client
client = lakefs.Client(
host="https://mylakefs.example.com",
username="LAKEFSID",
password="LAKEFSKEY",
)

# Access the repository
repo = lakefs.Repository("my-repo", client=client)

# List and delete stale transaction branches
for branch in repo.branches():
if branch.id.startswith("delta-tx"):
print(f"Deleting stale transaction branch: {branch.id}")
branch.delete()
```

!!! tip
You can add additional logic to check the branch creation time and only delete branches older than a certain threshold to avoid removing branches from operations that are still in progress.
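Following the tip above, the age check itself can be sketched with only the standard library; the branch listing still comes from the LakeFS client, so the `(branch_id, creation_timestamp)` pairs and the one-day threshold below are illustrative assumptions:

```python
import time

# Assumed threshold: treat branches older than one day as stale.
STALE_AFTER_SECONDS = 24 * 60 * 60

def stale_tx_branches(branches, now=None):
    """Return ids of delta-tx branches older than the threshold.

    branches: iterable of (branch_id, creation_unix_timestamp) pairs,
    e.g. built from the LakeFS client's branch listing.
    """
    now = time.time() if now is None else now
    return [
        branch_id
        for branch_id, created_at in branches
        if branch_id.startswith("delta-tx") and now - created_at > STALE_AFTER_SECONDS
    ]
```

Only the branches returned by such a filter would then be deleted, which avoids removing branches from operations that are still in progress.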
137 changes: 137 additions & 0 deletions docs/integrations/object-storage/s3-like.md
@@ -74,3 +74,140 @@ storage_options = {
'aws_conditional_put': 'etag'
}
```

## Supported S3-Compatible Services

Delta Lake works with many S3-compatible object storage services. Below are the most commonly used services and their specific configuration requirements.

### Cloudflare R2

Cloudflare R2 supports conditional puts with ETags, which provides safe concurrent writes without requiring DynamoDB.

```python
from deltalake import write_deltalake

storage_options = {
'AWS_ACCESS_KEY_ID': '<R2_ACCESS_KEY_ID>',
'AWS_SECRET_ACCESS_KEY': '<R2_SECRET_ACCESS_KEY>',
'AWS_ENDPOINT_URL': 'https://<account_id>.r2.cloudflarestorage.com',
'AWS_REGION': 'auto',
'aws_conditional_put': 'etag', # Required for safe concurrent writes
}

write_deltalake(
"s3://my-bucket/delta-table",
df,
storage_options=storage_options
)
```

### MinIO

MinIO is an open-source S3-compatible storage server that can be self-hosted.

```python
from deltalake import write_deltalake

storage_options = {
'AWS_ACCESS_KEY_ID': '<MINIO_ACCESS_KEY>',
'AWS_SECRET_ACCESS_KEY': '<MINIO_SECRET_KEY>',
'AWS_ENDPOINT_URL': 'http://localhost:9000', # Or your MinIO server URL
'AWS_REGION': 'us-east-1', # MinIO default region
'allow_http': 'true', # Required for non-HTTPS endpoints
'aws_conditional_put': 'etag', # Required for safe concurrent writes
}

write_deltalake(
"s3://my-bucket/delta-table",
df,
storage_options=storage_options
)
```

### Alibaba Cloud OSS

Alibaba Cloud Object Storage Service (OSS) is S3-compatible and commonly used in China and Asia-Pacific regions.

```python
from deltalake import write_deltalake

storage_options = {
'AWS_ACCESS_KEY_ID': '<OSS_ACCESS_KEY_ID>',
'AWS_SECRET_ACCESS_KEY': '<OSS_SECRET_ACCESS_KEY>',
'AWS_ENDPOINT_URL': 'https://oss-<region>.aliyuncs.com', # e.g., oss-cn-hangzhou.aliyuncs.com
'AWS_REGION': '<region>', # e.g., cn-hangzhou
'aws_virtual_hosted_style_request': 'true',
}

write_deltalake(
"s3://my-bucket/delta-table",
df,
storage_options=storage_options
)
```

!!! note
For Alibaba OSS issues, see [#2361](https://github.com/delta-io/delta-rs/issues/2361) for known compatibility considerations.

### LocalStack

LocalStack provides a local AWS cloud emulator for testing, including S3 emulation.

```python
from deltalake import write_deltalake

storage_options = {
'AWS_ACCESS_KEY_ID': 'test', # LocalStack accepts any credentials
'AWS_SECRET_ACCESS_KEY': 'test',
'AWS_ENDPOINT_URL': 'http://localhost:4566', # Default LocalStack endpoint
'AWS_REGION': 'us-east-1',
'allow_http': 'true', # Required for HTTP endpoint
'aws_skip_signature': 'true', # Optional: skip signing for faster testing
}

write_deltalake(
"s3://test-bucket/delta-table",
df,
storage_options=storage_options
)
```

### Ceph / RADOS Gateway

Ceph's RADOS Gateway provides an S3-compatible interface to Ceph object storage.

```python
from deltalake import write_deltalake

storage_options = {
'AWS_ACCESS_KEY_ID': '<CEPH_ACCESS_KEY>',
'AWS_SECRET_ACCESS_KEY': '<CEPH_SECRET_KEY>',
'AWS_ENDPOINT_URL': 'https://ceph.example.com',
'AWS_REGION': 'default', # Ceph typically uses 'default' or 'us-east-1'
'aws_virtual_hosted_style_request': 'false', # Ceph often requires path-style requests
}

write_deltalake(
"s3://my-bucket/delta-table",
df,
storage_options=storage_options
)
```

!!! warning
Different Ceph configurations may require different settings. Check your Ceph installation's documentation for specific requirements.

## Configuration Key Summary

When working with S3-compatible services, the most important configuration keys are:

| Key | Purpose | Common Values |
|-----|---------|---------------|
| `AWS_ENDPOINT_URL` | Custom S3 endpoint | Service-specific URL |
| `aws_conditional_put` | Safe concurrent writes | `etag` (for services supporting conditional puts) |
| `allow_http` | Allow non-HTTPS connections | `true` for local/testing environments |
| `aws_virtual_hosted_style_request` | URL style for requests | `true` for virtual-hosted, `false` for path-style |
| `aws_skip_signature` | Skip request signing | `true` for testing/unauthenticated access |

!!! tip
Most S3-compatible services work best with `aws_conditional_put: 'etag'` to enable safe concurrent writes without requiring DynamoDB.
50 changes: 50 additions & 0 deletions docs/integrations/object-storage/s3.md
@@ -100,3 +100,53 @@ In DynamoDB, you will need the following permissions:
- dynamodb:Query
- dynamodb:PutItem
- dynamodb:UpdateItem

## Configuration Reference

The following table lists the main configuration options that can be passed via the `storage_options` parameter when working with AWS S3. These options correspond to the `AmazonS3ConfigKey` enum from the `object_store` crate.

| Configuration Key | Environment Variable | Description |
|-------------------|---------------------|-------------|
| `access_key_id` | `AWS_ACCESS_KEY_ID` | AWS access key ID for authentication |
| `secret_access_key` | `AWS_SECRET_ACCESS_KEY` | AWS secret access key for authentication |
| `region` | `AWS_REGION` or `AWS_DEFAULT_REGION` | AWS region where the S3 bucket is located |
| `endpoint` | `AWS_ENDPOINT_URL` | Custom S3 endpoint URL (for S3-compatible services like MinIO, LocalStack) |
| `token` | `AWS_SESSION_TOKEN` | Session token for temporary credentials (STS) |
| `imdsv1_fallback` | `AWS_EC2_METADATA_V1_DISABLED` | Allow falling back to IMDSv1 when querying EC2 instance metadata (the environment variable, when set to `true`, disables the fallback) |
| `virtual_hosted_style_request` | `AWS_VIRTUAL_HOSTED_STYLE_REQUEST` | Use virtual hosted-style requests (`true`) or path-style (`false`) |
| `aws_unsigned_payload` | - | Skip payload signing for requests (set to `true` for unsigned uploads) |
| `aws_checksum_algorithm` | - | Checksum algorithm to use (e.g., `sha256`) |
| `aws_metadata_endpoint` | `AWS_EC2_METADATA_SERVICE_ENDPOINT` | EC2 metadata service endpoint URL |
| `aws_container_credentials_relative_uri` | `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` | URI for container credentials (ECS tasks) |
| `aws_copy_if_not_exists` | - | How to handle copy-if-not-exists operations |
| `aws_conditional_put` | - | Conditional put support mode (e.g., `etag` for S3-compatible stores) |
| `aws_skip_signature` | - | Skip request signing entirely (set to `true` for anonymous access) |
| `aws_disable_tagging` | - | Disable object tagging (set to `true` if not supported) |
| `aws_s3_express` | - | Enable S3 Express One Zone support |
| `aws_request_payer` | - | Request payer setting (for requester-pays buckets) |
| `aws_web_identity_token_file` | `AWS_WEB_IDENTITY_TOKEN_FILE` | Path to web identity token file for OIDC authentication |
| `aws_role_arn` | `AWS_ROLE_ARN` | IAM role ARN to assume via STS AssumeRole |
| `aws_role_session_name` | `AWS_ROLE_SESSION_NAME` | Session name for role assumption |
| `aws_sts_endpoint` | - | Custom STS endpoint URL |

### Delta Lake Specific Options

In addition to the standard S3 configuration options above, delta-rs provides these specific settings:

| Configuration Key | Environment Variable | Description |
|-------------------|---------------------|-------------|
| `AWS_S3_LOCKING_PROVIDER` | `AWS_S3_LOCKING_PROVIDER` | Locking mechanism for safe concurrent writes (set to `dynamodb`) |
| `DELTA_DYNAMO_TABLE_NAME` | `DELTA_DYNAMO_TABLE_NAME` | DynamoDB table name for lock management |
| `AWS_S3_ALLOW_UNSAFE_RENAME` | `AWS_S3_ALLOW_UNSAFE_RENAME` | Allow unsafe writes without locking (set to `true` to skip locking - not recommended for production) |
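Putting the delta-rs specific options together, a `storage_options` mapping that enables DynamoDB-based locking for safe concurrent writes might look like this sketch; the table name `delta_log` is an example value, not a requirement:

```python
# Sketch: storage_options enabling DynamoDB-based locking for concurrent
# S3 writes. The table name "delta_log" is an example -- use the table
# you provisioned for lock management.
storage_options = {
    "AWS_REGION": "us-east-1",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
}
```

This dict is passed as the `storage_options` argument to `write_deltalake` alongside an `s3://` table URL.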

### Supported URL Schemes

Delta Lake on S3 supports the following URL schemes:

- `s3://bucket-name/path/to/table` - Standard S3 URL
- `s3a://bucket-name/path/to/table` - Hadoop S3A scheme
- `https://s3.<region>.amazonaws.com/bucket-name/path/to/table` - HTTPS path-style URL
- `https://bucket-name.s3.<region>.amazonaws.com/path/to/table` - HTTPS virtual hosted-style URL

!!! note
    For the complete and authoritative list of configuration options, refer to the [object_store AmazonS3ConfigKey documentation](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html).
65 changes: 65 additions & 0 deletions docs/integrations/object-storage/special_configuration.md
@@ -17,3 +17,68 @@ Delta-rs provides some additional values to be set in the storage_options for advanced usage:
| backoff_config.max_backoff | The maximum backoff duration |
| backoff_config.base | The multiplier to use for the next backoff duration |
| MOUNT_ALLOW_UNSAFE_RENAME | If set it will allow unsafe renames on mounted storage |
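For instance, the retry backoff behaviour could be tuned via `storage_options` as in the sketch below. The exact accepted value formats are backend-dependent, so the string values shown here are assumptions for illustration:

```python
# Sketch: tuning retry backoff via storage_options.
# The value formats (duration string, multiplier as a string) are
# assumptions -- check the delta-rs documentation for your version.
storage_options = {
    "backoff_config.max_backoff": "30s",  # cap each backoff duration
    "backoff_config.base": "2",           # multiplier for the next backoff
}
```

As with the cloud-specific keys, this dict is passed unchanged as the `storage_options` argument to `DeltaTable` or `write_deltalake`.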

## Common Client Options

The following configuration options from `ClientConfigKey` work across all storage backends (S3, Azure, GCS, etc.) and control HTTP client behavior. These can be passed via `storage_options` regardless of which cloud provider you're using.

| Config Key | Description |
|------------|-------------|
| `allow_http` | Allow HTTP connections (non-HTTPS). Set to `true` for local development or testing with services like MinIO or LocalStack. Default: `false` |
| `allow_invalid_certificates` | Skip TLS certificate validation. **WARNING**: This is dangerous and should only be used for testing. Default: `false` |
| `connect_timeout` | Maximum time to wait for a connection to be established. Accepts duration strings like `30s`, `5m`. |
| `timeout` | Maximum time for a complete request (including retries). Accepts duration strings like `60s`, `10m`. |
| `proxy_url` | HTTP proxy URL to route requests through. Example: `http://proxy.example.com:8080` |
| `proxy_ca_certificate` | PEM-encoded CA certificate for the proxy server (when using HTTPS proxy with custom CA) |
| `proxy_excludes` | Comma-separated list of hosts to exclude from proxying. Example: `localhost,127.0.0.1` |
| `pool_idle_timeout` | Maximum time a connection can remain idle in the connection pool before being closed. Accepts duration strings. |
| `pool_max_idle_per_host` | Maximum number of idle connections to maintain per host. Default varies by backend. |
| `http1_only` | Force HTTP/1.1 only, disable HTTP/2. Set to `true` if the server doesn't support HTTP/2. Default: `false` |
| `http2_only` | Force HTTP/2 only. Set to `true` to require HTTP/2. Default: `false` |
| `http2_keep_alive_interval` | Interval for HTTP/2 keep-alive pings. Accepts duration strings like `30s`. |
| `http2_keep_alive_timeout` | Timeout for HTTP/2 keep-alive ping responses. Accepts duration strings. |
| `http2_keep_alive_while_idle` | Send HTTP/2 keep-alive pings even when no streams are active. Set to `true` to enable. Default: `false` |
| `http2_max_frame_size` | Maximum HTTP/2 frame size in bytes. Must be between 16,384 and 16,777,215. |
| `user_agent` | Custom User-Agent header to send with requests. Example: `my-app/1.0` |
| `default_content_type` | Default Content-Type header for uploads when not otherwise specified. Example: `application/octet-stream` |

### Example Usage

```python
from deltalake import write_deltalake

storage_options = {
# Cloud-specific credentials
'AWS_ACCESS_KEY_ID': 'your-key',
'AWS_SECRET_ACCESS_KEY': 'your-secret',
'AWS_REGION': 'us-east-1',

# Common client options (work with any backend)
'timeout': '120s',
'connect_timeout': '30s',
'pool_max_idle_per_host': '10',
}

write_deltalake("s3://bucket/table", df, storage_options=storage_options)
```

### Development and Testing Options

For local development with services like MinIO, LocalStack, or Azurite:

```python
storage_options = {
'AWS_ACCESS_KEY_ID': 'test',
'AWS_SECRET_ACCESS_KEY': 'test',
'AWS_ENDPOINT_URL': 'http://localhost:4566',
'allow_http': 'true', # Required for non-HTTPS endpoints
'connect_timeout': '10s',
'timeout': '60s',
}
```

!!! warning
Never use `allow_invalid_certificates: true` in production environments. This disables critical security protections.

!!! note
For the complete and authoritative list of client configuration options, refer to the [object_store ClientConfigKey documentation](https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html).