Commit cbdd8d1

Add documentation for Iceberg Catalog Sync (#9682)
1 parent c34b9cc commit cbdd8d1

1 file changed

+186
-3
lines changed

docs/src/integrations/iceberg.md

Lines changed: 186 additions & 3 deletions
@@ -341,9 +341,6 @@ The following table maintenance operations are not supported in the current vers
341341

342342
The following features are planned for future releases:
343343

344-
1. **Catalog Sync**:
345-
- Support for pushing/pulling tables to/from other catalogs
346-
- Integration with AWS Glue and other Iceberg-compatible catalogs
347344
1. **Table Import**:
348345
- Support for importing existing Iceberg tables from other catalogs
349346
- Bulk import capabilities for large-scale migrations
@@ -412,6 +409,192 @@ sequenceDiagram
412409
- [Iceberg Official Documentation](https://iceberg.apache.org/docs/latest/)
413410
- [lakeFS Enterprise Features](../enterprise/index.md)
414411

412+
## Catalog Sync
413+
414+
Catalog Sync allows you to synchronize Iceberg tables between the lakeFS catalog and external catalogs such as AWS Glue Data Catalog or other Iceberg REST-compatible catalogs.
415+
This enables workflows in which some users or tools access the data through an external catalog rather than through lakeFS directly.
416+
417+
### Use Cases
418+
419+
**Collaboration with External Tools**: Share data with users who rely on tools that only support specific catalogs.
420+
For example, data engineers can work with tables in lakeFS while data analysts query the same data through AWS Glue using Athena, all while maintaining isolation and version control.
421+
422+
```mermaid
sequenceDiagram
    actor Alice (Data Engineer)
    actor Bob (Data Analyst)
    participant lakeFS
    participant AWS Glue
    Alice (Data Engineer)->>lakeFS: Update tables on branch 'dev'
    Alice (Data Engineer)->>lakeFS: Push tables to Glue
    lakeFS->>AWS Glue: Register tables in 'dev' database
    Alice (Data Engineer)->>Bob (Data Analyst): Please review: glue/dev
    Bob (Data Analyst)->>AWS Glue: SELECT * FROM dev.table (via Athena)
    Bob (Data Analyst)->>Alice (Data Engineer): Approved!
    Alice (Data Engineer)->>lakeFS: Merge 'dev' into 'main'
```
436+
437+
**Isolated Pipelines**: Run data pipelines using tools that require external catalogs while maintaining isolation through lakeFS branches.
438+
Create a branch, push tables to an external catalog, run your pipeline, pull the changes back, and merge into main.
439+
440+
```mermaid
sequenceDiagram
    participant Orchestrator
    participant lakeFS
    participant External Catalog
    participant Pipeline Tool
    Orchestrator->>lakeFS: Create branch 'etl-2024-01-15'
    Orchestrator->>lakeFS: Push tables to external catalog
    lakeFS->>External Catalog: Register tables
    Orchestrator->>Pipeline Tool: Run ETL pipeline
    Pipeline Tool->>External Catalog: Read/write tables
    Orchestrator->>lakeFS: Pull updated tables
    lakeFS->>External Catalog: Read updated metadata
    Orchestrator->>lakeFS: Merge 'etl-2024-01-15' into 'main'
```
455+
456+
### Configuration
457+
458+
Remote catalogs are configured in your [lakeFS configuration file](../reference/configuration.md).
459+
Each remote catalog requires a unique identifier and type-specific connection properties.
460+
461+
!!! note
462+
If you need help configuring a remote catalog, contact [support](mailto:[email protected]).
463+
464+
#### AWS Glue Data Catalog
465+
466+
Configure an AWS Glue catalog by specifying the region and AWS credentials:
467+
468+
```yaml
iceberg_catalog:
  remotes:
    - id: aws_glue_us_east_1
      type: glue
      glue:
        region: us-east-1
        access_key_id: <your-glue-key>
        secret_access_key: <your-glue-secret>
```
478+
479+
#### Iceberg REST Catalog
480+
481+
Configure a generic Iceberg REST catalog with basic authentication:
482+
483+
```yaml
iceberg_catalog:
  remotes:
    - id: remote_catalog
      type: rest
      rest:
        uri: https://catalog.example.com/iceberg/api
        credential: <client-id>:<client-secret>
```
492+
493+
Or with OAuth2 client credentials flow:
494+
495+
```yaml
iceberg_catalog:
  remotes:
    - id: remote_catalog
      type: rest
      rest:
        uri: https://catalog.example.com/iceberg/api
        credential: <client-id>:<client-secret>
        oauth_server_uri: https://auth.example.com/oauth/tokens
        oauth_scope: catalog:read catalog:write
```
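
The requirement that each remote has a unique identifier can be checked before deploying a configuration. The following is a minimal sketch, assuming the `iceberg_catalog` section has already been parsed into a Python dictionary. The `config` literal and the `check_remotes` helper are illustrative and not part of lakeFS, and the rule that each remote nests its properties under a key named after its `type` is inferred only from the examples above.

```python
from collections import Counter

# Hypothetical in-memory form of the `iceberg_catalog` section after YAML parsing.
config = {
    "iceberg_catalog": {
        "remotes": [
            {"id": "aws_glue_us_east_1", "type": "glue",
             "glue": {"region": "us-east-1"}},
            {"id": "remote_catalog", "type": "rest",
             "rest": {"uri": "https://catalog.example.com/iceberg/api"}},
        ]
    }
}


def check_remotes(config):
    """Return a list of problems found in the remotes list (empty if none)."""
    remotes = config.get("iceberg_catalog", {}).get("remotes", [])
    problems = []

    # Every remote must have a unique identifier.
    counts = Counter(remote.get("id") for remote in remotes)
    for remote_id, count in counts.items():
        if count > 1:
            problems.append(f"duplicate id: {remote_id}")

    # Assumption: connection properties live under a key named after the type.
    for remote in remotes:
        if remote.get("type") not in remote:
            problems.append(
                f"{remote.get('id')}: missing '{remote.get('type')}' properties block"
            )
    return problems


print(check_remotes(config))
```

An empty list means no problems were detected by these two checks; it does not validate credentials or connectivity.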
506+
507+
### Push to remote
508+
509+
Push operations register a table from lakeFS into a remote catalog.
510+
The table's metadata and data remain in lakeFS-managed storage, which serves as the pushed table's location.
511+
512+
**API Endpoint**: `POST /iceberg/remotes/{catalog-id}/push`
513+
514+
**Parameters**:
515+
- `source`: The lakeFS table location (repository, branch/reference, namespace, table name)
516+
- `destination`: The remote catalog location (namespace, table name)
517+
- `force_update`: (optional, default: `false`) Overwrite the table if it already exists in the remote catalog
518+
- `create_namespace`: (optional, default: `false`) Create the namespace in the remote catalog if it doesn't exist
519+
520+
**Example**:
521+
522+
```bash
curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/push" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    },
    "destination": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "create_namespace": true,
    "force_update": true
  }'
```
541+
542+
This example pushes the table `my-repo.main.default.features.image_properties` from lakeFS to the AWS Glue catalog as `main_features.image_properties`.
543+
It creates the remote namespace if needed (since `create_namespace: true`),
544+
and overwrites any existing table in the remote catalog, including any recent updates committed to it (since `force_update: true`).
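
The same request can be assembled programmatically. The sketch below uses only the Python standard library and builds the request without sending it. The `build_sync_request` helper and the fallback token value are illustrative assumptions; the endpoint path, headers, and payload fields mirror the curl example above.

```python
import json
import os
import urllib.request


def build_sync_request(base_url, catalog_id, operation, token, payload):
    """Build (but do not send) a POST request for a push or pull operation."""
    if operation not in ("push", "pull"):
        raise ValueError(f"unknown operation: {operation}")
    url = f"{base_url.rstrip('/')}/iceberg/remotes/{catalog_id}/{operation}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


push_payload = {
    "source": {
        "repository_id": "my-repo",
        "reference_id": "main",
        "namespace": ["default", "features"],
        "table": "image_properties",
    },
    "destination": {
        "namespace": ["main_features"],
        "table": "image_properties",
    },
    "create_namespace": True,
    "force_update": True,
}

# The fallback token is a placeholder so the sketch runs without credentials.
token = os.environ.get("LAKEFS_ACCESS_TOKEN", "example-token")
request = build_sync_request(
    "https://lakefs.example.com", "aws_glue_us_east_1", "push", token, push_payload
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call; that step is left out here because the response format is not described in this section.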
545+
546+
### Pull from remote
547+
548+
Pull operations update a lakeFS table with changes from a remote catalog.
549+
This is useful after external tools have modified a table previously pushed from lakeFS.
550+
551+
**API Endpoint**: `POST /iceberg/remotes/{catalog-id}/pull`
552+
553+
**Parameters**:
554+
- `source`: The remote catalog location (namespace, table name)
555+
- `destination`: The lakeFS table location (repository, branch/reference, namespace, table name)
556+
- `force_update`: (optional, default: `false`) Overwrite the table in lakeFS if metadata conflicts exist
557+
- `create_namespace`: (optional, default: `false`) Create the namespace in lakeFS if it doesn't exist
558+
559+
**Example**:
560+
561+
```bash
curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/pull" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "destination": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    },
    "create_namespace": true,
    "force_update": true
  }'
```
580+
581+
This example pulls changes from the Glue table `main_features.image_properties` back into the lakeFS table `my-repo.main.default.features.image_properties`.
582+
It creates the namespace in lakeFS if needed (since `create_namespace: true`),
583+
and overwrites any existing table in lakeFS, including any recent updates committed to it (since `force_update: true`).
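
Because a pull request mirrors a push request with `source` and `destination` swapped, the pull body for a round trip can be derived from the push body. The following is a minimal sketch; the `pull_payload_from_push` helper is hypothetical, and defaulting `create_namespace` to `false` assumes the lakeFS namespace still exists because the table was originally pushed from it.

```python
import copy


def pull_payload_from_push(push_payload, force_update=False):
    """Derive the pull request body that reverses a given push request body."""
    lakefs_location = push_payload["source"]        # repository, reference, namespace, table
    remote_location = push_payload["destination"]   # namespace, table
    return {
        "source": copy.deepcopy(remote_location),
        "destination": copy.deepcopy(lakefs_location),
        # Assumption: the namespace already exists in lakeFS for a round trip.
        "create_namespace": False,
        "force_update": force_update,
    }


push_payload = {
    "source": {
        "repository_id": "my-repo",
        "reference_id": "main",
        "namespace": ["default", "features"],
        "table": "image_properties",
    },
    "destination": {
        "namespace": ["main_features"],
        "table": "image_properties",
    },
    "create_namespace": True,
    "force_update": True,
}

pull_payload = pull_payload_from_push(push_payload)
print(pull_payload["source"], "->", pull_payload["destination"]["repository_id"])
```

Leaving `force_update` at its default of `false` here is a deliberately conservative choice, so that conflicting metadata in lakeFS is not overwritten unless the caller opts in.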
584+
585+
### Important Notes
586+
587+
1. **Storage Location**: The `metadata.json` file of a pulled table must reside in a storage location that lakeFS can read; otherwise the `pull` operation fails.
588+
589+
2. **Atomicity**: Push and pull operations are not atomic. If an operation fails partway through, manual intervention may be required
590+
(contact [support](mailto:[email protected]) if you need help recovering).
591+
592+
3. **Authentication**: Ensure the credentials configured for remote catalogs have appropriate permissions:
593+
- For AWS Glue: `glue:CreateTable`, `glue:UpdateTable`, `glue:GetTable`, `glue:CreateDatabase` (if `create_namespace` is used)
594+
- For REST catalogs: Appropriate OAuth scopes for table and namespace operations
595+
596+
4. **Namespace Format**: Namespaces are represented as arrays of strings to support nested namespaces (e.g., `["accounting", "tax"]` represents `accounting.tax`).
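
The namespace convention in note 4 can be illustrated with a short conversion sketch. The helper names below are hypothetical. The dotted form is shown only for display purposes; the array form is what the push and pull request bodies use.

```python
def namespace_to_dotted(namespace):
    """Join namespace levels into a dotted display name."""
    return ".".join(namespace)


def dotted_to_namespace(dotted):
    """Split a dotted name into namespace levels (one level per dot)."""
    return dotted.split(".")


nested = ["accounting", "tax"]
print(namespace_to_dotted(nested))
print(dotted_to_namespace("accounting.tax"))
```

Note that the dotted form cannot distinguish a level whose name itself contains a dot from two nested levels, which is one reason an array is the less ambiguous representation.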
597+
415598
---
416599

417600
## Deprecated: Iceberg HadoopCatalog
