Catalog Sync allows you to synchronize Iceberg tables between the lakeFS catalog and external catalogs such as AWS Glue Data Catalog or other Iceberg REST-compatible catalogs.
This enables workflows where some users or tools need to access data through other external catalogs.

### Use Cases

**Collaboration with External Tools**: Share data with users who rely on tools that only support specific catalogs.
For example, data engineers can work with tables in lakeFS while data analysts query the same data through AWS Glue using Athena, all while maintaining isolation and version control.

```mermaid
sequenceDiagram
    actor Alice (Data Engineer)
    actor Bob (Data Analyst)
    participant lakeFS
    participant AWS Glue
    Alice (Data Engineer)->>lakeFS: Update tables on branch 'dev'
    Alice (Data Engineer)->>lakeFS: Push tables to Glue
    lakeFS->>AWS Glue: Register tables in 'dev' database
    Alice (Data Engineer)->>Bob (Data Analyst): Please review: glue/dev
    Bob (Data Analyst)->>AWS Glue: SELECT * FROM dev.table (via Athena)
    Bob (Data Analyst)->>Alice (Data Engineer): Approved!
    Alice (Data Engineer)->>lakeFS: Merge 'dev' into 'main'
```

**Isolated Pipelines**: Run data pipelines using tools that require external catalogs while maintaining isolation through lakeFS branches.
Create a branch, push tables to an external catalog, run your pipeline, pull the changes back, and merge into main.
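
The isolated-pipeline flow described above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not a reference implementation: the branch name `dev`, the remote database name, the table coordinates, and the `run_pipeline` placeholder are all illustrative, and the `push`/`pull` endpoints follow the request shape shown in the examples on this page. By default the script only prints each step; set `DRY_RUN=0` to execute the commands.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values (assumptions); adjust to your environment.
LAKEFS_URL="${LAKEFS_URL:-https://lakefs.example.com}"
REMOTE="${REMOTE:-aws_glue_us_east_1}"
REPO="${REPO:-my-repo}"
BRANCH="${BRANCH:-dev}"

# JSON fragment describing the table on the lakeFS side.
lakefs_side() {
  printf '{"repository_id":"%s","reference_id":"%s","namespace":["default","features"],"table":"image_properties"}' \
    "$REPO" "$BRANCH"
}

# JSON fragment describing the table on the remote catalog side.
remote_side() {
  printf '{"namespace":["%s"],"table":"image_properties"}' "$BRANCH"
}

# POST a push or pull request. $1 = push|pull, $2 = JSON body.
sync_table() {
  curl --fail --silent --show-error -X POST \
    "${LAKEFS_URL}/iceberg/remotes/${REMOTE}/$1" \
    -H "Authorization: Bearer ${LAKEFS_ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$2"
}

# Hypothetical placeholder for the job that runs against the external catalog.
run_pipeline() { :; }

# Print the step label; run the command only when DRY_RUN=0.
step() {
  local label="$1"; shift
  echo "==> ${label}"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

isolated_pipeline() {
  step "create branch" lakectl branch create "lakefs://${REPO}/${BRANCH}" --source "lakefs://${REPO}/main"
  step "push to remote" sync_table push "{\"source\":$(lakefs_side),\"destination\":$(remote_side),\"create_namespace\":true}"
  step "run pipeline" run_pipeline
  step "pull from remote" sync_table pull "{\"source\":$(remote_side),\"destination\":$(lakefs_side)}"
  step "merge into main" lakectl merge "lakefs://${REPO}/${BRANCH}" "lakefs://${REPO}/main"
}
```

Calling `isolated_pipeline` prints the five steps in order; `DRY_RUN=0 isolated_pipeline` runs them and stops at the first failing command. Because push and pull are not atomic, a failure between steps may leave the branch or the remote table in an intermediate state that needs manual review before retrying.
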
- `destination`: The remote catalog location (namespace, table name)
- `force_update`: (optional, default: `false`) Overwrite the table if it already exists in the remote catalog
- `create_namespace`: (optional, default: `false`) Create the namespace in the remote catalog if it doesn't exist

**Example**:

```bash
curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/push" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    },
    "destination": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "create_namespace": true,
    "force_update": true
  }'
```

This example pushes the table `my-repo.main.default.features.image_properties` from lakeFS to the AWS Glue catalog as `main_features.image_properties`.
It creates the remote namespace if needed (since `create_namespace: true`),
and overwrites any existing table, including any recent updates committed to it (since `force_update: true`).

### Pull from remote

Pull operations update a lakeFS table with changes from a remote catalog.
This is useful after external tools have modified a table previously pushed from lakeFS.
- `force_update`: (optional, default: `false`) Overwrite the table in lakeFS if metadata conflicts exist
- `create_namespace`: (optional, default: `false`) Create the namespace in lakeFS if it doesn't exist

**Example**:

```bash
curl -X POST "https://lakefs.example.com/iceberg/remotes/aws_glue_us_east_1/pull" \
  -H "Authorization: Bearer $LAKEFS_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "namespace": ["main_features"],
      "table": "image_properties"
    },
    "destination": {
      "repository_id": "my-repo",
      "reference_id": "main",
      "namespace": ["default", "features"],
      "table": "image_properties"
    },
    "create_namespace": true,
    "force_update": true
  }'
```

This example pulls changes from the Glue table `main_features.image_properties` back into the lakeFS table `my-repo.main.default.features.image_properties`.
It creates the namespace in lakeFS if needed (since `create_namespace: true`),
and overwrites any existing table, including any recent updates committed to it (since `force_update: true`).

### Important Notes

1. **Storage Location**: A pulled table's `metadata.json` file must reside in a storage location to which lakeFS has read access, or the `pull` operation will fail.

2. **Atomicity**: Push and pull operations are not atomic. If an operation fails partway through, manual intervention may be required
   (contact [support](mailto:[email protected]) if needed).

3. **Authentication**: Ensure the credentials configured for remote catalogs have appropriate permissions:
   - For AWS Glue: `glue:CreateTable`, `glue:UpdateTable`, `glue:GetTable`, `glue:CreateDatabase` (if `create_namespace` is used)
   - For REST catalogs: Appropriate OAuth scopes for table and namespace operations

4. **Namespace Format**: Namespaces are represented as arrays of strings to support nested namespaces (e.g., `["accounting", "tax"]` represents `accounting.tax`).
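
To illustrate the namespace format in note 4, the following sketch converts a dotted namespace into the JSON array form used in request bodies. The helper name `ns_to_json` is hypothetical and not part of lakeFS; it assumes namespace levels contain no dots, double quotes, or backslashes.

```bash
#!/usr/bin/env bash

# Hypothetical helper: convert a dotted namespace into a JSON array of strings.
# Assumes no level contains a dot, a double quote, or a backslash.
ns_to_json() {
  local IFS='.'          # split the argument on dots only
  local -a parts
  read -r -a parts <<< "$1"
  local out="" p
  for p in "${parts[@]}"; do
    # Prepend a comma for every element after the first.
    out+="${out:+,}\"${p}\""
  done
  printf '[%s]\n' "$out"
}

ns_to_json "accounting.tax"   # prints ["accounting","tax"]
```

The output can be substituted into the `namespace` field of a `source` or `destination` object when building a push or pull request body.
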