
Commit 797982c

Docs: revise structured data compute pages (#9624)
1 parent 10b382f commit 797982c

File tree: 4 files changed, +146 -25 lines

docs/mkdocs.yml

Lines changed: 2 additions & 2 deletions
@@ -191,15 +191,15 @@ nav:
     - Data Processing & Compute:
       - Apache Spark: integrations/spark.md
       - Apache Iceberg: integrations/iceberg.md
-      - Apache Hive: integrations/hive.md
       - Presto / Trino: integrations/presto_trino.md
+      - DuckDB: integrations/duckdb.md
       - Dremio: integrations/dremio.md
       - Databricks: integrations/databricks.md
       - Cloudera: integrations/cloudera.md
-      - DuckDB: integrations/duckdb.md
       - Delta Lake: integrations/delta.md
       - Amazon Athena: integrations/athena.md
       - Apache Kafka: integrations/kafka.md
+      - Apache Hive: integrations/hive.md
     - ML & AI:
       - Amazon SageMaker: integrations/sagemaker.md
       - Vertex AI: integrations/vertex_ai.md

docs/src/integrations/duckdb.md

Lines changed: 36 additions & 0 deletions
@@ -10,6 +10,42 @@ description: How to use lakeFS with DuckDB, an open-source SQL OLAP database man
 
 ## Accessing lakeFS from DuckDB
 
+The recommended way to access lakeFS from DuckDB is to use the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog).
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+This allows you to query and update Iceberg tables using a standards-compliant catalog built into lakeFS Enterprise. In this mode, lakeFS stays completely outside the data path: the data itself is read and written by DuckDB directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+```sql
+LOAD iceberg;
+LOAD httpfs;
+
+CREATE SECRET lakefs_credentials (
+    TYPE ICEBERG,
+    CLIENT_ID 'AKIAIOSFODNN7EXAMPLE',
+    CLIENT_SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
+    OAUTH2_SERVER_URI 'https://lakefs.example.com/iceberg/api/v1/oauth/tokens'
+);
+
+ATTACH '' AS main_branch (
+    TYPE iceberg,
+    SECRET lakefs_credentials,
+    ENDPOINT 'https://lakefs.example.com/iceberg/relative_to/my-repo.main/api'
+);
+
+USE main_branch.inventory;
+SELECT * FROM books;
+```
+
+!!! tip
+    To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.
+
+## Using DuckDB with the S3 Gateway
+
+Using the S3 Gateway allows reading and writing data to lakeFS from DuckDB in any format DuckDB supports (not just Iceberg tables). While flexible, this approach puts lakeFS in the data path, which can be less efficient than the Iceberg REST Catalog approach, since all data operations are proxied through the lakeFS server.
+
 ### Configuration
 
 Querying data in lakeFS from DuckDB is similar to querying data in S3 from DuckDB. It is done using the [httpfs extension](https://duckdb.org/docs/stable/core_extensions/httpfs/overview){:target="_blank"} connecting to the [S3 Gateway that lakeFS provides](../understand/architecture.md#s3-gateway).
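
For illustration (not part of the diff above), a minimal sketch of that httpfs + S3 Gateway approach; the endpoint, keys, repository, branch, and file path are placeholders reused from the examples above:

```sql
-- Illustrative sketch only: endpoint, keys, repository, and paths are placeholders.
INSTALL httpfs;
LOAD httpfs;

CREATE SECRET lakefs_gateway (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    ENDPOINT 'lakefs.example.com',   -- the lakeFS S3 Gateway host
    URL_STYLE 'path'                 -- lakeFS expects path-style addressing
);

-- The repository and branch are part of the S3 path: s3://<repository>/<branch>/<path>
SELECT * FROM read_parquet('s3://my-repo/main/inventory/books/*.parquet');
```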

docs/src/integrations/presto_trino.md

Lines changed: 70 additions & 20 deletions
@@ -1,22 +1,82 @@
 ---
-title: Presto / Trino
-description: This section explains how you can start using lakeFS with Presto and Trino, an open-source distributed SQL query engine.
+title: Trino / Presto
+description: This section explains how you can start using lakeFS with the Trino and Presto open-source distributed SQL query engines.
 ---
 
-# Using lakeFS with Presto/Trino
+# Using lakeFS with Trino / Presto
 
-[Presto](https://prestodb.io){:target="_blank"} and [Trino](https://trinodb.io){:target="_blank"} are a distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
+[Trino](https://trino.io){:target="_blank"} and [Presto](https://prestodb.io){:target="_blank"} are distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
 
-Querying data in lakeFS from Presto/Trino is similar to querying data in S3 from Presto/Trino. It is done using the [Presto Hive connector](https://prestodb.io/docs/current/connector/hive.html){:target="_blank"} or [Trino Hive connector](https://trino.io/docs/current/connector/hive.html){:target="_blank"}.
 
 
+## Iceberg REST Catalog
+
+The lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml),
+allowing Trino/Presto to manage and access tables using a standard REST API.
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+This is the recommended way to use lakeFS with Trino/Presto, as it keeps lakeFS completely outside the data path: the data itself is read and written by Trino/Presto workers directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+[Read more about using the Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog).
+
+### Configuration
+
+To use the Iceberg REST Catalog, configure Trino/Presto to use the [Iceberg REST catalog endpoint](https://trino.io/docs/current/object-storage/metastores.html#iceberg-rest-catalog):
+
+!!! tip
+    To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.
+
+```properties
+# example: /etc/trino/catalog/lakefs.properties
+connector.name=iceberg
+
+# REST Catalog connection
+iceberg.catalog.type=rest
+iceberg.rest-catalog.uri=https://lakefs.example.com/iceberg/api
+iceberg.rest-catalog.nested-namespace-enabled=true
+
+# REST Catalog authentication
+iceberg.rest-catalog.security=OAUTH2
+iceberg.rest-catalog.oauth2.credential=${ENV:LAKEFS_CREDENTIALS}
+iceberg.rest-catalog.oauth2.server-uri=https://lakefs.example.com/iceberg/api/v1/oauth/tokens
+
+# Object storage access to underlying tables (modify this to match your storage provider)
+fs.hadoop.enabled=false
+fs.native-s3.enabled=true
+s3.region=us-east-1
+s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
+s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
+```
+
+### Usage
+
+Once configured, you can use the Iceberg REST Catalog to query and update Iceberg tables.
+
+```sql
+USE "repo.main.inventory";
+SHOW TABLES;
+SELECT * FROM books LIMIT 100;
+```
+
+```sql
+USE "repo.new_branch.inventory";
+SHOW TABLES;
+SELECT * FROM books LIMIT 100;
+```
+
+## Using Presto/Trino with the S3 Gateway
+
+Using the S3 Gateway allows reading and writing data to lakeFS from Presto/Trino in any format they support (not just Iceberg tables).
+
+While flexible, this approach puts lakeFS in the data path, which can be less efficient than the Iceberg REST Catalog approach, since all data operations are proxied through the lakeFS server. This is particularly noticeable for large data sets, where the extra network hop adds bandwidth overhead.
+
+### Configuration
 
 !!! warning "Credentials"
     In the following examples, we set AWS credentials at runtime for clarity. In production, these properties should be set using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.
 
-## Configuration
-
-### Configure the Hive connector
 
 Create `/etc/catalog/hive.properties` with the following contents to mount the `hive-hadoop2` connector as the Hive catalog, replacing `example.net:9083` with the correct host and port for your Hive Metastore Thrift service:
 
@@ -34,7 +94,7 @@ hive.s3.endpoint=https://lakefs.example.com
 hive.s3.path-style-access=true
 ```
 
-### Configure Hive
+#### Configure Hive
 
 Presto/Trino uses Hive Metastore Service (HMS) or a compatible implementation of the Hive Metastore such as AWS Glue Data Catalog to write data to S3.
 In case you are using Hive Metastore, you will need to configure Hive as well.
@@ -62,14 +122,4 @@ In file `hive-site.xml` add to the configuration:
 </configuration>
 ```
 
-
-
-## Integration with lakeFS Data Catalogs
-
-For advanced integration with lakeFS that supports querying different branches as schemas, see the [Data Catalog Exports documentation](../howto/catalog_exports.md). This approach allows you to:
-
-- Query data from specific lakeFS branches using branch names as schemas
-- Automate table metadata synchronization through hooks
-- Support multiple table formats (Hive, Delta Lake, etc.)
-
-For AWS Glue users, see the detailed [Glue Data Catalog integration guide](./glue_metastore.md) which provides step-by-step instructions for setting up automated exports.
+
docs/src/integrations/spark.md

Lines changed: 38 additions & 3 deletions
@@ -7,18 +7,53 @@ description: Accessing data in lakeFS from Apache Spark works the same as access
 
 There are several ways to use lakeFS with Spark:
 
+* [**Recommended:** using the lakeFS Iceberg REST Catalog](./iceberg.md): Read and write Iceberg tables using a standards-compliant catalog, built into lakeFS Enterprise.
 * [The lakeFS FileSystem](#lakefs-hadoop-filesystem): Direct data flow from client to storage, highly scalable. <span class="badge">AWS S3</span>
 * [lakeFS FileSystem in Presigned mode](#hadoop-filesystem-in-presigned-mode): Best of both worlds. <span class="badge mr-1">AWS S3</span><span class="badge">Azure Blob</span>
 * [The S3-compatible API](#s3-compatible-api): <span class="badge">All Storage Vendors</span>
 
-!!! tip
-    See how SimilarWeb is using lakeFS with Spark to [manage algorithm changes in data pipelines](https://grdoron.medium.com/a-smarter-way-to-manage-algorithm-changes-in-data-pipelines-with-lakefs-a4e284f8c756).
 
+| Method | Metadata operations | Data operations | Supported Data Formats | Compatibility |
+|--------|---------------------|-----------------|------------------------|---------------|
+| [Iceberg REST Catalog](#iceberg-rest-catalog) | ✅ Table-level operations only | ✅ Direct I/O to underlying storage | Apache Iceberg tables | ✅ Any Spark environment capable of connecting to an Apache Iceberg REST Catalog (most) |
+| [lakeFS FileSystem](#lakefs-hadoop-filesystem) | ⚠️ Object-level operations require lakeFS API calls | ✅ Direct I/O to underlying storage | All | ⚠️ Any Spark environment capable of loading user-provided jar files (some) |
+| [S3-compatible API](#s3-compatible-api) | N/A | 🚩 All data operations are proxied through lakeFS | All | ✅ Any Spark environment capable of connecting to an S3-compatible API (most) |
+
+## Iceberg REST Catalog
+
+The lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml),
+allowing Apache Spark to manage and access tables using a standard REST API.
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+!!! example
+
+    ```scala
+    // Switch to the inventory namespace on the main branch of my_repo
+    spark.sql("USE my_repo.main.inventory")
+
+    // List available tables
+    spark.sql("SHOW TABLES").show()
+
+    // Query data with branch isolation
+    spark.sql("SELECT * FROM books").show()
+
+    // Switch to a feature branch
+    spark.sql("USE my_repo.new_branch.inventory")
+    spark.sql("SELECT * FROM books").show()
+    ```
+
+In this mode, lakeFS stays completely outside the data path: the data itself is read and written by Spark executors directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+[Read more about using the Iceberg REST Catalog](./iceberg.md).
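
For illustration (not shown in the diff above), a minimal sketch of registering the REST catalog with Spark, assuming the Iceberg Spark runtime is on the classpath; the catalog name `lakefs`, the example endpoint, and the `LAKEFS_CREDENTIALS` environment variable are placeholders — see the Iceberg documentation linked above for the authoritative configuration:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: catalog name, endpoint, and credential source are placeholders.
val spark = SparkSession.builder()
  .appName("lakefs-iceberg-rest-catalog")
  .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.lakefs.type", "rest")
  .config("spark.sql.catalog.lakefs.uri", "https://lakefs.example.com/iceberg/api")
  .config("spark.sql.catalog.lakefs.credential", sys.env("LAKEFS_CREDENTIALS"))
  .config("spark.sql.catalog.lakefs.oauth2-server-uri", "https://lakefs.example.com/iceberg/api/v1/oauth/tokens")
  .config("spark.sql.defaultCatalog", "lakefs") // lets `USE my_repo.main.inventory` resolve against this catalog
  .getOrCreate()

spark.sql("USE my_repo.main.inventory")
spark.sql("SELECT * FROM books LIMIT 10").show()
```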
 
 ## lakeFS Hadoop FileSystem
 
 In this mode, Spark will read and write objects directly from the underlying object store, reducing the load on the lakeFS server.
-It will only access the lakeFS server for metadata operations.
+It will only access the lakeFS server for metadata operations, and it works with most data formats, not just Iceberg tables.
 
 After configuring the lakeFS Hadoop FileSystem below, use URIs of the form `lakefs://example-repo/ref/path/to/data` to
 interact with your data on lakeFS.
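
For illustration (not part of the diff), a minimal sketch of reading and writing through the lakeFS Hadoop FileSystem once it is configured; the repository, branches, and paths are placeholders:

```scala
// Illustrative sketch: repository, branches, and paths below are placeholders.
// Read a dataset from the main branch...
val df = spark.read.parquet("lakefs://example-repo/main/path/to/data")

// ...and write results to an isolated feature branch of the same repository.
df.write
  .mode("overwrite")
  .parquet("lakefs://example-repo/my-feature-branch/path/to/output")
```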
