
Commit 797982c

Docs: revise structured data compute pages (#9624)
1 parent 10b382f commit 797982c

File tree: 4 files changed, +146 -25 lines

docs/mkdocs.yml

Lines changed: 2 additions & 2 deletions
@@ -191,15 +191,15 @@ nav:
     - Data Processing & Compute:
       - Apache Spark: integrations/spark.md
       - Apache Iceberg: integrations/iceberg.md
-      - Apache Hive: integrations/hive.md
       - Presto / Trino: integrations/presto_trino.md
+      - DuckDB: integrations/duckdb.md
       - Dremio: integrations/dremio.md
       - Databricks: integrations/databricks.md
       - Cloudera: integrations/cloudera.md
-      - DuckDB: integrations/duckdb.md
       - Delta Lake: integrations/delta.md
       - Amazon Athena: integrations/athena.md
       - Apache Kafka: integrations/kafka.md
+      - Apache Hive: integrations/hive.md
     - ML & AI:
       - Amazon SageMaker: integrations/sagemaker.md
       - Vertex AI: integrations/vertex_ai.md

docs/src/integrations/duckdb.md

Lines changed: 36 additions & 0 deletions
@@ -10,6 +10,42 @@ description: How to use lakeFS with DuckDB, an open-source SQL OLAP database man
 
 ## Accessing lakeFS from DuckDB
 
+The recommended way to access lakeFS from DuckDB is to use the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog).
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+This allows you to query and update Iceberg tables using a standards-compliant catalog built into lakeFS Enterprise. In this mode, lakeFS stays completely outside the data path: the data itself is read and written by DuckDB directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+```sql
+LOAD iceberg;
+LOAD httpfs;
+
+CREATE SECRET lakefs_credentials (
+    TYPE ICEBERG,
+    CLIENT_ID 'AKIAIOSFODNN7EXAMPLE',
+    CLIENT_SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
+    OAUTH2_SERVER_URI 'https://lakefs.example.com/iceberg/api/v1/oauth/tokens'
+);
+
+ATTACH '' AS main_branch (
+    TYPE iceberg,
+    SECRET lakefs_credentials,
+    ENDPOINT 'https://lakefs.example.com/iceberg/relative_to/my-repo.main/api'
+);
+
+USE main_branch.inventory;
+SELECT * FROM books;
+```
+
+!!! tip
+    To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.
+
+## Using DuckDB with the S3 Gateway
+
+Using the S3 Gateway allows reading and writing data to lakeFS from DuckDB in any format DuckDB supports (not just Iceberg tables). While flexible, this approach puts lakeFS in the data path, which can be less efficient than the Iceberg REST Catalog approach, since all data operations are proxied through the lakeFS server.
+
 ### Configuration
 
 Querying data in lakeFS from DuckDB is similar to querying data in S3 from DuckDB. It is done using the [httpfs extension](https://duckdb.org/docs/stable/core_extensions/httpfs/overview){:target="_blank"} connecting to the [S3 Gateway that lakeFS provides](../understand/architecture.md#s3-gateway).
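
For illustration (not part of the diff above), a minimal sketch of that httpfs + S3 Gateway approach; the endpoint, keys, repository, branch, and file path are placeholders reused from the examples above:

```sql
-- Illustrative sketch only: endpoint, keys, repository, and paths are placeholders.
INSTALL httpfs;
LOAD httpfs;

CREATE SECRET lakefs_gateway (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    ENDPOINT 'lakefs.example.com',   -- the lakeFS S3 Gateway host
    URL_STYLE 'path'                 -- lakeFS expects path-style addressing
);

-- The repository and branch are part of the S3 path: s3://<repository>/<branch>/<path>
SELECT * FROM read_parquet('s3://my-repo/main/inventory/books/*.parquet');
```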

docs/src/integrations/presto_trino.md

Lines changed: 70 additions & 20 deletions
@@ -1,22 +1,82 @@
 ---
-title: Presto / Trino
-description: This section explains how you can start using lakeFS with Presto and Trino, an open-source distributed SQL query engine.
+title: Trino / Presto
+description: This section explains how you can start using lakeFS with the Trino and Presto open-source distributed SQL query engines.
 ---
 
-# Using lakeFS with Presto/Trino
+# Using lakeFS with Trino / Presto
 
-[Presto](https://prestodb.io){:target="_blank"} and [Trino](https://trinodb.io){:target="_blank"} are a distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
+[Trino](https://trino.io){:target="_blank"} and [Presto](https://prestodb.io){:target="_blank"} are distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
 
-Querying data in lakeFS from Presto/Trino is similar to querying data in S3 from Presto/Trino. It is done using the [Presto Hive connector](https://prestodb.io/docs/current/connector/hive.html){:target="_blank"} or [Trino Hive connector](https://trino.io/docs/current/connector/hive.html){:target="_blank"}.
 
 
+## Iceberg REST Catalog
+
+The lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml),
+allowing Trino/Presto to manage and access tables using a standard REST API.
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+This is the recommended way to use lakeFS with Trino/Presto, as it keeps lakeFS completely outside the data path: the data itself is read and written by Trino/Presto workers directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+[Read more about using the Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog).
+
+### Configuration
+
+To use the Iceberg REST Catalog, configure Trino/Presto to use the [Iceberg REST catalog endpoint](https://trino.io/docs/current/object-storage/metastores.html#iceberg-rest-catalog):
+
+!!! tip
+    To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.
+
+```properties
+# example: /etc/trino/catalog/lakefs.properties
+connector.name=iceberg
+
+# REST Catalog connection
+iceberg.catalog.type=rest
+iceberg.rest-catalog.uri=https://lakefs.example.com/iceberg/api
+iceberg.rest-catalog.nested-namespace-enabled=true
+
+# REST Catalog authentication
+iceberg.rest-catalog.security=OAUTH2
+iceberg.rest-catalog.oauth2.credential=${ENV:LAKEFS_CREDENTIALS}
+iceberg.rest-catalog.oauth2.server-uri=https://lakefs.example.com/iceberg/api/v1/oauth/tokens
+
+# Object storage access to underlying tables (modify this to match your storage provider)
+fs.hadoop.enabled=false
+fs.native-s3.enabled=true
+s3.region=us-east-1
+s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
+s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
+```
+
+### Usage
+
+Once configured, you can use the Iceberg REST Catalog to query and update Iceberg tables.
+
+```sql
+USE "repo.main.inventory";
+SHOW TABLES;
+SELECT * FROM books LIMIT 100;
+```
+
+```sql
+USE "repo.new_branch.inventory";
+SHOW TABLES;
+SELECT * FROM books LIMIT 100;
+```
+
+## Using Presto/Trino with the S3 Gateway
+
+Using the S3 Gateway allows reading and writing data to lakeFS from Presto/Trino in any format they support (not just Iceberg tables).
+
+While flexible, this approach puts lakeFS in the data path, which can be less efficient than the Iceberg REST Catalog approach, since all data operations are proxied through the lakeFS server. This is particularly noticeable for large data sets, where the extra network hop adds bandwidth overhead.
+
+### Configuration
 
 !!! warning "Credentials"
     In the following examples, we set AWS credentials at runtime for clarity. In production, these properties should be set using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.
 
-## Configuration
-
-### Configure the Hive connector
 
 Create `/etc/catalog/hive.properties` with the following contents to mount the `hive-hadoop2` connector as the Hive catalog, replacing `example.net:9083` with the correct host and port for your Hive Metastore Thrift service:
 
@@ -34,7 +94,7 @@ hive.s3.endpoint=https://lakefs.example.com
 hive.s3.path-style-access=true
 ```
 
-### Configure Hive
+#### Configure Hive
 
 Presto/Trino uses Hive Metastore Service (HMS) or a compatible implementation of the Hive Metastore such as AWS Glue Data Catalog to write data to S3.
 In case you are using Hive Metastore, you will need to configure Hive as well.
@@ -62,14 +122,4 @@ In file `hive-site.xml` add to the configuration:
 </configuration>
 ```
 
-
-
-## Integration with lakeFS Data Catalogs
-
-For advanced integration with lakeFS that supports querying different branches as schemas, see the [Data Catalog Exports documentation](../howto/catalog_exports.md). This approach allows you to:
-
-- Query data from specific lakeFS branches using branch names as schemas
-- Automate table metadata synchronization through hooks
-- Support multiple table formats (Hive, Delta Lake, etc.)
-
-For AWS Glue users, see the detailed [Glue Data Catalog integration guide](./glue_metastore.md) which provides step-by-step instructions for setting up automated exports.
+
docs/src/integrations/spark.md

Lines changed: 38 additions & 3 deletions
@@ -7,18 +7,53 @@ description: Accessing data in lakeFS from Apache Spark works the same as access
 
 There are several ways to use lakeFS with Spark:
 
+* [**Recommended:** using the lakeFS Iceberg REST Catalog](./iceberg.md): Read and write Iceberg tables using a standards-compliant catalog, built into lakeFS Enterprise.
 * [The lakeFS FileSystem](#lakefs-hadoop-filesystem): Direct data flow from client to storage, highly scalable. <span class="badge">AWS S3</span>
 * [lakeFS FileSystem in Presigned mode](#hadoop-filesystem-in-presigned-mode): Best of both worlds. <span class="badge mr-1">AWS S3</span><span class="badge">Azure Blob</span>
 * [The S3-compatible API](#s3-compatible-api): <span class="badge">All Storage Vendors</span>
 
-!!! tip
-    See how SimilarWeb is using lakeFS with Spark to [manage algorithm changes in data pipelines](https://grdoron.medium.com/a-smarter-way-to-manage-algorithm-changes-in-data-pipelines-with-lakefs-a4e284f8c756).
 
+| Method | Metadata operations | Data operations | Supported Data Formats | Compatibility |
+|--------|---------------------|-----------------|------------------------|---------------|
+| [Iceberg REST Catalog](#iceberg-rest-catalog) | ✅ Table-level operations only | ✅ Direct I/O to underlying storage | Apache Iceberg tables | ✅ Any Spark environment capable of connecting to an Apache Iceberg REST Catalog (most) |
+| [lakeFS FileSystem](#lakefs-hadoop-filesystem) | ⚠️ Object-level operations require lakeFS API calls | ✅ Direct I/O to underlying storage | All | ⚠️ Any Spark environment capable of loading user-provided jar files (some) |
+| [S3-compatible API](#s3-compatible-api) | N/A | 🚩 All data operations are proxied through lakeFS | All | ✅ Any Spark environment capable of connecting to an S3-compatible API (most) |
+
+## Iceberg REST Catalog
+
+The lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml),
+allowing Apache Spark to manage and access tables using a standard REST API.
+
+![lakeFS Iceberg REST Catalog](../assets/img/lakefs_iceberg_rest_catalog.png)
+
+!!! example
+
+    ```scala
+    // Switch to the inventory namespace on the main branch of my_repo
+    spark.sql("USE my_repo.main.inventory")
+
+    // List available tables
+    spark.sql("SHOW TABLES").show()
+
+    // Query data with branch isolation
+    spark.sql("SELECT * FROM books").show()
+
+    // Switch to a feature branch
+    spark.sql("USE my_repo.new_branch.inventory")
+    spark.sql("SELECT * FROM books").show()
+    ```
+
+In this mode, lakeFS stays completely outside the data path: the data itself is read and written by Spark executors directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
+
+[Read more about using the Iceberg REST Catalog](./iceberg.md).
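
For illustration (not shown in the diff above), a minimal sketch of registering the REST catalog with Spark, assuming the Iceberg Spark runtime is on the classpath; the catalog name `lakefs`, the example endpoint, and the `LAKEFS_CREDENTIALS` environment variable are placeholders — see the Iceberg documentation linked above for the authoritative configuration:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: catalog name, endpoint, and credential source are placeholders.
val spark = SparkSession.builder()
  .appName("lakefs-iceberg-rest-catalog")
  .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.lakefs.type", "rest")
  .config("spark.sql.catalog.lakefs.uri", "https://lakefs.example.com/iceberg/api")
  .config("spark.sql.catalog.lakefs.credential", sys.env("LAKEFS_CREDENTIALS"))
  .config("spark.sql.catalog.lakefs.oauth2-server-uri", "https://lakefs.example.com/iceberg/api/v1/oauth/tokens")
  .config("spark.sql.defaultCatalog", "lakefs") // lets `USE my_repo.main.inventory` resolve against this catalog
  .getOrCreate()

spark.sql("USE my_repo.main.inventory")
spark.sql("SELECT * FROM books LIMIT 10").show()
```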
 
 ## lakeFS Hadoop FileSystem
 
 In this mode, Spark will read and write objects directly from the underlying object store, reducing the load on the lakeFS server.
-It will only access the lakeFS server for metadata operations.
+It will only access the lakeFS server for metadata operations, and it works with most data formats, not just Iceberg tables.
 
 After configuring the lakeFS Hadoop FileSystem below, use URIs of the form `lakefs://example-repo/ref/path/to/data` to
 interact with your data on lakeFS.
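
For illustration (not part of the diff), a minimal sketch of reading and writing through the lakeFS Hadoop FileSystem once it is configured; the repository, branches, and paths are placeholders:

```scala
// Illustrative sketch: repository, branches, and paths below are placeholders.
// Read a dataset from the main branch...
val df = spark.read.parquet("lakefs://example-repo/main/path/to/data")

// ...and write results to an isolated feature branch of the same repository.
df.write
  .mode("overwrite")
  .parquet("lakefs://example-repo/my-feature-branch/path/to/output")
```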
