This allows you to query and update Iceberg tables using a standards-compliant catalog built into lakeFS Enterprise. In this mode, lakeFS stays completely outside the data path: data is read and written by DuckDB directly against the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.
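
A rough sketch of what this looks like from DuckDB, assuming a recent DuckDB release with the `iceberg` extension (the catalog name, endpoint path, and authentication are placeholders; see the linked page for the exact connection options):

```sql
INSTALL iceberg;
LOAD iceberg;

-- The endpoint below is an illustrative placeholder; authentication options
-- depend on your DuckDB version and lakeFS setup.
ATTACH 'lakefs' AS lakefs_catalog (
    TYPE iceberg,
    ENDPOINT 'https://lakefs.example.com/iceberg/api'
);

-- Repository and branch are part of the catalog namespace.
SHOW ALL TABLES;
```
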
## Using DuckDB with the S3 Gateway
Using the S3 Gateway allows reading and writing data to lakeFS from DuckDB, in any format supported by DuckDB (i.e. not just Iceberg tables). While flexible, this approach requires lakeFS to be involved in the data path, which can be less efficient than the Iceberg REST Catalog approach, since lakeFS has to proxy all data operations through the lakeFS server.
### Configuration
Querying data in lakeFS from DuckDB is similar to querying data in S3 from DuckDB. It is done using the [httpfs extension](https://duckdb.org/docs/stable/core_extensions/httpfs/overview){:target="_blank"} connecting to the [S3 Gateway that lakeFS provides](../understand/architecture.md#s3-gateway).
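
A minimal sketch (the endpoint, credentials, repository, and object path below are placeholders for your own setup):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Point DuckDB's S3 settings at the lakeFS S3 Gateway instead of AWS S3.
SET s3_endpoint = 'lakefs.example.com';
SET s3_url_style = 'path';
SET s3_use_ssl = true;
SET s3_access_key_id = 'AKIAIOSFODNN7EXAMPLE';      -- lakeFS access key ID
SET s3_secret_access_key = 'wJalrXUtn_EXAMPLE_KEY'; -- lakeFS secret access key

-- Paths take the form s3://<repository>/<branch-or-ref>/<path-to-object>.
SELECT * FROM read_parquet('s3://example-repo/main/path/to/data.parquet') LIMIT 100;
```
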
docs/src/integrations/presto_trino.md (+70 -20 lines)

---
title: Trino / Presto
description: This section explains how you can start using lakeFS with the Trino and Presto open-source distributed SQL query engines.
---

# Using lakeFS with Trino / Presto

[Trino](https://trinodb.io){:target="_blank"} and [Presto](https://prestodb.io){:target="_blank"} are distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.
## Iceberg REST Catalog
lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml), allowing Trino/Presto to manage and access tables using a standard REST API.

This is the recommended way to use lakeFS with Trino/Presto, as it allows lakeFS to stay completely outside the data path: data itself is read and written by Trino/Presto workers directly to the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.

[Read more about using the Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog).
### Configuration
To use the Iceberg REST Catalog, you need to configure Trino/Presto to use the [Iceberg REST catalog endpoint](https://trino.io/docs/current/object-storage/metastores.html#iceberg-rest-catalog):
!!! tip
    To learn more about the Iceberg REST Catalog, see the [Iceberg REST Catalog](./iceberg.md#iceberg-rest-catalog) documentation.

```properties
# Iceberg REST catalog connection (the values below are illustrative placeholders;
# point the URI at your lakeFS server and supply your lakeFS credentials)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://lakefs.example.com/iceberg/api

# Object storage access to underlying tables (modify this to match your storage provider)
fs.hadoop.enabled=false
fs.native-s3.enabled=true
s3.region=us-east-1
s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
```

### Usage
Once configured, you can use the Iceberg REST Catalog to query and update Iceberg tables.
```sql
USE "repo.main.inventory";
SHOW TABLES;
SELECT * FROM books LIMIT 100;
```

```sql
USE "repo.new_branch.inventory";
SHOW TABLES;
SELECT * FROM books LIMIT 100;
```

## Using Presto/Trino with the S3 Gateway
Using the S3 Gateway allows reading and writing data to lakeFS from Presto/Trino, in any format supported by Presto/Trino (i.e. not just Iceberg tables).
While flexible, this approach requires lakeFS to be involved in the data path, which can be less efficient than the Iceberg REST Catalog approach, since lakeFS has to proxy all data operations through the lakeFS server. This is especially noticeable for large data sets, where proxying adds network overhead.

### Configuration
!!! warning "Credentials"
    In the following examples, we set AWS credentials at runtime for clarity. In production, these properties should be set using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.
Create `/etc/catalog/hive.properties` with the following contents to mount the `hive-hadoop2` connector as the Hive catalog, replacing `example.net:9083` with the correct host and port for your Hive Metastore Thrift service:
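
The full file contents are not shown in this excerpt; a rough sketch, using the Presto/Trino Hive connector property names with S3 access routed through the lakeFS S3 Gateway (the endpoint and credentials below are placeholders):

```properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083

# Route S3 access through the lakeFS S3 Gateway (placeholder endpoint and credentials)
hive.s3.endpoint=https://lakefs.example.com
hive.s3.path-style-access=true
hive.s3.aws-access-key=AKIAIOSFODNN7EXAMPLE
hive.s3.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
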
Presto/Trino uses the Hive Metastore Service (HMS), or a compatible implementation such as the AWS Glue Data Catalog, to write data to S3.

If you are using the Hive Metastore, you will need to configure Hive as well.

In file `hive-site.xml`, add to the configuration:

```xml
<configuration>
    <!-- The full property list is not shown in this excerpt; it points the S3A
         filesystem at the lakeFS S3 Gateway (fs.s3a.endpoint, path-style access,
         and the lakeFS credentials for your deployment). -->
</configuration>
```

docs/src/integrations/spark.md (+38 -3 lines)

There are several ways to use lakeFS with Spark:
* [**Recommended:** using the lakeFS Iceberg REST Catalog](./iceberg.md): Read and write Iceberg tables using a standards-compliant catalog, built into lakeFS Enterprise.
* [The lakeFS FileSystem](#lakefs-hadoop-filesystem): Direct data flow from client to storage, highly scalable. <span class="badge">AWS S3</span>
* [lakeFS FileSystem in Presigned mode](#hadoop-filesystem-in-presigned-mode): Best of both worlds. <span class="badge mr-1">AWS S3</span> <span class="badge">Azure Blob</span>
See how SimilarWeb is using lakeFS with Spark to [manage algorithm changes in data pipelines](https://grdoron.medium.com/a-smarter-way-to-manage-algorithm-changes-in-data-pipelines-with-lakefs-a4e284f8c756).

| Method | Metadata operations | Data operations | Supported Data Formats | Compatibility |
|--------|---------------------|-----------------|------------------------|---------------|
| [Iceberg REST Catalog](#iceberg-rest-catalog) | ✅ Table-level operations only | ✅ Direct I/O to underlying storage | Apache Iceberg Tables | ✅ Any Spark environment capable of connecting to an Apache Iceberg REST Catalog (most) |
| [lakeFS FileSystem](#lakefs-hadoop-filesystem) | ⚠️ Object-level operations require lakeFS API calls | ✅ Direct I/O to underlying storage | All | ⚠️ Any Spark environment capable of loading user provided jar files (some) |
| [S3-compatible API](#s3-compatible-api) | N/A | 🚩 All data operations are proxied through lakeFS | All | ✅ Any Spark environment capable of connecting to an S3-compatible API (most) |

## Iceberg REST Catalog
lakeFS Iceberg REST Catalog allows you to use lakeFS as a [spec-compliant](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) Apache [Iceberg REST catalog](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml), allowing Apache Spark to manage and access tables using a standard REST API.

In this mode, lakeFS stays completely outside the data path: data itself is read and written by Spark executors, directly to the underlying object store. Metadata is managed by Iceberg at the table level, while lakeFS keeps track of new snapshots to provide versioning and isolation.
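
The full setup is covered on the linked page below; as a rough sketch (the catalog name `lakefs`, the endpoint path, and the authentication details are placeholders rather than documented values), the Spark configuration defines an Iceberg REST catalog along these lines:

```properties
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.lakefs=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakefs.type=rest
# Placeholder endpoint: point this at your lakeFS server's Iceberg REST API
spark.sql.catalog.lakefs.uri=https://lakefs.example.com/iceberg/api
# Authentication depends on your lakeFS setup (e.g. OAuth2 client credentials)
```
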
[Read more about using the Iceberg REST Catalog](./iceberg.md).
## lakeFS Hadoop FileSystem
In this mode, Spark will read and write objects directly from the underlying object store, reducing the load on the lakeFS server.
It will only access the lakeFS server for metadata operations, an approach that works with most other data formats (not only Iceberg tables).
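
The full configuration is described below; as a rough sketch (property names come from the lakeFS Hadoop FileSystem client, while the endpoint and credentials are placeholders for your deployment):

```properties
# Register the lakeFS FileSystem for lakefs:// URIs
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
# Placeholder lakeFS endpoint and credentials (used for metadata operations only)
spark.hadoop.fs.lakefs.endpoint=https://lakefs.example.com/api/v1
spark.hadoop.fs.lakefs.access.key=AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.lakefs.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Data is read and written directly against the underlying object store, so the
# usual fs.s3a.* (or equivalent) credentials must be configured as well.
```
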
After configuring the lakeFS Hadoop FileSystem below, use URIs of the form `lakefs://example-repo/ref/path/to/data` to