
Commit 098f7ab

docs: iceberg quick start (#1987)
* Update 02-iceberg.md
* updates
1 parent 37eebc0 commit 098f7ab

File tree

2 files changed: +222 −0 lines changed

docs/en/guides/10-deploy/04-references/02-node-config/02-query-config.md

Lines changed: 1 addition & 0 deletions
@@ -228,6 +228,7 @@ The following is a list of the parameters available within the [cache] section:
| Parameter                | Description                                                                                                                                                            |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| data_cache_storage       | The type of storage used for the table data cache. Available options: "none" (disables the table data cache) and "disk" (enables the disk cache). Defaults to "none". |
| iceberg_table_meta_count | Controls the number of Iceberg table metadata entries to cache. Set to `0` to disable metadata caching.                                                                |

### [cache.disk] Section

docs/en/guides/51-access-data-lake/02-iceberg.md

Lines changed: 221 additions & 0 deletions
@@ -7,6 +7,202 @@ import FunctionDescription from '@site/src/components/FunctionDescription';

Databend supports the integration of an [Apache Iceberg](https://iceberg.apache.org/) catalog, enhancing its compatibility and versatility for data management and analytics. This integration seamlessly brings Apache Iceberg's powerful metadata and storage management capabilities into the platform.

## Quick Start with Apache Iceberg

If you want to quickly try out Apache Iceberg and experiment with table operations locally, a [Docker-based starter project](https://github.com/databendlabs/iceberg-quick-start) is available. This setup allows you to:

- Run Spark with Iceberg support
- Use a REST catalog (Iceberg REST Fixture)
- Simulate an S3-compatible object store using MinIO
- Load sample TPC-H data into Iceberg tables for query testing

### Prerequisites

Before you start, make sure Docker and Docker Compose are installed on your system.
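
You can quickly verify that both are available:

```bash
docker --version
docker compose version
```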

### Start Iceberg Environment

```bash
git clone https://github.com/databendlabs/iceberg-quick-start.git
cd iceberg-quick-start
docker compose up -d
```

This will start the following services:

- `spark-iceberg`: Spark 3.4 with Iceberg
- `rest`: Iceberg REST Catalog
- `minio`: S3-compatible object store
- `mc`: MinIO client (for setting up the bucket)

```bash
WARN[0000] /Users/eric/iceberg-quick-start/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 5/5
 ✔ Network iceberg-quick-start_iceberg_net  Created    0.0s
 ✔ Container iceberg-rest-test              Started    0.4s
 ✔ Container minio                          Started    0.4s
 ✔ Container mc                             Started    0.6s
 ✔ Container spark-iceberg                  S...       0.7s
```
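
To double-check that all services are up, list the containers managed by the Compose project:

```bash
docker compose ps
```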

### Load TPC-H Data via Spark Shell

Run the following command to generate and load sample TPC-H data into the Iceberg tables:

```bash
docker exec spark-iceberg bash /home/iceberg/load_tpch.sh
```

```bash
Collecting duckdb
  Downloading duckdb-1.2.2-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (18.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.7/18.7 MB 5.8 MB/s eta 0:00:00
Requirement already satisfied: pyspark in /opt/spark/python (3.5.5)
Collecting py4j==0.10.9.7
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 5.9 MB/s eta 0:00:00
Installing collected packages: py4j, duckdb
Successfully installed duckdb-1.2.2 py4j-0.10.9.7

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/07 12:17:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/07 12:17:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[2025-05-07 12:17:18] [INFO] Starting TPC-H data generation and loading process
[2025-05-07 12:17:18] [INFO] Configuration: Scale Factor=1, Data Dir=/home/iceberg/data/tpch_1
[2025-05-07 12:17:18] [INFO] Generating TPC-H data with DuckDB (Scale Factor: 1)
[2025-05-07 12:17:27] [INFO] Generated 8 Parquet files in /home/iceberg/data/tpch_1
[2025-05-07 12:17:28] [INFO] Loading data into Iceberg catalog
[2025-05-07 12:17:33] [INFO] Created Iceberg table: demo.tpch.part from part.parquet
[2025-05-07 12:17:33] [INFO] Created Iceberg table: demo.tpch.region from region.parquet
[2025-05-07 12:17:33] [INFO] Created Iceberg table: demo.tpch.supplier from supplier.parquet
[2025-05-07 12:17:35] [INFO] Created Iceberg table: demo.tpch.orders from orders.parquet
[2025-05-07 12:17:35] [INFO] Created Iceberg table: demo.tpch.nation from nation.parquet
[2025-05-07 12:17:40] [INFO] Created Iceberg table: demo.tpch.lineitem from lineitem.parquet
[2025-05-07 12:17:40] [INFO] Created Iceberg table: demo.tpch.partsupp from partsupp.parquet
[2025-05-07 12:17:41] [INFO] Created Iceberg table: demo.tpch.customer from customer.parquet
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|     tpch| customer|      false|
|     tpch| lineitem|      false|
|     tpch|   nation|      false|
|     tpch|   orders|      false|
|     tpch|     part|      false|
|     tpch| partsupp|      false|
|     tpch|   region|      false|
|     tpch| supplier|      false|
+---------+---------+-----------+

[2025-05-07 12:17:42] [SUCCESS] TPCH data generation and loading completed successfully
```

### Query Data in Databend

Once the TPC-H tables are loaded, you can query the data in Databend:

1. Launch Databend in Docker:

```bash
docker network create iceberg_net
```

```bash
docker run -d \
  --name databend \
  --network iceberg_net \
  -p 3307:3307 \
  -p 8000:8000 \
  -p 8124:8124 \
  -p 8900:8900 \
  datafuselabs/databend
```
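
Before connecting, you can confirm the Databend container is running:

```bash
docker ps --filter name=databend
```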

2. Connect to Databend using BendSQL first, and then create an Iceberg catalog:

```bash
bendsql
```

```bash
Welcome to BendSQL 0.24.1-f1f7de0(2024-12-04T12:31:18.526234000Z).
Connecting to localhost:8000 as user root.
Connected to Databend Query v1.2.725-8d073f6b7a(rust-1.88.0-nightly-2025-04-21T11:49:03.577976082Z)
Loaded 1436 auto complete keywords from server.
Started web server at 127.0.0.1:8080
```

```sql
CREATE CATALOG iceberg TYPE = ICEBERG CONNECTION = (
    TYPE = 'rest'
    ADDRESS = 'http://host.docker.internal:8181'
    WAREHOUSE = 's3://warehouse/wh/'
    "s3.endpoint" = 'http://host.docker.internal:9000'
    "s3.access-key-id" = 'admin'
    "s3.secret-access-key" = 'password'
    "s3.region" = 'us-east-1'
);
```
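
To confirm the catalog was created, you can list the catalogs known to the server:

```sql
SHOW CATALOGS;
```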

3. Use the newly created catalog:

```sql
USE CATALOG iceberg;
```

4. Show available databases:

```sql
SHOW DATABASES;
```

```sql
╭──────────────────────╮
│ databases_in_iceberg │
│        String        │
├──────────────────────┤
│ tpch                 │
╰──────────────────────╯
```
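
You can also list the tables in the `tpch` database; the eight TPC-H tables loaded earlier should appear:

```sql
SHOW TABLES FROM tpch;
```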

5. Run a sample query to aggregate TPC-H data:

```sql
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM
    iceberg.tpch.lineitem
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
```

```sql
┌──────────────────┬──────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬─────────────┐
│ l_returnflag     │ l_linestatus     │ sum_qty                  │ sum_base_price           │ sum_disc_price           │ sum_charge               │ avg_qty                  │ avg_price                │ avg_disc                 │ count_order │
│ Nullable(String) │ Nullable(String) │ Nullable(Decimal(38, 2)) │ Nullable(Decimal(38, 2)) │ Nullable(Decimal(38, 4)) │ Nullable(Decimal(38, 6)) │ Nullable(Decimal(38, 8)) │ Nullable(Decimal(38, 8)) │ Nullable(Decimal(38, 8)) │ UInt64      │
├──────────────────┼──────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┼─────────────┤
│ A                │ F                │              37734107.00 │           56586554400.73 │         53758257134.8700 │       55909065222.827692 │              25.52200585 │           38273.12973462 │               0.04998530 │     1478493 │
│ N                │ F                │                991417.00 │            1487504710.38 │          1413082168.0541 │        1469649223.194375 │              25.51647192 │           38284.46776085 │               0.05009343 │       38854 │
│ N                │ O                │              76633518.00 │          114935210409.19 │        109189591897.4720 │      113561024263.013782 │              25.50201964 │           38248.01560906 │               0.05000026 │     3004998 │
│ R                │ F                │              37719753.00 │           56568041380.90 │         53741292684.6040 │       55889619119.831932 │              25.50579361 │           38250.85462610 │               0.05000941 │     1478870 │
└──────────────────┴──────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴─────────────┘
```

## Datatype Mapping

This table maps data types between Apache Iceberg and Databend. Please note that Databend does not currently support Iceberg data types that are not listed in the table.

@@ -115,6 +311,31 @@ Switches the current session to the specified catalog.

```sql
USE CATALOG <catalog_name>
```

## Caching Iceberg Catalog

Databend offers a Catalog Metadata Cache specifically designed for Iceberg catalogs. When a query is executed on an Iceberg table for the first time, the metadata is cached in memory. By default, this cache remains valid for 10 minutes, after which it is asynchronously refreshed. This ensures that queries on Iceberg tables are faster by avoiding repeated metadata retrieval.

If you need fresh metadata, you can manually refresh the cache using the following commands:

```sql
USE CATALOG iceberg;
ALTER DATABASE tpch REFRESH CACHE;       -- Refresh metadata cache for the tpch database
ALTER TABLE tpch.lineitem REFRESH CACHE; -- Refresh metadata cache for the lineitem table
```

If you prefer not to use the metadata cache, you can disable it entirely by setting `iceberg_table_meta_count` to `0` in the [databend-query.toml](https://github.com/databendlabs/databend/blob/main/scripts/distribution/configs/databend-query.toml) configuration file:

```toml
...
# Cache config.
[cache]
...
iceberg_table_meta_count = 0
...
```

In addition to metadata caching, Databend also supports table data caching for Iceberg catalog tables, similar to Fuse tables. For more information on data caching, refer to the `[cache] Section` in the [Query Configurations](../10-deploy/04-references/02-node-config/02-query-config.md) reference.
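
As a minimal sketch of enabling that data cache (the `data_cache_storage` option and the `[cache.disk]` section appear in the query config reference above, but the `path` and `max_bytes` keys here are assumptions — check that reference for the exact parameter names):

```toml
[cache]
# Switch the table data cache from the default "none" to disk-backed storage.
data_cache_storage = "disk"

[cache.disk]
# Assumed keys: where the cache lives on disk and its maximum size in bytes.
path = "./.databend/_cache"
max_bytes = 21474836480
```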

## Iceberg Table Functions

Databend provides the following table functions for querying Iceberg metadata, allowing users to inspect snapshots and manifests efficiently:
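
For example, a hedged sketch of inspecting a table's metadata (the `iceberg_snapshot`/`iceberg_manifest` names and their `('<database>', '<table>')` signature are assumptions, not confirmed by this excerpt — see the function reference for the exact forms):

```sql
USE CATALOG iceberg;
-- Assumed table functions: one row per snapshot / manifest entry.
SELECT * FROM iceberg_snapshot('tpch', 'lineitem');
SELECT * FROM iceberg_manifest('tpch', 'lineitem');
```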
