
Commit 67da720

[SPARK-53933] Add Apache Iceberg example
### What changes were proposed in this pull request?

This PR aims to add an `Apache Iceberg` example.

### Why are the changes needed?

To provide a working example of `Apache Spark 4.0.1` and `Apache Iceberg 1.10.0`.

1. Prepare the storage.

```
$ kubectl apply -f examples/localstack.yml
```

2. Launch `Spark Connect Server` with the `Apache Iceberg` settings.

```
$ kubectl apply -f examples/spark-connect-server-iceberg.yaml
```

3. Set up port-forwarding to test.

```
$ kubectl port-forward spark-connect-server-iceberg-0-driver 15002
```

4. Test with the `Apache Iceberg Spark Quickstart` guideline.

- https://iceberg.apache.org/spark-quickstart/

```scala
$ bin/spark-connect-shell --remote sc://localhost:15002

scala> sql("""CREATE TABLE taxis(vendor_id bigint, trip_id bigint, trip_distance float, fare_amount double, store_and_fwd_flag string) PARTITIONED BY (vendor_id);""").show()

scala> sql("""INSERT INTO taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');""").show()

scala> sql("SELECT * FROM taxis").show(false)
+---------+-------+-------------+-----------+------------------+
|vendor_id|trip_id|trip_distance|fare_amount|store_and_fwd_flag|
+---------+-------+-------------+-----------+------------------+
|1        |1000374|8.4          |42.13      |Y                 |
|1        |1000371|1.8          |15.32      |N                 |
|2        |1000372|2.5          |22.15      |N                 |
|2        |1000373|0.9          |9.01       |N                 |
+---------+-------+-------------+-----------+------------------+

scala> sql("SELECT * FROM taxis.history").show(false)
+-----------------------+-------------------+---------+-------------------+
|made_current_at        |snapshot_id        |parent_id|is_current_ancestor|
+-----------------------+-------------------+---------+-------------------+
|2025-10-16 03:53:04.063|6463217948421571140|NULL     |true               |
+-----------------------+-------------------+---------+-------------------+

scala> sql("SELECT * FROM taxis VERSION AS OF 6463217948421571140").show(false)
+---------+-------+-------------+-----------+------------------+
|vendor_id|trip_id|trip_distance|fare_amount|store_and_fwd_flag|
+---------+-------+-------------+-----------+------------------+
|1        |1000374|8.4          |42.13      |Y                 |
|1        |1000371|1.8          |15.32      |N                 |
|2        |1000372|2.5          |22.15      |N                 |
|2        |1000373|0.9          |9.01       |N                 |
+---------+-------+-------------+-----------+------------------+
```

5. Check the data in the storage.

```
root@localstack:/opt/code/localstack# awslocal s3 ls s3://warehouse/ --recursive
2025-10-16 03:53:03       1545 taxis/data/vendor_id=1/00000-3-749fe2e5-bbe3-4f0e-b976-a21749550705-0-00002.parquet
2025-10-16 03:53:03       1590 taxis/data/vendor_id=2/00000-3-749fe2e5-bbe3-4f0e-b976-a21749550705-0-00001.parquet
2025-10-16 03:53:04       7559 taxis/metadata/9f629d40-ce91-4822-aeee-283d53ec5ef6-m0.avro
2025-10-16 03:53:04       4446 taxis/metadata/snap-6463217948421571140-1-9f629d40-ce91-4822-aeee-283d53ec5ef6.avro
2025-10-16 03:52:47       1006 taxis/metadata/v1.metadata.json
2025-10-16 03:53:04       1970 taxis/metadata/v2.metadata.json
2025-10-16 03:53:04          1 taxis/metadata/version-hint.text
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #394 from dongjoon-hyun/SPARK-53933.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
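The `awslocal` listing in step 5 reflects Iceberg's standard table layout: Parquet data files under `data/`, partitioned by `vendor_id`, and manifest, snapshot, and versioned table-metadata files under `metadata/`. A stdlib-only Python sketch that groups the listed keys by layout directory (the key list is copied from the output above; the grouping logic itself is illustrative, not part of any Iceberg API):

```python
from collections import Counter

# S3 object keys as shown by `awslocal s3 ls s3://warehouse/ --recursive`.
keys = [
    "taxis/data/vendor_id=1/00000-3-749fe2e5-bbe3-4f0e-b976-a21749550705-0-00002.parquet",
    "taxis/data/vendor_id=2/00000-3-749fe2e5-bbe3-4f0e-b976-a21749550705-0-00001.parquet",
    "taxis/metadata/9f629d40-ce91-4822-aeee-283d53ec5ef6-m0.avro",
    "taxis/metadata/snap-6463217948421571140-1-9f629d40-ce91-4822-aeee-283d53ec5ef6.avro",
    "taxis/metadata/v1.metadata.json",
    "taxis/metadata/v2.metadata.json",
    "taxis/metadata/version-hint.text",
]

# Count objects per top-level directory inside the `taxis` table path:
# one data file per vendor_id partition, five metadata/manifest files.
layout = Counter(k.split("/")[1] for k in keys)
print(dict(layout))  # {'data': 2, 'metadata': 5}
```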
1 parent 241592c commit 67da720

File tree

2 files changed: +46 −0 lines changed

examples/localstack.yml (1 addition & 0 deletions)

```diff
@@ -41,6 +41,7 @@ spec:
 awslocal s3 mb s3://spark-events;
 awslocal s3 mb s3://ingest;
 awslocal s3 mb s3://data;
+awslocal s3 mb s3://warehouse;
 awslocal s3 cp /opt/code/localstack/Makefile s3://data/
 ---
 apiVersion: v1
```
examples/spark-connect-server-iceberg.yaml (45 additions & 0 deletions)

```yaml
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: spark.apache.org/v1
kind: SparkApplication
metadata:
  name: spark-connect-server-iceberg
spec:
  mainClass: "org.apache.spark.sql.connect.service.SparkConnectServer"
  sparkConf:
    spark.dynamicAllocation.enabled: "true"
    spark.dynamicAllocation.maxExecutors: "3"
    spark.dynamicAllocation.minExecutors: "3"
    spark.dynamicAllocation.shuffleTracking.enabled: "true"
    spark.hadoop.fs.s3a.access.key: "test"
    spark.hadoop.fs.s3a.endpoint: "http://localstack:4566"
    spark.hadoop.fs.s3a.path.style.access: "true"
    spark.hadoop.fs.s3a.secret.key: "test"
    spark.jars.ivy: "/tmp/.ivy2.5.2"
    spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.4.1,org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0"
    spark.kubernetes.authenticate.driver.serviceAccountName: "spark"
    spark.kubernetes.container.image: "apache/spark:4.0.1"
    spark.kubernetes.driver.pod.excludedFeatureSteps: "org.apache.spark.deploy.k8s.features.KerberosConfDriverFeatureStep"
    spark.kubernetes.executor.podNamePrefix: "spark-connect-server-iceberg"
    spark.scheduler.mode: "FAIR"
    spark.sql.catalog.s3.type: "hadoop"
    spark.sql.catalog.s3.warehouse: "s3a://warehouse"
    spark.sql.catalog.s3: "org.apache.iceberg.spark.SparkCatalog"
    spark.sql.defaultCatalog: "s3"
    spark.sql.extensions: "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  applicationTolerations:
    resourceRetainPolicy: OnFailure
  runtimeVersions:
    sparkVersion: "4.0.1"
```
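The `spark.sql.catalog.*` entries in the manifest follow Iceberg's Spark configuration convention: a key of the exact form `spark.sql.catalog.<name>` declares the catalog implementation class, while dotted suffixes such as `.type` and `.warehouse` configure that catalog. A stdlib-only Python sketch of that relationship, using the keys from the manifest (the `catalog_names` helper is illustrative, not a Spark API):

```python
CATALOG = "s3"  # catalog name used in the example manifest

# Conf keys copied from the SparkApplication manifest above.
conf = {
    f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG}.type": "hadoop",
    f"spark.sql.catalog.{CATALOG}.warehouse": "s3a://warehouse",
    "spark.sql.defaultCatalog": CATALOG,
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

def catalog_names(conf):
    """Return catalog names declared via spark.sql.catalog.<name> keys.

    Keys with a further dotted suffix (e.g. .type, .warehouse) are
    per-catalog settings, not declarations, so they are excluded.
    """
    prefix = "spark.sql.catalog."
    return sorted(k[len(prefix):] for k in conf
                  if k.startswith(prefix) and "." not in k[len(prefix):])

print(catalog_names(conf))  # ['s3']
```

Because `spark.sql.defaultCatalog` is also set to `s3`, the unqualified table name `taxis` in the quickstart SQL resolves to the Iceberg catalog backed by `s3a://warehouse`.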
