Commit 10cf0fd

[AURON apache#2030] Add Native Scan Support for Apache Hudi Copy-On-Write Tables. (apache#2031)
### Which issue does this PR close?

Closes apache#2030

### Rationale for this change

This PR adds native scan support for Hudi Copy-On-Write (COW) tables, enabling Auron to accelerate Hudi table reads by converting `FileSourceScanExec` operations to native Parquet/ORC scan implementations.

### What changes are included in this PR?

#### 1. New Module: `thirdparty/auron-hudi`

- **`HudiConvertProvider`**: Implements the `AuronConvertProvider` SPI to intercept and convert Hudi `FileSourceScanExec` to native scans
  - Detects Hudi file formats (`HoodieParquetFileFormat`, `HoodieOrcFileFormat`)
  - Converts to `NativeParquetScanExec` or `NativeOrcScanExec`
  - Handles timestamp fallback logic automatically
- **`HudiScanSupport`**: Core detection and validation logic
  - File format recognition, with `NewHoodie*` format rejection
  - Table type resolution via multi-source metadata fallback: Options → Catalog → `.hoodie/hoodie.properties`
  - MOR table detection and rejection
  - Time travel query detection (via the `as.of.instant` and `as.of.timestamp` options)
  - FileIndex class hierarchy verification

#### 2. Configuration

- Added the `spark.auron.enable.hudi.scan` config option (default: `true`)
- Respects existing Parquet/ORC timestamp scanning configurations
- Runtime Spark version validation (3.0–3.5 only)

#### 3. Build & Integration

- **Maven**: New profile `hudi-0.15` with enforcer rules
  - Validates the `hudiEnabled=true` property
  - Restricts Spark to 3.0–3.5
  - Pins the Hudi version to 0.15.0
- **Build Script**: Enhanced `auron-build.sh`
  - Added the `--hudi <VERSION>` parameter
  - Version compatibility validation
  - Auto-enables the `hudiEnabled` property
- **CI/CD**: New workflow `.github/workflows/hudi.yml`
  - Matrix testing: Spark 3.0–3.5 × JDK 8/17/21 × Scala 2.12
  - Independent Hudi test pipeline

### Are there any user-facing changes?

#### New Configuration Option

```scala
// Enable Hudi native scan (enabled by default)
spark.conf.set("spark.auron.enable.hudi.scan", "true")
```

### How was this patch tested?

Added JUnit tests.

Signed-off-by: slfan1989 <slfan1989@apache.org>
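The eligibility rules listed above (supported Hoodie file formats with `NewHoodie*` rejected, COW-only table types, and no time-travel options) can be sketched as plain predicates. This is an illustrative reconstruction, not Auron's actual `HudiScanSupport` API; the class and method names below are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the checks HudiScanSupport is described as performing.
class HudiScanEligibility {

    /** Only the classic Hoodie formats qualify; NewHoodie* variants are rejected. */
    static boolean isSupportedFileFormat(String simpleClassName) {
        return simpleClassName.equals("HoodieParquetFileFormat")
                || simpleClassName.equals("HoodieOrcFileFormat");
    }

    /** Time-travel reads are detected via the options named in the PR. */
    static boolean isTimeTravelQuery(Map<String, String> options) {
        return options.containsKey("as.of.instant") || options.containsKey("as.of.timestamp");
    }

    /** Only Copy-On-Write tables qualify; Merge-On-Read is rejected. */
    static boolean isCopyOnWrite(Optional<String> tableType) {
        return tableType.map("COPY_ON_WRITE"::equalsIgnoreCase).orElse(false);
    }

    /** Combined gate: all three conditions must hold for a native scan. */
    static boolean canConvert(String fileFormatClass,
                              Map<String, String> options,
                              Optional<String> tableType) {
        return isSupportedFileFormat(fileFormatClass)
                && !isTimeTravelQuery(options)
                && isCopyOnWrite(tableType);
    }
}
```

The real implementation also verifies the FileIndex class hierarchy and resolves the table type through the options → catalog → `hoodie.properties` fallback chain, which this sketch collapses into a single `Optional<String>` parameter.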
1 parent 7da2ba5 commit 10cf0fd

File tree

12 files changed: +1009, -3 lines

.github/workflows/hudi.yml

Lines changed: 108 additions & 0 deletions

```yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

name: Hudi

on:
  workflow_dispatch:
  push:
    branches:
      - master
      - branch-*
  pull_request:
    branches:
      - master
      - branch-*

concurrency:
  group: hudi-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  test-hudi:
    name: Test Hudi (${{ matrix.sparkver }} / JDK${{ matrix.javaver }} / Scala${{ matrix.scalaver }})
    runs-on: ubuntu-24.04
    strategy:
      fail-fast: false
      matrix:
        include:
          - sparkver: "3.0"
            scalaver: "2.12"
            javaver: "8"
            hudiver: "0.15"
          - sparkver: "3.1"
            scalaver: "2.12"
            javaver: "8"
            hudiver: "0.15"
          - sparkver: "3.2"
            scalaver: "2.12"
            javaver: "8"
            hudiver: "0.15"
          - sparkver: "3.3"
            scalaver: "2.12"
            javaver: "8"
            hudiver: "0.15"
          - sparkver: "3.4"
            scalaver: "2.12"
            javaver: "17"
            hudiver: "0.15"
          - sparkver: "3.5"
            scalaver: "2.12"
            javaver: "17"
            hudiver: "0.15"
          - sparkver: "3.5"
            scalaver: "2.12"
            hudiver: "0.15"
            javaver: "21"

    steps:
      - name: Checkout Auron
        uses: actions/checkout@v6

      - name: Setup Java and Maven cache
        uses: actions/setup-java@v5
        with:
          distribution: 'adopt-hotspot'
          java-version: ${{ matrix.javaver }}
          cache: 'maven'

      - name: Build dependencies (skip tests)
        run: >
          ./build/mvn -B install
          -pl thirdparty/auron-hudi
          -am
          -Pscala-${{ matrix.scalaver }}
          -Pspark-${{ matrix.sparkver }}
          -Phudi-${{ matrix.hudiver }}
          -Prelease
          -DskipTests

      - name: Test Hudi Module
        run: >
          ./build/mvn -B test
          -pl thirdparty/auron-hudi
          -Pscala-${{ matrix.scalaver }}
          -Pspark-${{ matrix.sparkver }}
          -Phudi-${{ matrix.hudiver }}
          -Prelease

      - name: Upload reports
        if: failure()
        uses: actions/upload-artifact@v6
        with:
          name: auron-hudi-${{ matrix.sparkver }}-hudi${{ matrix.hudiver }}-jdk${{ matrix.javaver }}-test-report
          path: thirdparty/auron-hudi/target/surefire-reports
```

auron-build.sh

Lines changed: 30 additions & 1 deletion

```diff
@@ -38,6 +38,7 @@ SUPPORTED_UNIFFLE_VERSIONS=("0.10")
 SUPPORTED_PAIMON_VERSIONS=("1.2")
 SUPPORTED_FLINK_VERSIONS=("1.18")
 SUPPORTED_ICEBERG_VERSIONS=("1.10.1")
+SUPPORTED_HUDI_VERSIONS=("0.15")
 
 # -----------------------------------------------------------------------------
 # Function: print_help
@@ -64,6 +65,7 @@ print_help() {
   IFS=','; echo "  --uniffle <VERSION>   Specify Uniffle version (e.g. ${SUPPORTED_UNIFFLE_VERSIONS[*]})"; unset IFS
   IFS=','; echo "  --paimon <VERSION>    Specify Paimon version (e.g. ${SUPPORTED_PAIMON_VERSIONS[*]})"; unset IFS
   IFS=','; echo "  --iceberg <VERSION>   Specify Iceberg version (e.g. ${SUPPORTED_ICEBERG_VERSIONS[*]})"; unset IFS
+  IFS=','; echo "  --hudi <VERSION>      Specify Hudi version (e.g. ${SUPPORTED_HUDI_VERSIONS[*]})"; unset IFS
 
   echo "  -h, --help            Show this help message"
   echo
@@ -78,7 +80,8 @@ print_help() {
     "--celeborn ${SUPPORTED_CELEBORN_VERSIONS[*]: -1}" \
     "--uniffle ${SUPPORTED_UNIFFLE_VERSIONS[*]: -1}" \
     "--paimon ${SUPPORTED_PAIMON_VERSIONS[*]: -1}" \
-    "--iceberg ${SUPPORTED_ICEBERG_VERSIONS[*]: -1}"
+    "--iceberg ${SUPPORTED_ICEBERG_VERSIONS[*]: -1}" \
+    "--hudi ${SUPPORTED_HUDI_VERSIONS[*]: -1}"
   exit 0
 }
 
@@ -135,6 +138,7 @@ CELEBORN_VER=""
 UNIFFLE_VER=""
 PAIMON_VER=""
 ICEBERG_VER=""
+HUDI_VER=""
 
 # -----------------------------------------------------------------------------
 # Section: Argument Parsing
@@ -301,6 +305,27 @@ while [[ $# -gt 0 ]]; do
         exit 1
       fi
       ;;
+    --hudi)
+      if [[ -n "$2" && "$2" != -* ]]; then
+        HUDI_VER="$2"
+        if ! check_supported_version "$HUDI_VER" "${SUPPORTED_HUDI_VERSIONS[@]}"; then
+          print_invalid_option_error Hudi "$HUDI_VER" "${SUPPORTED_HUDI_VERSIONS[@]}"
+        fi
+        if [ -z "$SPARK_VER" ]; then
+          echo "ERROR: Building hudi requires spark at the same time, and only Spark versions 3.0 to 3.5 are supported."
+          exit 1
+        fi
+        if [ "$SPARK_VER" != "3.0" ] && [ "$SPARK_VER" != "3.1" ] && [ "$SPARK_VER" != "3.2" ] && [ "$SPARK_VER" != "3.3" ] && [ "$SPARK_VER" != "3.4" ] && [ "$SPARK_VER" != "3.5" ]; then
+          echo "ERROR: Building hudi requires spark versions are 3.0 to 3.5."
+          exit 1
+        fi
+        shift 2
+      else
+        IFS=','; echo "ERROR: Missing argument for --hudi," \
+          "specify one of: ${SUPPORTED_HUDI_VERSIONS[*]}" >&2; unset IFS
+        exit 1
+      fi
+      ;;
     --flinkver)
       if [[ -n "$2" && "$2" != -* ]]; then
         FLINK_VER="$2"
@@ -437,6 +462,9 @@ fi
 if [[ -n "$ICEBERG_VER" ]]; then
   BUILD_ARGS+=("-Piceberg-$ICEBERG_VER")
 fi
+if [[ -n "$HUDI_VER" ]]; then
+  BUILD_ARGS+=("-Phudi-$HUDI_VER")
+fi
 
 # Configure Maven build threads:
 # - local builds default to Maven's single-threaded behavior
@@ -473,6 +501,7 @@ get_build_info() {
     "paimon.version") echo "${PAIMON_VER}" ;;
     "flink.version") echo "${FLINK_VER}" ;;
     "iceberg.version") echo "${ICEBERG_VER}" ;;
+    "hudi.version") echo "${HUDI_VER}" ;;
     "build.timestamp") echo "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" ;;
     *) echo "" ;;
   esac
```
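The `--hudi` branch above performs two gates: the requested Hudi version must be in the supported list, and a Spark version between 3.0 and 3.5 must already be selected. A minimal Java sketch of that same validation logic (class and method names are illustrative; the real logic is the shell shown above):

```java
import java.util.List;
import java.util.Set;

// Illustrative re-statement of the auron-build.sh --hudi validation.
class HudiBuildGuard {
    // Mirrors SUPPORTED_HUDI_VERSIONS and the Spark 3.0-3.5 guard in the script.
    static final List<String> SUPPORTED_HUDI_VERSIONS = List.of("0.15");
    static final Set<String> HUDI_COMPATIBLE_SPARK =
            Set.of("3.0", "3.1", "3.2", "3.3", "3.4", "3.5");

    /** Mirrors check_supported_version: exact membership test. */
    static boolean isSupportedHudiVersion(String hudiVer) {
        return SUPPORTED_HUDI_VERSIONS.contains(hudiVer);
    }

    /** --hudi requires a Spark version to be set, and it must be 3.0-3.5. */
    static boolean validate(String hudiVer, String sparkVer) {
        if (!isSupportedHudiVersion(hudiVer)) {
            return false; // script calls print_invalid_option_error here
        }
        return sparkVer != null && HUDI_COMPATIBLE_SPARK.contains(sparkVer);
    }
}
```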

dev/reformat

Lines changed: 1 addition & 1 deletion

```diff
@@ -146,7 +146,7 @@ sparkver=spark-3.5
 prepare_for_spark "${sparkver}"
 for celebornver in celeborn-0.5 celeborn-0.6
 do
-  run_maven_format -P"${sparkver}" -Pceleborn,"${celebornver}" -Puniffle,uniffle-0.10 -Ppaimon,paimon-1.2 -Pflink-1.18 -Piceberg-1.10.1
+  run_maven_format -P"${sparkver}" -Pceleborn,"${celebornver}" -Puniffle,uniffle-0.10 -Ppaimon,paimon-1.2 -Pflink-1.18 -Piceberg-1.10.1 -Phudi-0.15
 
 done
```
pom.xml

Lines changed: 10 additions & 0 deletions

```diff
@@ -1319,6 +1319,16 @@
         </properties>
     </profile>
 
+    <profile>
+        <id>hudi-0.15</id>
+        <modules>
+            <module>thirdparty/auron-hudi</module>
+        </modules>
+        <properties>
+            <hudiVersion>0.15.0</hudiVersion>
+        </properties>
+    </profile>
+
     <profile>
         <id>flink-1.18</id>
         <modules>
```

spark-extension/src/main/java/org/apache/auron/spark/configuration/SparkAuronConfiguration.java

Lines changed: 6 additions & 0 deletions

```diff
@@ -327,6 +327,12 @@ public class SparkAuronConfiguration extends AuronConfiguration {
             .withDescription("Enable Iceberg scan operation conversion to native Auron implementations.")
             .withDefaultValue(true);
 
+    public static final ConfigOption<Boolean> ENABLE_HUDI_SCAN = new SQLConfOption<>(Boolean.class)
+            .withKey("auron.enable.hudi.scan")
+            .withCategory("Operator Supports")
+            .withDescription("Enable Hudi scan operation conversion to native Auron implementations.")
+            .withDefaultValue(true);
+
     public static final ConfigOption<Boolean> ENABLE_PROJECT = new SQLConfOption<>(Boolean.class)
             .withKey("auron.enable.project")
             .withCategory("Operator Supports")
```
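The new option follows the fluent `ConfigOption` pattern visible in the diff. A minimal, self-contained sketch of that builder style (the real `SQLConfOption` also binds the key into Spark's SQL conf with the `spark.` prefix; `MiniConfigOption` here is purely illustrative):

```java
// Minimal stand-in for Auron's ConfigOption/SQLConfOption builder pattern.
class MiniConfigOption<T> {
    private String key;
    private String category;
    private String description;
    private T defaultValue;

    MiniConfigOption<T> withKey(String key) { this.key = key; return this; }
    MiniConfigOption<T> withCategory(String category) { this.category = category; return this; }
    MiniConfigOption<T> withDescription(String description) { this.description = description; return this; }
    MiniConfigOption<T> withDefaultValue(T defaultValue) { this.defaultValue = defaultValue; return this; }

    String key() { return key; }
    String category() { return category; }
    String description() { return description; }
    T defaultValue() { return defaultValue; }
}
```

Declaring the Hudi option with this sketch would look just like the diff:

```java
MiniConfigOption<Boolean> enableHudiScan = new MiniConfigOption<Boolean>()
        .withKey("auron.enable.hudi.scan")
        .withCategory("Operator Supports")
        .withDescription("Enable Hudi scan operation conversion to native Auron implementations.")
        .withDefaultValue(true);
```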

spark-extension/src/main/scala/org/apache/spark/sql/auron/AuronConverters.scala

Lines changed: 4 additions & 1 deletion

```diff
@@ -161,7 +161,10 @@ object AuronConverters extends Logging {
       case e: BroadcastExchangeExec if enableBroadcastExchange =>
         tryConvert(e, convertBroadcastExchangeExec)
       case e: FileSourceScanExec if enableScan => // scan
-        tryConvert(e, convertFileSourceScanExec)
+        extConvertProviders.find(p => p.isEnabled && p.isSupported(e)) match {
+          case Some(provider) => tryConvert(e, provider.convert)
+          case None => tryConvert(e, convertFileSourceScanExec)
+        }
       case e: ProjectExec if enableProject => // project
         tryConvert(e, convertProjectExec)
       case e: FilterExec if enableFilter => // filter
```
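The Scala change above dispatches each `FileSourceScanExec` to the first external provider that is both enabled and supports the node, falling back to the built-in conversion otherwise. The same pattern, sketched in standalone Java (the `ConvertProvider` interface here is illustrative; Auron's actual SPI is `AuronConvertProvider`, and `String` stands in for the Spark plan types):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch of the provider-dispatch pattern in AuronConverters.
class ProviderDispatch {
    interface ConvertProvider {
        boolean isEnabled();
        boolean isSupported(String plan);  // stands in for FileSourceScanExec
        String convert(String plan);       // stands in for the native plan
    }

    /** First enabled, supporting provider wins; otherwise use the built-in path. */
    static String convertScan(List<ConvertProvider> providers,
                              String plan,
                              Function<String, String> builtInConvert) {
        Optional<ConvertProvider> match = providers.stream()
                .filter(p -> p.isEnabled() && p.isSupported(plan))
                .findFirst();
        return match.map(p -> p.convert(plan))
                .orElseGet(() -> builtInConvert.apply(plan));
    }
}
```

Because unsupported scans fall through to `convertFileSourceScanExec`, adding a provider (like `HudiConvertProvider`) cannot regress the existing non-Hudi scan path.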

thirdparty/auron-hudi/pom.xml

Lines changed: 116 additions & 0 deletions

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one or more
  ~ contributor license agreements. See the NOTICE file distributed with
  ~ this work for additional information regarding copyright ownership.
  ~ The ASF licenses this file to You under the Apache License, Version 2.0
  ~ (the "License"); you may not use this file except in compliance with
  ~ the License. You may obtain a copy of the License at
  ~
  ~    http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
  -->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.apache.auron</groupId>
        <artifactId>auron-parent_${scalaVersion}</artifactId>
        <version>${project.version}</version>
        <relativePath>../../pom.xml</relativePath>
    </parent>

    <artifactId>auron-hudi_${scalaVersion}</artifactId>
    <packaging>jar</packaging>
    <name>Apache Auron Hudi ${hudiVersion} ${scalaVersion}</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.auron</groupId>
            <artifactId>spark-extension_${scalaVersion}</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark${shortSparkVersion}-bundle_${scalaVersion}</artifactId>
            <version>${hudiVersion}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scalaVersion}</artifactId>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_${scalaVersion}</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scalaVersion}</artifactId>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scalaVersion}</artifactId>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-catalyst_${scalaVersion}</artifactId>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scalaVersion}</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.auron</groupId>
            <artifactId>spark-extension-shims-spark_${scalaVersion}</artifactId>
            <version>${project.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-enforcer-plugin</artifactId>
                <version>${maven-enforcer-plugin.version}</version>
                <executions>
                    <execution>
                        <id>hudi-spark-version-compat</id>
                        <goals>
                            <goal>enforce</goal>
                        </goals>
                        <configuration>
                            <rules>
                                <requireProperty>
                                    <property>shortSparkVersion</property>
                                    <regex>^(3\.0|3\.1|3\.2|3\.3|3\.4|3\.5)$</regex>
                                    <regexMessage>Hudi integration supports Spark 3.0-3.5 only. Current: ${shortSparkVersion}</regexMessage>
                                </requireProperty>
                                <requireProperty>
                                    <property>hudiVersion</property>
                                    <regex>^0\.15\.0$</regex>
                                    <regexMessage>Hudi integration supports only Hudi 0.15.0. Current: ${hudiVersion}</regexMessage>
                                </requireProperty>
                            </rules>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
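The two enforcer `requireProperty` rules gate the build with plain regexes. Evaluating the same patterns directly shows which version strings pass (the patterns are copied verbatim from the pom above; the class name is illustrative):

```java
import java.util.regex.Pattern;

// Evaluates the enforcer regexes from thirdparty/auron-hudi/pom.xml.
class EnforcerRegexCheck {
    static final Pattern SPARK_OK = Pattern.compile("^(3\\.0|3\\.1|3\\.2|3\\.3|3\\.4|3\\.5)$");
    static final Pattern HUDI_OK = Pattern.compile("^0\\.15\\.0$");

    /** shortSparkVersion must be exactly one of 3.0 through 3.5. */
    static boolean sparkAllowed(String shortSparkVersion) {
        return SPARK_OK.matcher(shortSparkVersion).matches();
    }

    /** hudiVersion must be exactly 0.15.0. */
    static boolean hudiAllowed(String hudiVersion) {
        return HUDI_OK.matcher(hudiVersion).matches();
    }
}
```

Note the anchors and escaped dots: `3.6`, `2.4`, or the short form `0.15` all fail, which is exactly the build-time behavior the profile enforces.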
Lines changed: 18 additions & 0 deletions

```text
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

org.apache.spark.sql.auron.hudi.HudiConvertProvider
```
