Commit 7f3d292

Merge branch 'apache:main' into main
2 parents cd82f03 + 4f8ce75 commit 7f3d292

174 files changed, +8357 -4310 lines changed

.github/actions/setup-spark-builder/action.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ inputs:
   comet-version:
     description: 'The Comet version to use for Spark'
     required: true
-    default: '0.4.0-SNAPSHOT'
+    default: '0.5.0-SNAPSHOT'
 runs:
   using: "composite"
   steps:

.github/workflows/spark_sql_test.yml

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ jobs:
         with:
           spark-version: ${{ matrix.spark-version.full }}
           spark-short-version: ${{ matrix.spark-version.short }}
-          comet-version: '0.4.0-SNAPSHOT' # TODO: get this from pom.xml
+          comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
       - name: Run Spark tests
         run: |
           cd apache-spark

.github/workflows/spark_sql_test_ansi.yml

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ jobs:
         with:
           spark-version: ${{ matrix.spark-version.full }}
           spark-short-version: ${{ matrix.spark-version.short }}
-          comet-version: '0.4.0-SNAPSHOT' # TODO: get this from pom.xml
+          comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
       - name: Run Spark tests
         run: |
           cd apache-spark

README.md

Lines changed: 6 additions & 6 deletions
@@ -46,7 +46,7 @@ The following chart shows the time it takes to run the 22 TPC-H queries against
 using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
 for details of the environment used for these benchmarks.
 
-When using Comet, the overall run time is reduced from 616 seconds to 374 seconds, a 1.6x speedup, with query 1
+When using Comet, the overall run time is reduced from 615 seconds to 364 seconds, a 1.7x speedup, with query 1
 running 9x faster than Spark.
 
 Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.6x
@@ -55,21 +55,21 @@ speedup compared to Spark.
 Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup
 for a broader set of queries.
 
-![](docs/source/_static/images/benchmark-results/0.3.0/tpch_allqueries.png)
+![](docs/source/_static/images/benchmark-results/0.4.0/tpch_allqueries.png)
 
 Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query.
 
-![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_compare.png)
+![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_compare.png)
 
 The following charts shows how much Comet currently accelerates each query from the benchmark.
 
 ### Relative speedup
 
-![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_speedup_rel.png)
+![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_speedup_rel.png)
 
 ### Absolute speedup
 
-![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_speedup_abs.png)
+![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_speedup_abs.png)
 
 These benchmarks can be reproduced in any environment using the documentation in the
 [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
@@ -80,7 +80,7 @@ Results for our benchmark derived from TPC-DS are available in the [benchmarking
 ## Use Commodity Hardware
 
 Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
-specialized hardware accelerators, such as GPUs or FGPA. By maximizing the utilization of commodity hardware, Comet
+specialized hardware accelerators, such as GPUs or FPGA. By maximizing the utilization of commodity hardware, Comet
 ensures cost-effectiveness and scalability for your Spark deployments.
 
 ## Spark Compatibility
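
Reviewer note: the revised speedup figure in the README hunk above can be sanity-checked with simple arithmetic, since the speedup is the Spark run time divided by the Comet run time. The sketch below is only an illustration; `roundedSpeedup` is a hypothetical helper and not part of the Comet repository.

```scala
object SpeedupCheck {
  // Hypothetical helper: speedup = baseline time / accelerated time,
  // rounded to one decimal place as the README reports it.
  def roundedSpeedup(sparkSeconds: Double, cometSeconds: Double): Double =
    math.round(sparkSeconds / cometSeconds * 10) / 10.0

  def main(args: Array[String]): Unit = {
    // Figures taken from the removed and added README lines.
    println(roundedSpeedup(616, 374)) // old text: 616 s vs 374 s
    println(roundedSpeedup(615, 364)) // new text: 615 s vs 364 s
  }
}
```

With the numbers from the diff, 616 / 374 is about 1.65 and 615 / 364 is about 1.69, which is consistent with the change from "1.6x" to "1.7x".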

benchmarks/README.md

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ docker push localhost:32000/apache/datafusion-comet-tpcbench:latest
 export SPARK_MASTER=k8s://https://127.0.0.1:16443
 export COMET_DOCKER_IMAGE=localhost:32000/apache/datafusion-comet-tpcbench:latest
 # Location of Comet JAR within the Docker image
-export COMET_JAR=/opt/spark/jars/comet-spark-spark3.4_2.12-0.2.0-SNAPSHOT.jar
+export COMET_JAR=/opt/spark/jars/comet-spark-spark3.4_2.12-0.5.0-SNAPSHOT.jar
 
 $SPARK_HOME/bin/spark-submit \
     --master $SPARK_MASTER \

common/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ under the License.
   <parent>
     <groupId>org.apache.datafusion</groupId>
     <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-    <version>0.4.0-SNAPSHOT</version>
+    <version>0.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 28 additions & 5 deletions
@@ -272,13 +272,21 @@ object CometConf extends ShimCometConf {
       .booleanConf
       .createWithDefault(false)
 
-  val COMET_EXEC_SHUFFLE_CODEC: ConfigEntry[String] = conf(
-    s"$COMET_EXEC_CONFIG_PREFIX.shuffle.codec")
+  val COMET_EXEC_SHUFFLE_COMPRESSION_CODEC: ConfigEntry[String] = conf(
+    s"$COMET_EXEC_CONFIG_PREFIX.shuffle.compression.codec")
     .doc(
-      "The codec of Comet native shuffle used to compress shuffle data. Only zstd is supported.")
+      "The codec of Comet native shuffle used to compress shuffle data. Only zstd is supported. " +
+        "Compression can be disabled by setting spark.shuffle.compress=false.")
     .stringConf
+    .checkValues(Set("zstd"))
     .createWithDefault("zstd")
 
+  val COMET_EXEC_SHUFFLE_COMPRESSION_LEVEL: ConfigEntry[Int] =
+    conf(s"$COMET_EXEC_CONFIG_PREFIX.shuffle.compression.level")
+      .doc("The compression level to use when compression shuffle files.")
+      .intConf
+      .createWithDefault(1)
+
   val COMET_COLUMNAR_SHUFFLE_ASYNC_ENABLED: ConfigEntry[Boolean] =
     conf("spark.comet.columnar.shuffle.async.enabled")
       .doc("Whether to enable asynchronous shuffle for Arrow-based shuffle.")
@@ -322,8 +330,10 @@ object CometConf extends ShimCometConf {
 
   val COMET_COLUMNAR_SHUFFLE_MEMORY_SIZE: OptionalConfigEntry[Long] =
     conf("spark.comet.columnar.shuffle.memorySize")
+      .internal()
       .doc(
-        "The optional maximum size of the memory used for Comet columnar shuffle, in MiB. " +
+        "Test-only config. This is only used to test Comet shuffle with Spark tests. " +
+          "The optional maximum size of the memory used for Comet columnar shuffle, in MiB. " +
           "Note that this config is only used when `spark.comet.exec.shuffle.mode` is " +
           "`jvm`. Once allocated memory size reaches this config, the current batch will be " +
           "flushed to disk immediately. If this is not configured, Comet will use " +
@@ -335,8 +345,10 @@ object CometConf extends ShimCometConf {
 
   val COMET_COLUMNAR_SHUFFLE_MEMORY_FACTOR: ConfigEntry[Double] =
     conf("spark.comet.columnar.shuffle.memory.factor")
+      .internal()
       .doc(
-        "Fraction of Comet memory to be allocated per executor process for Comet shuffle. " +
+        "Test-only config. This is only used to test Comet shuffle with Spark tests. " +
+          "Fraction of Comet memory to be allocated per executor process for Comet shuffle. " +
           "Comet memory size is specified by `spark.comet.memoryOverhead` or " +
           "calculated by `spark.comet.memory.overhead.factor` * `spark.executor.memory`.")
       .doubleConf
@@ -345,6 +357,17 @@ object CometConf extends ShimCometConf {
         "Ensure that Comet shuffle memory overhead factor is a double greater than 0")
       .createWithDefault(1.0)
 
+  val COMET_COLUMNAR_SHUFFLE_UNIFIED_MEMORY_ALLOCATOR_IN_TEST: ConfigEntry[Boolean] =
+    conf("spark.comet.columnar.shuffle.unifiedMemoryAllocatorTest")
+      .doc("Whether to use Spark unified memory allocator for Comet columnar shuffle in tests." +
+        "If not configured, Comet will use a test-only memory allocator for Comet columnar " +
+        "shuffle when Spark test env detected. The test-ony allocator is proposed to run with " +
+        "Spark tests as these tests require on-heap memory configuration. " +
+        "By default, this config is false.")
+      .internal()
+      .booleanConf
+      .createWithDefault(false)
+
   val COMET_COLUMNAR_SHUFFLE_BATCH_SIZE: ConfigEntry[Int] =
     conf("spark.comet.columnar.shuffle.batch.size")
       .internal()
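
Reviewer note: the hunk above renames the shuffle codec key, restricts it to `zstd` via `checkValues`, and adds a compression level with a default of 1. The sketch below mimics that behaviour in plain Scala so the effect of the new constraints is easy to see. It is an illustration only: it assumes `COMET_EXEC_CONFIG_PREFIX` resolves to `spark.comet.exec` (inferred from the `spark.comet.exec.shuffle.mode` key mentioned in the doc strings), and `ShuffleCompressionSettings` and `validate` are hypothetical names that do not exist in Comet.

```scala
import scala.util.Try

object ShuffleCompressionSettings {
  // Assumption: COMET_EXEC_CONFIG_PREFIX is "spark.comet.exec".
  val Prefix = "spark.comet.exec"
  val CodecKey = s"$Prefix.shuffle.compression.codec"
  val LevelKey = s"$Prefix.shuffle.compression.level"

  // Mirrors .checkValues(Set("zstd")) from the diff.
  val SupportedCodecs: Set[String] = Set("zstd")

  // Hypothetical helper: fills in the defaults shown in the diff
  // ("zstd" and 1) and rejects values the new entries would not accept.
  def validate(settings: Map[String, String]): Either[String, Map[String, String]] = {
    val codec = settings.getOrElse(CodecKey, "zstd")
    val level = settings.getOrElse(LevelKey, "1")
    if (!SupportedCodecs.contains(codec))
      Left(s"$CodecKey must be one of ${SupportedCodecs.mkString(", ")}, got '$codec'")
    else if (Try(level.toInt).isFailure)
      Left(s"$LevelKey must be an integer, got '$level'")
    else
      Right(settings + (CodecKey -> codec) + (LevelKey -> level))
  }
}
```

Under these assumptions, a job that previously set the old `shuffle.codec` key would need to switch to the `shuffle.compression.codec` key. According to the new doc string, compression itself is turned off through `spark.shuffle.compress=false` rather than through the codec setting.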

dev/changelog/0.4.0.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataFusion Comet 0.4.0 Changelog
+
+This release consists of 51 commits from 10 contributors. See credits at the end of this changelog for more information.
+
+**Fixed bugs:**
+
+- fix: Use the number of rows from underlying arrays instead of logical row count from RecordBatch [#972](https://github.com/apache/datafusion-comet/pull/972) (viirya)
+- fix: The spilled_bytes metric of CometSortExec should be size instead of time [#984](https://github.com/apache/datafusion-comet/pull/984) (Kontinuation)
+- fix: Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path [#982](https://github.com/apache/datafusion-comet/pull/982) (Kontinuation)
+- fix: Fallback to Spark if scan has meta columns [#997](https://github.com/apache/datafusion-comet/pull/997) (viirya)
+- fix: Fallback to Spark if named_struct contains duplicate field names [#1016](https://github.com/apache/datafusion-comet/pull/1016) (viirya)
+- fix: Make comet-git-info.properties optional [#1027](https://github.com/apache/datafusion-comet/pull/1027) (andygrove)
+- fix: TopK operator should return correct results on dictionary column with nulls [#1033](https://github.com/apache/datafusion-comet/pull/1033) (viirya)
+- fix: need default value for getSizeAsMb(EXECUTOR_MEMORY.key) [#1046](https://github.com/apache/datafusion-comet/pull/1046) (neyama)
+
+**Performance related:**
+
+- perf: Remove one redundant CopyExec for SMJ [#962](https://github.com/apache/datafusion-comet/pull/962) (andygrove)
+- perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin [#1007](https://github.com/apache/datafusion-comet/pull/1007) (andygrove)
+- perf: Cache jstrings during metrics collection [#1029](https://github.com/apache/datafusion-comet/pull/1029) (mbutrovich)
+
+**Implemented enhancements:**
+
+- feat: Support `GetArrayStructFields` expression [#993](https://github.com/apache/datafusion-comet/pull/993) (Kimahriman)
+- feat: Implement bloom_filter_agg [#987](https://github.com/apache/datafusion-comet/pull/987) (mbutrovich)
+- feat: Support more types with BloomFilterAgg [#1039](https://github.com/apache/datafusion-comet/pull/1039) (mbutrovich)
+- feat: Implement CAST from struct to string [#1066](https://github.com/apache/datafusion-comet/pull/1066) (andygrove)
+- feat: Use official DataFusion 43 release [#1070](https://github.com/apache/datafusion-comet/pull/1070) (andygrove)
+- feat: Implement CAST between struct types [#1074](https://github.com/apache/datafusion-comet/pull/1074) (andygrove)
+- feat: support array_append [#1072](https://github.com/apache/datafusion-comet/pull/1072) (NoeB)
+- feat: Require offHeap memory to be enabled (always use unified memory) [#1062](https://github.com/apache/datafusion-comet/pull/1062) (andygrove)
+
+**Documentation updates:**
+
+- doc: add documentation interlinks [#975](https://github.com/apache/datafusion-comet/pull/975) (comphead)
+- docs: Add IntelliJ documentation for generated source code [#985](https://github.com/apache/datafusion-comet/pull/985) (mbutrovich)
+- docs: Update tuning guide [#995](https://github.com/apache/datafusion-comet/pull/995) (andygrove)
+- docs: Various documentation improvements [#1005](https://github.com/apache/datafusion-comet/pull/1005) (andygrove)
+- docs: clarify that Maven central only has jars for Linux [#1009](https://github.com/apache/datafusion-comet/pull/1009) (andygrove)
+- doc: fix K8s links and doc [#1058](https://github.com/apache/datafusion-comet/pull/1058) (comphead)
+- docs: Update benchmarking.md [#1085](https://github.com/apache/datafusion-comet/pull/1085) (rluvaton-flarion)
+
+**Other:**
+
+- chore: Generate changelog for 0.3.0 release [#964](https://github.com/apache/datafusion-comet/pull/964) (andygrove)
+- chore: fix publish-to-maven script [#966](https://github.com/apache/datafusion-comet/pull/966) (andygrove)
+- chore: Update benchmarks results based on 0.3.0-rc1 [#969](https://github.com/apache/datafusion-comet/pull/969) (andygrove)
+- chore: update rem expression guide [#976](https://github.com/apache/datafusion-comet/pull/976) (kazuyukitanimura)
+- chore: Enable additional CreateArray tests [#928](https://github.com/apache/datafusion-comet/pull/928) (Kimahriman)
+- chore: fix compatibility guide [#978](https://github.com/apache/datafusion-comet/pull/978) (kazuyukitanimura)
+- chore: Update for 0.3.0 release, prepare for 0.4.0 development [#970](https://github.com/apache/datafusion-comet/pull/970) (andygrove)
+- chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled [#991](https://github.com/apache/datafusion-comet/pull/991) (viirya)
+- chore: Make parquet reader options Comet options instead of Hadoop options [#968](https://github.com/apache/datafusion-comet/pull/968) (parthchandra)
+- chore: remove legacy comet-spark-shell [#1013](https://github.com/apache/datafusion-comet/pull/1013) (andygrove)
+- chore: Reserve memory for native shuffle writer per partition [#988](https://github.com/apache/datafusion-comet/pull/988) (viirya)
+- chore: Bump arrow-rs to 53.1.0 and datafusion [#1001](https://github.com/apache/datafusion-comet/pull/1001) (kazuyukitanimura)
+- chore: Revert "chore: Reserve memory for native shuffle writer per partition (#988)" [#1020](https://github.com/apache/datafusion-comet/pull/1020) (viirya)
+- minor: Remove hard-coded version number from Dockerfile [#1025](https://github.com/apache/datafusion-comet/pull/1025) (andygrove)
+- chore: Reserve memory for native shuffle writer per partition [#1022](https://github.com/apache/datafusion-comet/pull/1022) (viirya)
+- chore: Improve error handling when native lib fails to load [#1000](https://github.com/apache/datafusion-comet/pull/1000) (andygrove)
+- chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation [#1041](https://github.com/apache/datafusion-comet/pull/1041) (NoeB)
+- chore: Refactor Arrow Array and Schema allocation in ColumnReader and MetadataColumnReader [#1047](https://github.com/apache/datafusion-comet/pull/1047) (viirya)
+- minor: Refactor binary expr serde to reduce code duplication [#1053](https://github.com/apache/datafusion-comet/pull/1053) (andygrove)
+- chore: Upgrade to DataFusion 43.0.0-rc1 [#1057](https://github.com/apache/datafusion-comet/pull/1057) (andygrove)
+- chore: Refactor UnaryExpr and MathExpr in protobuf [#1056](https://github.com/apache/datafusion-comet/pull/1056) (andygrove)
+- minor: use defaults instead of hard-coding values [#1060](https://github.com/apache/datafusion-comet/pull/1060) (andygrove)
+- minor: refactor UnaryExpr handling to make code more concise [#1065](https://github.com/apache/datafusion-comet/pull/1065) (andygrove)
+- chore: Refactor binary and math expression serde code [#1069](https://github.com/apache/datafusion-comet/pull/1069) (andygrove)
+- chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator [#1063](https://github.com/apache/datafusion-comet/pull/1063) (viirya)
+- test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config [#1087](https://github.com/apache/datafusion-comet/pull/1087) (viirya)
+
+## Credits
+
+Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
+
+```
+19 Andy Grove
+13 Matt Butrovich
+8 Liang-Chi Hsieh
+3 KAZUYUKI TANIMURA
+2 Adam Binford
+2 Kristin Cowalcijk
+1 NoeB
+1 Oleks V
+1 Parth Chandra
+1 neyama
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

dev/diffs/3.4.3.diff

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ index d3544881af1..bf0e2b53c70 100644
      <ivy.version>2.5.1</ivy.version>
      <oro.version>2.0.8</oro.version>
 +    <spark.version.short>3.4</spark.version.short>
-+    <comet.version>0.4.0-SNAPSHOT</comet.version>
++    <comet.version>0.5.0-SNAPSHOT</comet.version>
      <!--
        If you changes codahale.metrics.version, you also need to change
        the link to metrics.dropwizard.io in docs/monitoring.md.

dev/diffs/3.5.1.diff

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ index 0f504dbee85..f6019da888a 100644
      <ivy.version>2.5.1</ivy.version>
      <oro.version>2.0.8</oro.version>
 +    <spark.version.short>3.5</spark.version.short>
-+    <comet.version>0.4.0-SNAPSHOT</comet.version>
++    <comet.version>0.5.0-SNAPSHOT</comet.version>
      <!--
        If you changes codahale.metrics.version, you also need to change
        the link to metrics.dropwizard.io in docs/monitoring.md.
