
Commit 9b9d1e1 (2 parents: cb53370 + 19f07b0)

Merge remote-tracking branch 'upstream/main'

# Conflicts:
#   native/core/src/parquet/parquet_support.rs

File tree: 25 files changed (+1301 / -2730 lines)


common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -614,7 +614,7 @@ object CometConf extends ShimCometConf {
         "Comet is not currently fully compatible with Spark for all datatypes. " +
           s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
       .booleanConf
-      .createWithDefault(true)
+      .createWithDefault(false)

   val COMET_EXPR_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
     conf("spark.comet.expression.allowIncompatible")
```

dev/changelog/0.6.0.md

Lines changed: 79 additions & 0 deletions
New file contents:

```markdown
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion Comet 0.6.0 Changelog

**Fixed bugs:**

- fix: cast timestamp to decimal is unsupported [#1281](https://github.com/apache/datafusion-comet/pull/1281) (wForget)
- fix: partially fix consistency issue of hash functions with decimal input [#1295](https://github.com/apache/datafusion-comet/pull/1295) (wForget)
- fix: Improve testing for array_remove and fallback to Spark for unsupported types [#1308](https://github.com/apache/datafusion-comet/pull/1308) (andygrove)
- fix: address post merge comet-parquet-exec review comments [#1327](https://github.com/apache/datafusion-comet/pull/1327) (parthchandra)
- fix: memory pool error type [#1346](https://github.com/apache/datafusion-comet/pull/1346) (kazuyukitanimura)
- fix: Fall back to Spark when hashing decimals with precision > 18 [#1325](https://github.com/apache/datafusion-comet/pull/1325) (andygrove)
- fix: expressions doc for ArrayRemove [#1356](https://github.com/apache/datafusion-comet/pull/1356) (kazuyukitanimura)
- fix: pass scale to DF round in spark_round [#1341](https://github.com/apache/datafusion-comet/pull/1341) (cht42)
- fix: Mark cast from float/double to decimal as incompatible [#1372](https://github.com/apache/datafusion-comet/pull/1372) (andygrove)
- fix: Passthrough condition in StaticInvoke case block [#1392](https://github.com/apache/datafusion-comet/pull/1392) (EmilyMatt)
- fix: disable checking for uint_8 and uint_16 if complex type readers are enabled [#1376](https://github.com/apache/datafusion-comet/pull/1376) (parthchandra)

**Performance related:**

- perf: improve performance of update metrics [#1329](https://github.com/apache/datafusion-comet/pull/1329) (wForget)
- perf: Use DataFusion FilterExec for experimental native scans [#1395](https://github.com/apache/datafusion-comet/pull/1395) (mbutrovich)

**Implemented enhancements:**

- feat: Add HasRowIdMapping interface [#1288](https://github.com/apache/datafusion-comet/pull/1288) (viirya)
- feat: Upgrade to DataFusion 45 [#1364](https://github.com/apache/datafusion-comet/pull/1364) (andygrove)
- feat: Add fair unified memory pool [#1369](https://github.com/apache/datafusion-comet/pull/1369) (kazuyukitanimura)
- feat: Add unbounded memory pool [#1386](https://github.com/apache/datafusion-comet/pull/1386) (kazuyukitanimura)
- feat: make random seed configurable in fuzz-testing [#1401](https://github.com/apache/datafusion-comet/pull/1401) (wForget)
- feat: override executor overhead memory only when comet unified memory manager is disabled [#1379](https://github.com/apache/datafusion-comet/pull/1379) (wForget)

**Documentation updates:**

- docs: Fix links and provide complete benchmarking scripts [#1284](https://github.com/apache/datafusion-comet/pull/1284) (andygrove)
- doc: update memory tuning guide [#1394](https://github.com/apache/datafusion-comet/pull/1394) (kazuyukitanimura)

**Other:**

- chore: Start 0.6.0 development [#1286](https://github.com/apache/datafusion-comet/pull/1286) (andygrove)
- minor: update compatibility [#1303](https://github.com/apache/datafusion-comet/pull/1303) (kazuyukitanimura)
- chore: extract conversion_funcs, conditional_funcs, bitwise_funcs and array_funcs expressions to folders based on spark grouping [#1223](https://github.com/apache/datafusion-comet/pull/1223) (rluvaton)
- chore: extract math_funcs expressions to folders based on spark grouping [#1219](https://github.com/apache/datafusion-comet/pull/1219) (rluvaton)
- chore: merge comet-parquet-exec branch into main [#1318](https://github.com/apache/datafusion-comet/pull/1318) (andygrove)
- Feat: Support array_intersect function [#1271](https://github.com/apache/datafusion-comet/pull/1271) (erenavsarogullari)
- build(deps): bump pprof from 0.13.0 to 0.14.0 in /native [#1319](https://github.com/apache/datafusion-comet/pull/1319) (dependabot[bot])
- chore: Fix merge conflicts from merging comet-parquet-exec into main [#1320](https://github.com/apache/datafusion-comet/pull/1320) (andygrove)
- chore: Revert accidental re-introduction of off-heap memory requirement [#1326](https://github.com/apache/datafusion-comet/pull/1326) (andygrove)
- chore: Fix merge conflicts from merging comet-parquet-exec into main [#1323](https://github.com/apache/datafusion-comet/pull/1323) (mbutrovich)
- Feat: Support array_join function [#1290](https://github.com/apache/datafusion-comet/pull/1290) (erenavsarogullari)
- Fix missing slash in spark script [#1334](https://github.com/apache/datafusion-comet/pull/1334) (xleoken)
- chore: Refactor QueryPlanSerde to allow logic to be moved to individual classes per expression [#1331](https://github.com/apache/datafusion-comet/pull/1331) (andygrove)
- build: re-enable upload-test-reports for macos-13 runner [#1335](https://github.com/apache/datafusion-comet/pull/1335) (viirya)
- chore: Upgrade to Arrow 53.4.0 [#1338](https://github.com/apache/datafusion-comet/pull/1338) (andygrove)
- Feat: Support arrays_overlap function [#1312](https://github.com/apache/datafusion-comet/pull/1312) (erenavsarogullari)
- chore: Move all array\_\* serde to new framework, use correct INCOMPAT config [#1349](https://github.com/apache/datafusion-comet/pull/1349) (andygrove)
- chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) [#1332](https://github.com/apache/datafusion-comet/pull/1332) (andygrove)
- minor: commit compatibility doc [#1358](https://github.com/apache/datafusion-comet/pull/1358) (kazuyukitanimura)
- minor: update fuzz dependency [#1357](https://github.com/apache/datafusion-comet/pull/1357) (kazuyukitanimura)
- chore: Remove redundant processing from exprToProtoInternal [#1351](https://github.com/apache/datafusion-comet/pull/1351) (andygrove)
- chore: Adding an optional `hdfs` crate [#1377](https://github.com/apache/datafusion-comet/pull/1377) (comphead)
- chore: Refactor aggregate expression serde [#1380](https://github.com/apache/datafusion-comet/pull/1380) (andygrove)
```

docs/source/user-guide/compatibility.md

Lines changed: 31 additions & 0 deletions
```diff
@@ -17,12 +17,43 @@ specific language governing permissions and limitations
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/compatibility-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/compatibility.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Compatibility Guide

 Comet aims to provide consistent results with the version of Apache Spark that is being used.

 This guide offers information about areas of functionality where there are known differences.

+## Parquet Scans
+
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
+
+| Implementation          | Description |
+| ----------------------- | ----------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
+| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`. |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
+provide the following benefits over the `native_comet` implementation:
+
+- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
+- Provide support for reading complex types (structs, arrays, and maps)
+- Remove the use of reusable mutable buffers in Comet, which are complex to maintain
+
+These new implementations are not yet complete. Some of the current limitations are:
+
+- Scanning Parquet files containing unsigned 8-bit or 16-bit integers can produce results that don't match Spark. By default,
+  Comet will fall back to Spark when using these scan implementations to read Parquet files containing 8-bit or 16-bit
+  integers. This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
+- These implementations do not yet fully support timestamps, decimals, or complex types.
+
 ## ANSI mode

 Comet currently ignores ANSI mode in most cases, and therefore can produce different results than Spark. By default,
```
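
The section above introduces `spark.comet.scan.impl`. A hedged sketch of selecting one of the experimental scans, assuming the property can be set when the session is created (the property name and its values come from the table above; the session setup and file path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: select the experimental DataFusion-backed Parquet scan.
// Values per the table above: native_comet (default), native_datafusion,
// native_iceberg_compat. The path below is a placeholder.
val spark = SparkSession
  .builder()
  .appName("comet-scan-impl")
  .config("spark.comet.scan.impl", "native_datafusion")
  .getOrCreate()

// Parquet reads now go through DataFusion's ParquetExec where supported,
// falling back to Spark for the cases listed under the limitations above.
val df = spark.read.parquet("/path/to/data.parquet")
```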

docs/source/user-guide/configs.md

Lines changed: 7 additions & 1 deletion
```diff
@@ -17,6 +17,12 @@ specific language governing permissions and limitations
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/configs-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/configs.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Comet Configuration Settings

 Comet provides the following configuration settings.
@@ -76,7 +82,7 @@ Comet provides the following configuration settings.
 | spark.comet.parquet.read.parallel.io.enabled | Whether to enable Comet's parallel reader for Parquet files. The parallel reader reads ranges of consecutive data in a file in parallel. It is faster for large files and row groups but uses more resources. | true |
 | spark.comet.parquet.read.parallel.io.thread-pool.size | The maximum number of parallel threads the parallel reader will use in a single executor. For executors configured with a smaller number of cores, use a smaller number. | 16 |
 | spark.comet.regexp.allowIncompatible | Comet is not currently fully compatible with Spark for all regular expressions. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
-| spark.comet.scan.allowIncompatible | Comet is not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | true |
+| spark.comet.scan.allowIncompatible | Comet is not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
 | spark.comet.scan.enabled | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | true |
 | spark.comet.scan.preFetch.enabled | Whether to enable pre-fetching feature of CometScan. | false |
 | spark.comet.scan.preFetch.threadNum | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |
```
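
Several of the scan-related settings in this table interact. A hedged sketch of tuning the parallel Parquet reader for an executor with few cores, using keys and defaults from the table above (the chosen values are examples, not recommendations):

```scala
import org.apache.spark.SparkConf

// Sketch: shrink the parallel reader's thread pool from its default of 16,
// as the table suggests for executors with fewer cores, and opt into
// pre-fetching. Keys and defaults come from the table; values are examples.
val conf = new SparkConf()
  .set("spark.comet.parquet.read.parallel.io.enabled", "true")         // default: true
  .set("spark.comet.parquet.read.parallel.io.thread-pool.size", "4")   // default: 16
  .set("spark.comet.scan.preFetch.enabled", "true")                    // default: false
  .set("spark.comet.scan.preFetch.threadNum", "2")                     // default: 2
```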

docs/templates/compatibility-template.md

Lines changed: 33 additions & 2 deletions
```diff
@@ -17,12 +17,43 @@
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/compatibility-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/compatibility.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Compatibility Guide

 Comet aims to provide consistent results with the version of Apache Spark that is being used.

 This guide offers information about areas of functionality where there are known differences.

+## Parquet Scans
+
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
+
+| Implementation          | Description |
+| ----------------------- | ----------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
+| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`. |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
+provide the following benefits over the `native_comet` implementation:
+
+- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
+- Provide support for reading complex types (structs, arrays, and maps)
+- Remove the use of reusable mutable buffers in Comet, which are complex to maintain
+
+These new implementations are not yet complete. Some of the current limitations are:
+
+- Scanning Parquet files containing unsigned 8-bit or 16-bit integers can produce results that don't match Spark. By default,
+  Comet will fall back to Spark when using these scan implementations to read Parquet files containing 8-bit or 16-bit
+  integers. This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
+- These implementations do not yet fully support timestamps, decimals, or complex types.
+
 ## ANSI mode

 Comet currently ignores ANSI mode in most cases, and therefore can produce different results than Spark. By default,
@@ -47,7 +78,7 @@ will fall back to Spark but can be enabled by setting `spark.comet.expression.al

 ## Array Expressions

-Comet has experimental support for a number of array expressions. These are experimental and currently marked
+Comet has experimental support for a number of array expressions. These are experimental and currently marked
 as incompatible and can be enabled by setting `spark.comet.expression.allowIncompatible=true`.

 ## Regular Expressions
@@ -82,5 +113,5 @@ The following cast operations are not compatible with Spark for all inputs and a

 ### Unsupported Casts

-Any cast not listed in the previous tables is currently unsupported. We are working on adding more. See the
+Any cast not listed in the previous tables is currently unsupported. We are working on adding more. See the
 [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.
```

(The last two hunks appear to be whitespace-only changes; the visible text is unchanged.)

docs/templates/configs-template.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -17,6 +17,12 @@
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/configs-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/configs.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Comet Configuration Settings

 Comet provides the following configuration settings.
```

fuzz-testing/src/main/scala/org/apache/comet/fuzz/Main.scala

Lines changed: 12 additions & 2 deletions
```diff
@@ -33,6 +33,8 @@ class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
     val numFiles: ScallopOption[Int] =
       opt[Int](required = true, descr = "Number of files to generate")
     val numRows: ScallopOption[Int] = opt[Int](required = true, descr = "Number of rows per file")
+    val randomSeed: ScallopOption[Long] =
+      opt[Long](required = false, descr = "Random seed to use")
     val generateArrays: ScallopOption[Boolean] =
       opt[Boolean](required = false, descr = "Whether to generate arrays")
     val generateStructs: ScallopOption[Boolean] =
@@ -48,6 +50,8 @@ class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
       opt[Int](required = false, descr = "Number of input files to use")
     val numQueries: ScallopOption[Int] =
       opt[Int](required = true, descr = "Number of queries to generate")
+    val randomSeed: ScallopOption[Long] =
+      opt[Long](required = false, descr = "Random seed to use")
   }
   addSubcommand(generateQueries)
   object runQueries extends Subcommand("run") {
@@ -67,11 +71,13 @@ object Main {
       .getOrCreate()

   def main(args: Array[String]): Unit = {
-    val r = new Random(42)
-
     val conf = new Conf(args.toIndexedSeq)
     conf.subcommand match {
       case Some(conf.generateData) =>
+        val r = conf.generateData.randomSeed.toOption match {
+          case Some(seed) => new Random(seed)
+          case None => new Random()
+        }
         val options = DataGenOptions(
           allowNull = true,
           generateArray = conf.generateData.generateArrays(),
@@ -87,6 +93,10 @@ object Main {
           options)
         }
       case Some(conf.generateQueries) =>
+        val r = conf.generateQueries.randomSeed.toOption match {
+          case Some(seed) => new Random(seed)
+          case None => new Random()
+        }
         QueryGen.generateRandomQueries(
           r,
           spark,
```
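
The new `randomSeed` option exists so fuzz runs are reproducible: two `scala.util.Random` instances constructed from the same seed yield identical sequences, so a failing run can be replayed by passing the seed that produced it. A minimal standalone sketch of that property:

```scala
import scala.util.Random

// Same seed => identical sequence, so a failing fuzz run can be replayed
// by re-running with the seed it used. Omitting the seed gives a fresh,
// nondeterministic run (the `case None => new Random()` branch above).
val a = new Random(42L)
val b = new Random(42L)
assert(Seq.fill(5)(a.nextInt()) == Seq.fill(5)(b.nextInt()))
```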

native/Cargo.lock

Lines changed: 0 additions & 1 deletion
Some generated files are not rendered by default.

native/core/Cargo.toml

Lines changed: 0 additions & 1 deletion
```diff
@@ -76,7 +76,6 @@ datafusion-comet-spark-expr = { workspace = true }
 datafusion-comet-proto = { workspace = true }
 object_store = { workspace = true }
 url = { workspace = true }
-chrono = { workspace = true }
 parking_lot = "0.12.3"
 datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
```
