---
layout: post
title: Apache DataFusion Comet 0.13.0 Release
date: 2025-12-04
author: pmc
categories: [subprojects]
---


<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

The Apache DataFusion PMC is pleased to announce version 0.13.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.

This release covers approximately eight weeks of development work and is the result of merging 160 PRs from 15
contributors. See the [change log] for more information.

[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.13.0.md

## Key Features

### Native Parquet Write Support (Experimental)

This release introduces experimental native Parquet write support, allowing Comet to intercept and execute Parquet write operations natively through DataFusion. Key capabilities include:

- File commit protocol support for reliable writes
- Remote HDFS writing via OpenDAL integration
- Complex type support (arrays, maps, structs)
- Proper handling of object store settings

To enable native Parquet writes, set:

```
spark.comet.allowIncompatibleOp.DataWritingCommandExec=true
spark.comet.parquet.write.enabled=true
```

**Note**: This feature is highly experimental and should not be used in production environments. It is currently categorized as a testing feature and is disabled by default.
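
As a concrete sketch, the same settings can also be passed at launch time. The invocation below is illustrative and assumes the rest of the usual Comet setup (such as adding the Comet jar) is already in place:

```shell
spark-shell \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.allowIncompatibleOp.DataWritingCommandExec=true \
  --conf spark.comet.parquet.write.enabled=true
```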

### Native Iceberg Improvements

Comet's fully native Iceberg integration received significant enhancements in this release:

**REST Catalog Support**: Native Iceberg scans now support REST catalogs, enabling integration with catalog services like Apache Polaris and Tabular. Configure with:

```shell
--conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog
--conf spark.sql.catalog.rest_cat.uri=http://localhost:8181
--conf spark.comet.scan.icebergNative.enabled=true
```

**Session Token Authentication**: Added support for session tokens in native Iceberg scans for secure S3 access.
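
As an illustrative sketch, session credentials can be supplied through the standard Iceberg S3 catalog properties (the catalog name and placeholder values below are hypothetical):

```shell
--conf spark.sql.catalog.my_catalog.s3.access-key-id=<access-key>
--conf spark.sql.catalog.my_catalog.s3.secret-access-key=<secret-key>
--conf spark.sql.catalog.my_catalog.s3.session-token=<session-token>
```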

**Performance Optimizations**:

- Deduplicated serialized metadata, reducing memory overhead
- Switched from JSON to protobuf for partition value serialization
- Removed IcebergFileStream in favor of iceberg-rust's built-in parallelization
- Reduced the number of metadata serialization points
- Added SchemaAdapter caching

To enable fully native Iceberg scanning:

```
spark.comet.scan.icebergNative.enabled=true
```

The native reader supports Iceberg table spec v1 and v2, all primitive and complex types, schema evolution, time travel, positional and equality deletes, filter pushdown, and various storage backends (local, HDFS, S3).

### Native CSV Reading (Experimental)

Experimental support for native CSV file reading has been added, expanding Comet's file format capabilities beyond Parquet.

### New Expressions

The release adds support for numerous expressions:

- Array functions: `explode`, `explode_outer`, `size`
- Date/time functions: `unix_date`, `date_format`, `datediff`, `last_day`, `unix_timestamp`
- String functions: `left`
- JSON functions: `from_json` (partial support)
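
For example, a query exercising several of the newly supported functions (the table and column names are hypothetical) can now execute natively:

```sql
SELECT
  datediff(end_date, start_date) AS duration_days,
  last_day(start_date) AS month_end,
  left(name, 3) AS name_prefix,
  size(tags) AS tag_count
FROM events
```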

### ANSI Mode Support

Sum and average aggregate expressions now support ANSI mode for both integer and decimal inputs, enabling overflow checking in strict SQL mode.
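
Conceptually, ANSI mode replaces silent wraparound with an explicit error when an aggregate overflows. A rough sketch of the check in plain Python (illustrative only; Comet's actual implementation is native Rust code):

```python
# Bounds of a 64-bit signed integer, as used by Spark's LONG type.
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def ansi_checked_add(acc: int, value: int) -> int:
    """Accumulate a 64-bit sum, failing on overflow as ANSI SQL requires."""
    result = acc + value
    if not (INT64_MIN <= result <= INT64_MAX):
        # Under ANSI mode Spark raises an ARITHMETIC_OVERFLOW error here;
        # with ANSI mode off, the sum would silently wrap around instead.
        raise OverflowError("long overflow")
    return result
```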

### Native Shuffle Improvements

- Round-robin partitioning is now supported in native shuffle
- Spill metrics are now reported correctly
- Configurable shuffle writer buffer size via `spark.comet.shuffle.write.bufferSize`
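
For example (the value shown is illustrative; consult the Comet configuration reference for the accepted format and the default):

```
spark.comet.shuffle.write.bufferSize=1m
```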

## Performance Improvements

This release includes extensive performance optimizations:

- **String to integer casting**: Significant speedups through optimized parsing
- **String functions**: Optimized `lpad`/`rpad` to remove unnecessary memory allocations
- **Date operations**: Improved `normalize_nan` and date truncation performance
- **Query planning**: Cached query plans to avoid per-partition serialization overhead
- **Memory efficiency**: Reduced GC pressure in protobuf serialization
- **Hash operations**: Optimized complex-type hash implementations, including murmur3 support for nested types
- **Runtime efficiency**: Eliminated busy-polling of the Tokio stream for plans without CometScan
- **Metrics overhead**: Reduced timer and syscall overhead in the native shuffle writer

## Deprecations

The `native_comet` scan mode is now deprecated in favor of `native_iceberg_compat` and will be removed in a future release. The `auto` scan mode no longer falls back to `native_comet`.
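
Users still relying on `native_comet` can select a scan implementation explicitly. This sketch assumes the `spark.comet.scan.impl` setting; verify the exact key against the Comet configuration guide for your version:

```
spark.comet.scan.impl=native_iceberg_compat
```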

## Compatibility

This release upgrades to DataFusion 51, Arrow 57, and the latest iceberg-rust. The minimum supported Rust version is now 1.88.

Supported platforms include Spark 3.4.3, 3.5.4-3.5.7, and Spark 4.0.x with various JDK and Scala combinations.

The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.
