
Commit 64ebfcc

Authored by andygrove and claude
docs: various documentation updates in preparation for next release (apache#3254)
* docs: add missing expressions to user guide

  Add documentation for expressions that were implemented but not documented in the supported expressions list:

  - Left (string function)
  - DateDiff, DateFormat, LastDay, UnixDate, UnixTimestamp (date/time)
  - Sha1 (hashing)
  - JsonToStructs (struct)

  Also fixes TruncTimestamp SQL from `trunc_date` to `date_trunc`.

* docs: update documentation to reflect auto scan no longer falls back to native_comet

  Updates contributor guide documentation following the change in c9af2c6:

  - parquet_scans.md: clarify that auto mode falls back to Spark's native scan, not native_comet
  - roadmap.md: reflect that the switch to native_iceberg_compat for auto mode is complete

* docs: add REST catalog example to Iceberg user guide

  Adds documentation showing how to configure Comet's native Iceberg scan with a REST catalog, including example Spark configuration and sample queries demonstrating namespace creation.

* docs: add supported features section to Iceberg user guide

  Adds comprehensive documentation of native Iceberg reader capabilities:

  - Table spec versions (v1, v2 supported; v3 falls back)
  - Schema and data types including complex types and schema evolution
  - Time travel and branch reads
  - MOR table delete handling (positional and equality deletes)
  - Filter pushdown operations
  - Partitioning with various transform types
  - Storage backends (local, HDFS, S3)

  Also improves the limitations section with clearer explanations.

* chore: apply prettier formatting to iceberg.md

* docs: update CSV section to document native CSV scan support

  The datasources.md previously stated that Comet does not provide native CSV scan, but experimental native CSV support was added in commit f538424. Updated to reflect the new spark.comet.scan.csv.v2.enabled option.

* docs: fix incomplete sentence and update roadmap in contributor guide

  - parquet_scans.md: complete the truncated sentence in S3 Support section
  - roadmap.md: update native_comet removal section to reflect that auto mode now uses native_iceberg_compat (milestone achieved)
  - roadmap.md: fix typo "originally" -> "original"

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 345e23f commit 64ebfcc

5 files changed (+117, -27 lines)


docs/source/contributor-guide/parquet_scans.md

Lines changed: 4 additions & 4 deletions
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li
 - When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
   or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
   logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
-  rather than signed. By default, Comet will fall back to `native_comet` when scanning Parquet files containing `byte` or `short`
-  types (regardless of the logical type). This behavior can be disabled by setting
+  rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
+  `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
   `spark.comet.scan.allowIncompatible=true`.
 - No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
 
@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:
 
 ## S3 Support
 
-There are some
+There are some differences in S3 support between the scan implementations.
 
 ### `native_comet`
 
-The default `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
+The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
 is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
 configurations works the same way as in vanilla Spark.
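Editor's note: the `spark.comet.scan.allowIncompatible=true` setting referenced in the hunk above can be exercised from a Spark session. This is a minimal spark-shell sketch only; the Parquet path is a placeholder, and it assumes the option is settable on an existing session.

```scala
// Sketch: opt in to Arrow-style unsigned reads for byte/short columns instead of the
// default fallback to Spark's native scan (assumes runtime-settable session config).
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Placeholder path: a Parquet file written by a non-Spark writer with UINT_8/UINT_16 columns.
val df = spark.read.parquet("/path/to/unsigned_ints.parquet")
df.printSchema()
df.show()
```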

docs/source/contributor-guide/roadmap.md

Lines changed: 6 additions & 7 deletions
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
 ### Iceberg Integration
 
 Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the `native_comet` scan to the
-`native_iceberg_compat` scan ([#2189]) so that complex types can be supported.
+releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
+support for complex types.
 
 [#2060]: https://github.com/apache/datafusion-comet/issues/2060
-[#2189]: https://github.com/apache/datafusion-comet/issues/2189
 
 ### Spark 4.0 Support
 
@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
 ### Removing the native_comet scan implementation
 
 We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
 Arrow FFI) and does not support complex types.
 
-Once we are using the `native_iceberg_compat` scan (which is based on DataFusion's `DataSourceExec`) in the Iceberg
-integration, we will be able to remove the `native_comet` scan implementation, and can then improve the efficiency
-of our use of Arrow FFI ([#2171]).
+Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
+we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
+Arrow FFI ([#2171]).
 
 [#2186]: https://github.com/apache/datafusion-comet/issues/2186
 [#2171]: https://github.com/apache/datafusion-comet/issues/2171
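Editor's note: the scan implementations named in this roadmap can also be selected explicitly. The config key `spark.comet.scan.impl` and the value strings in the sketch below are assumptions based on the scan names mentioned in this file; they are not part of this diff.

```scala
// Assumed config key (not in this diff): spark.comet.scan.impl selects the scan
// implementation; the default "auto" mode is the one discussed in the roadmap.
spark.conf.set("spark.comet.scan.impl", "native_iceberg_compat")

// Placeholder path; with native_iceberg_compat the scan is backed by DataFusion's DataSourceExec.
spark.read.parquet("/path/to/table.parquet").show()
```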

docs/source/user-guide/latest/datasources.md

Lines changed: 5 additions & 1 deletion
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
 
 ### CSV
 
-Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
+Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are
+workload-dependent.
+
+Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
 converted into Arrow format, allowing native execution to happen after that.
 
 ### JSON
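Editor's note: a minimal spark-shell sketch of the two CSV options named in this hunk. The file path is a placeholder, and whether `spark.comet.scan.csv.v2.enabled` can be toggled at runtime rather than at session startup via `--conf` is an assumption.

```scala
// Option 1 (experimental): native CSV scan. May need to be passed as --conf at startup.
spark.conf.set("spark.comet.scan.csv.v2.enabled", "true")
val csv = spark.read.option("header", "true").csv("/path/to/data.csv") // placeholder path
csv.show()

// Option 2: keep Spark's CSV reader but convert its output to Arrow so that
// downstream operators execute natively.
spark.conf.set("spark.comet.convert.csv.enabled", "true")
```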

docs/source/user-guide/latest/expressions.md

Lines changed: 15 additions & 7 deletions
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | Contains | Yes | |
 | EndsWith | Yes | |
 | InitCap | No | Behavior is different in some cases, such as hyphenated names. |
+| Left | Yes | Length argument must be a literal value |
 | Length | Yes | |
 | Like | Yes | |
 | Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | Expression | SQL | Spark-Compatible? | Compatibility Notes |
 | -------------- | ---------------------------- | ----------------- | -------------------------------------------------------------------------------------------------------------------- |
 | DateAdd | `date_add` | Yes | |
+| DateDiff | `datediff` | Yes | |
+| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
 | DateSub | `date_sub` | Yes | |
 | DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
 | Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
 | FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
 | Hour | `hour` | Yes | |
+| LastDay | `last_day` | Yes | |
 | Minute | `minute` | Yes | |
 | Second | `second` | Yes | |
 | TruncDate | `trunc` | Yes | |
-| TruncTimestamp | `trunc_date` | Yes | |
+| TruncTimestamp | `date_trunc` | Yes | |
+| UnixDate | `unix_date` | Yes | |
+| UnixTimestamp | `unix_timestamp` | Yes | |
 | Year | `year` | Yes | |
 | Month | `month` | Yes | |
 | DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | ----------- | ----------------- |
 | Md5 | Yes |
 | Murmur3Hash | Yes |
+| Sha1 | Yes |
 | Sha2 | Yes |
 | XxHash64 | Yes |
 
@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi
 
 ## Struct Expressions
 
-| Expression | Spark-Compatible? |
-| -------------------- | ----------------- |
-| CreateNamedStruct | Yes |
-| GetArrayStructFields | Yes |
-| GetStructField | Yes |
-| StructsToJson | Yes |
+| Expression | Spark-Compatible? | Compatibility Notes |
+| -------------------- | ----------------- | ------------------------------------------ |
+| CreateNamedStruct | Yes | |
+| GetArrayStructFields | Yes | |
+| GetStructField | Yes | |
+| JsonToStructs | No | Partial support. Requires explicit schema. |
+| StructsToJson | Yes | |
 
 ## Conversion Expressions
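Editor's note: the table fix above swaps TruncTimestamp's SQL form from `trunc_date` to `date_trunc`. As a quick reference, the sketch below shows the two standard Spark SQL functions; this is plain Spark behavior, not specific to this diff.

```scala
// TruncTimestamp: date_trunc(fmt, timestamp) truncates a timestamp to the given unit.
spark.sql("SELECT date_trunc('HOUR', timestamp'2024-03-15 10:30:45')").show()
// expected result: 2024-03-15 10:00:00

// TruncDate: trunc(date, fmt) truncates a date to the given unit.
spark.sql("SELECT trunc(date'2024-03-15', 'MM')").show()
// expected result: 2024-03-01
```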

docs/source/user-guide/latest/iceberg.md

Lines changed: 87 additions & 8 deletions
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
 The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
 however the scan node to look for is `CometIcebergNativeScan`.
 
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+  --conf spark.comet.scan.icebergNative.enabled=true \
+  --conf spark.comet.explainFallback.enabled=true \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
+
 ### Current limitations
 
-The following scenarios are not yet supported, but are work in progress:
+The following scenarios will fall back to Spark's native Iceberg reader:
 
-- Iceberg table spec v3 scans will fall back.
-- Iceberg writes will fall back.
-- Iceberg table scans backed by Avro or ORC data files will fall back.
-- Iceberg table scans partitioned on `BINARY` or `DECIMAL` (with precision >28) columns will fall back.
-- Iceberg scans with residual filters (_i.e._, filter expressions that are not partition values,
-  and are evaluated on the column values at scan time) of `truncate`, `bucket`, `year`, `month`,
-  `day`, `hour` will fall back.
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+  transform functions (partition pruning still works, but row-level filtering of these
+  transforms falls back)
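Editor's note: building on the REST catalog example added above, the sketch below exercises two of the listed capabilities, time travel and filter pushdown. The snapshot id is a placeholder; a real id can be read from the table's `snapshots` metadata table. This is an illustrative sketch, not text from the diff.

```scala
// List snapshots to find a real snapshot id (the literal id below is a placeholder).
spark.sql("SELECT snapshot_id, committed_at FROM rest_cat.db.test_table.snapshots").show()

// Time travel: read the table as of a specific snapshot.
spark.sql("SELECT * FROM rest_cat.db.test_table VERSION AS OF 1234567890123456789").show()

// Predicates of these shapes match the filter pushdown list above
// (comparisons, AND/OR, IS NULL / IS NOT NULL, IN, BETWEEN).
spark.sql("SELECT name FROM rest_cat.db.test_table WHERE id BETWEEN 1 AND 2 AND name IS NOT NULL").show()
```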
