
Commit 64ebfcc

Authored by andygrove and claude
docs: various documentation updates in preparation for next release (apache#3254)
* docs: add missing expressions to user guide

  Add documentation for expressions that were implemented but not documented in the supported expressions list:

  - Left (string function)
  - DateDiff, DateFormat, LastDay, UnixDate, UnixTimestamp (date/time)
  - Sha1 (hashing)
  - JsonToStructs (struct)

  Also fixes TruncTimestamp SQL from `trunc_date` to `date_trunc`.

* docs: update documentation to reflect auto scan no longer falls back to native_comet

  Updates contributor guide documentation following the change in c9af2c6:

  - parquet_scans.md: clarify that auto mode falls back to Spark's native scan, not native_comet
  - roadmap.md: reflect that the switch to native_iceberg_compat for auto mode is complete

* docs: add REST catalog example to Iceberg user guide

  Adds documentation showing how to configure Comet's native Iceberg scan with a REST catalog, including example Spark configuration and sample queries demonstrating namespace creation.

* docs: add supported features section to Iceberg user guide

  Adds comprehensive documentation of native Iceberg reader capabilities:

  - Table spec versions (v1, v2 supported; v3 falls back)
  - Schema and data types including complex types and schema evolution
  - Time travel and branch reads
  - MOR table delete handling (positional and equality deletes)
  - Filter pushdown operations
  - Partitioning with various transform types
  - Storage backends (local, HDFS, S3)

  Also improves the limitations section with clearer explanations.

* chore: apply prettier formatting to iceberg.md

* docs: update CSV section to document native CSV scan support

  The datasources.md previously stated that Comet does not provide native CSV scan, but experimental native CSV support was added in commit f538424. Updated to reflect the new spark.comet.scan.csv.v2.enabled option.

* docs: fix incomplete sentence and update roadmap in contributor guide

  - parquet_scans.md: complete the truncated sentence in S3 Support section
  - roadmap.md: update native_comet removal section to reflect that auto mode now uses native_iceberg_compat (milestone achieved)
  - roadmap.md: fix typo "originally" -> "original"

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 345e23f commit 64ebfcc

5 files changed (+117, -27 lines)


docs/source/contributor-guide/parquet_scans.md

Lines changed: 4 additions & 4 deletions
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following li
 - When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
   or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
   logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
-  rather than signed. By default, Comet will fall back to `native_comet` when scanning Parquet files containing `byte` or `short`
-  types (regardless of the logical type). This behavior can be disabled by setting
+  rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
+  `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
   `spark.comet.scan.allowIncompatible=true`.
 - No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
 
@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:
 
 ## S3 Support
 
-There are some
+There are some differences in S3 support between the scan implementations.
 
 ### `native_comet`
 
-The default `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
+The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
 is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
 configurations works the same way as in vanilla Spark.
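Editor's note: the `spark.comet.scan.allowIncompatible=true` setting referenced in the hunk above can be exercised from a Spark session. This is a minimal spark-shell sketch only; the Parquet path is a placeholder, and it assumes the option is settable on an existing session.

```scala
// Sketch: opt in to Arrow-style unsigned reads for byte/short columns instead of the
// default fallback to Spark's native scan (assumes runtime-settable session config).
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Placeholder path: a Parquet file written by a non-Spark writer with UINT_8/UINT_16 columns.
val df = spark.read.parquet("/path/to/unsigned_ints.parquet")
df.printSchema()
df.show()
```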

docs/source/contributor-guide/roadmap.md

Lines changed: 6 additions & 7 deletions
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
 ### Iceberg Integration
 
 Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
-releases. Once this integration is complete, we plan on switching from the `native_comet` scan to the
-`native_iceberg_compat` scan ([#2189]) so that complex types can be supported.
+releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
+support for complex types.
 
 [#2060]: https://github.com/apache/datafusion-comet/issues/2060
-[#2189]: https://github.com/apache/datafusion-comet/issues/2189
 
 ### Spark 4.0 Support
 
@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
 ### Removing the native_comet scan implementation
 
 We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
-is the originally scan implementation that uses mutable buffers (which is incompatible with best practices around
+is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
 Arrow FFI) and does not support complex types.
 
-Once we are using the `native_iceberg_compat` scan (which is based on DataFusion's `DataSourceExec`) in the Iceberg
-integration, we will be able to remove the `native_comet` scan implementation, and can then improve the efficiency
-of our use of Arrow FFI ([#2171]).
+Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
+we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
+Arrow FFI ([#2171]).
 
 [#2186]: https://github.com/apache/datafusion-comet/issues/2186
 [#2171]: https://github.com/apache/datafusion-comet/issues/2171
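Editor's note: the scan implementations named in this roadmap can also be selected explicitly. The config key `spark.comet.scan.impl` and the value strings in the sketch below are assumptions based on the scan names mentioned in this file; they are not part of this diff.

```scala
// Assumed config key (not in this diff): spark.comet.scan.impl selects the scan
// implementation; the default "auto" mode is the one discussed in the roadmap.
spark.conf.set("spark.comet.scan.impl", "native_iceberg_compat")

// Placeholder path; with native_iceberg_compat the scan is backed by DataFusion's DataSourceExec.
spark.read.parquet("/path/to/table.parquet").show()
```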

docs/source/user-guide/latest/datasources.md

Lines changed: 5 additions & 1 deletion
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
 
 ### CSV
 
-Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
+Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
+are read natively for improved performance. This feature is experimental and performance benefits are
+workload-dependent.
+
+Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
 converted into Arrow format, allowing native execution to happen after that.
 
 ### JSON
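Editor's note: a minimal spark-shell sketch of the two CSV options named in this hunk. The file path is a placeholder, and whether `spark.comet.scan.csv.v2.enabled` can be toggled at runtime rather than at session startup via `--conf` is an assumption.

```scala
// Option 1 (experimental): native CSV scan. May need to be passed as --conf at startup.
spark.conf.set("spark.comet.scan.csv.v2.enabled", "true")
val csv = spark.read.option("header", "true").csv("/path/to/data.csv") // placeholder path
csv.show()

// Option 2: keep Spark's CSV reader but convert its output to Arrow so that
// downstream operators execute natively.
spark.conf.set("spark.comet.convert.csv.enabled", "true")
```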

docs/source/user-guide/latest/expressions.md

Lines changed: 15 additions & 7 deletions
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | Contains | Yes | |
 | EndsWith | Yes | |
 | InitCap | No | Behavior is different in some cases, such as hyphenated names. |
+| Left | Yes | Length argument must be a literal value |
 | Length | Yes | |
 | Like | Yes | |
 | Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | Expression | SQL | Spark-Compatible? | Compatibility Notes |
 | -------------- | ---------------------------- | ----------------- | -------------------------------------------------------------------------------------------------------------------- |
 | DateAdd | `date_add` | Yes | |
+| DateDiff | `datediff` | Yes | |
+| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
 | DateSub | `date_sub` | Yes | |
 | DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
 | Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
 | FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
 | Hour | `hour` | Yes | |
+| LastDay | `last_day` | Yes | |
 | Minute | `minute` | Yes | |
 | Second | `second` | Yes | |
 | TruncDate | `trunc` | Yes | |
-| TruncTimestamp | `trunc_date` | Yes | |
+| TruncTimestamp | `date_trunc` | Yes | |
+| UnixDate | `unix_date` | Yes | |
+| UnixTimestamp | `unix_timestamp` | Yes | |
 | Year | `year` | Yes | |
 | Month | `month` | Yes | |
 | DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
 | ----------- | ----------------- |
 | Md5 | Yes |
 | Murmur3Hash | Yes |
+| Sha1 | Yes |
 | Sha2 | Yes |
 | XxHash64 | Yes |
 
@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi
 
 ## Struct Expressions
 
-| Expression | Spark-Compatible? |
-| -------------------- | ----------------- |
-| CreateNamedStruct | Yes |
-| GetArrayStructFields | Yes |
-| GetStructField | Yes |
-| StructsToJson | Yes |
+| Expression | Spark-Compatible? | Compatibility Notes |
+| -------------------- | ----------------- | ------------------------------------------ |
+| CreateNamedStruct | Yes | |
+| GetArrayStructFields | Yes | |
+| GetStructField | Yes | |
+| JsonToStructs | No | Partial support. Requires explicit schema. |
+| StructsToJson | Yes | |
 
 ## Conversion Expressions
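Editor's note: the table fix above swaps TruncTimestamp's SQL form from `trunc_date` to `date_trunc`. As a quick reference, the sketch below shows the two standard Spark SQL functions; this is plain Spark behavior, not specific to this diff.

```scala
// TruncTimestamp: date_trunc(fmt, timestamp) truncates a timestamp to the given unit.
spark.sql("SELECT date_trunc('HOUR', timestamp'2024-03-15 10:30:45')").show()
// expected result: 2024-03-15 10:00:00

// TruncDate: trunc(date, fmt) truncates a date to the given unit.
spark.sql("SELECT trunc(date'2024-03-15', 'MM')").show()
// expected result: 2024-03-01
```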

docs/source/user-guide/latest/iceberg.md

Lines changed: 87 additions & 8 deletions
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
 The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
 however the scan node to look for is `CometIcebergNativeScan`.
 
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+  --conf spark.comet.scan.icebergNative.enabled=true \
+  --conf spark.comet.explainFallback.enabled=true \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
+
 ### Current limitations
 
-The following scenarios are not yet supported, but are work in progress:
+The following scenarios will fall back to Spark's native Iceberg reader:
 
-- Iceberg table spec v3 scans will fall back.
-- Iceberg writes will fall back.
-- Iceberg table scans backed by Avro or ORC data files will fall back.
-- Iceberg table scans partitioned on `BINARY` or `DECIMAL` (with precision >28) columns will fall back.
-- Iceberg scans with residual filters (_i.e._, filter expressions that are not partition values,
-  and are evaluated on the column values at scan time) of `truncate`, `bucket`, `year`, `month`,
-  `day`, `hour` will fall back.
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+  transform functions (partition pruning still works, but row-level filtering of these
+  transforms falls back)
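Editor's note: building on the REST catalog example added above, the sketch below exercises two of the listed capabilities, time travel and filter pushdown. The snapshot id is a placeholder; a real id can be read from the table's `snapshots` metadata table. This is an illustrative sketch, not text from the diff.

```scala
// List snapshots to find a real snapshot id (the literal id below is a placeholder).
spark.sql("SELECT snapshot_id, committed_at FROM rest_cat.db.test_table.snapshots").show()

// Time travel: read the table as of a specific snapshot.
spark.sql("SELECT * FROM rest_cat.db.test_table VERSION AS OF 1234567890123456789").show()

// Predicates of these shapes match the filter pushdown list above
// (comparisons, AND/OR, IS NULL / IS NOT NULL, IN, BETWEEN).
spark.sql("SELECT name FROM rest_cat.db.test_table WHERE id BETWEEN 1 AND 2 AND name IS NOT NULL").show()
```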
