
Commit 5ddd921 (parent 5454ed4)

docs: Documentation updates for 0.9.0 release (#1981)

File tree: 11 files changed, +146 additions, −165 deletions


README.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -34,6 +34,8 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
 performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
 Spark ecosystem without requiring any code changes.
 
+Comet also accelerates Apache Iceberg, when performing Parquet scans from Spark.
+
 [Apache DataFusion]: https://datafusion.apache.org
 
 # Benefits of Using Comet
````

docs/source/contributor-guide/benchmarking_aws_ec2.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -179,8 +179,8 @@ $SPARK_HOME/bin/spark-submit \
 Install Comet JAR from Maven:
 
 ```shell
-wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
-export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
+wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.9.0/comet-spark-spark3.5_2.12-0.9.0.jar -P $SPARK_HOME/jars
+export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.9.0.jar
 ```
 
 Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
````
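
The Maven Central URL in this diff is fully determined by the Spark series, Scala series, and Comet version, which is why the release bump touches both the `wget` line and the `COMET_JAR` export. As a sketch (variable and helper names here are illustrative, not from the Comet docs), the URL can be derived in one place:

```shell
# Sketch: build the Maven Central URL for the Comet jar from its version
# components, so a release bump (e.g. 0.7.0 -> 0.9.0) only touches one line.
COMET_VERSION=0.9.0
SPARK_SERIES=3.5
SCALA_SERIES=2.12
ARTIFACT="comet-spark-spark${SPARK_SERIES}_${SCALA_SERIES}"
JAR="${ARTIFACT}-${COMET_VERSION}.jar"
URL="https://repo1.maven.org/maven2/org/apache/datafusion/${ARTIFACT}/${COMET_VERSION}/${JAR}"
echo "${URL}"
```

With these values the echoed URL matches the 0.9.0 jar location added in the diff above.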

docs/source/index.rst

Lines changed: 2 additions & 1 deletion

````diff
@@ -43,7 +43,6 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 Comet Overview <user-guide/overview>
 Installing Comet <user-guide/installation>
 Building From Source <user-guide/source>
-Kubernetes Guide <user-guide/kubernetes>
 Supported Data Sources <user-guide/datasources>
 Supported Data Types <user-guide/datatypes>
 Supported Operators <user-guide/operators>
@@ -52,6 +51,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 Compatibility Guide <user-guide/compatibility>
 Tuning Guide <user-guide/tuning>
 Metrics Guide <user-guide/metrics>
+Iceberg Guide <user-guide/iceberg>
+Kubernetes Guide <user-guide/kubernetes>
 
 .. _toc.contributor-guide-links:
 .. toctree::
````

docs/source/user-guide/compatibility.md

Lines changed: 112 additions & 116 deletions
Large diffs are not rendered by default.

docs/source/user-guide/datasources.md

Lines changed: 7 additions & 1 deletion

````diff
@@ -28,6 +28,12 @@ in the schema are supported. When this option is not enabled, the scan will fall
 enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into Arrow format, allowing native
 execution to happen after that, but the process may not be efficient.
 
+### Apache Iceberg
+
+Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for more information.
+
+[Iceberg Guide]: iceberg.md
+
 ### CSV
 
 Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
@@ -88,7 +94,7 @@ root
 | |-- lastName: string (nullable = true)
 | |-- ageInYears: integer (nullable = true)
 
-25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.7.0 initialized
+25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
 == Physical Plan ==
 * CometColumnarToRow (2)
 +- CometNativeScan: (1)
````
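
The `spark.comet.convert.parquet.enabled` and `spark.comet.convert.csv.enabled` keys touched in this file are ordinary Spark configs passed at launch. As a sketch (the helper function and the choice to enable both flags are illustrative; only the config keys come from the documentation), they could be assembled like so:

```shell
# Sketch: collect the conversion-related Comet flags into spark-submit
# arguments. Enabling both at once is an illustrative choice, not a
# recommendation from the docs.
comet_convert_confs() {
  printf -- '--conf %s ' \
    "spark.comet.convert.parquet.enabled=true" \
    "spark.comet.convert.csv.enabled=true"
}
comet_convert_confs
# These would be spliced into a launch command, e.g.:
#   $SPARK_HOME/bin/spark-submit $(comet_convert_confs) ...
```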

docs/source/user-guide/datatypes.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -39,5 +39,7 @@ The following Spark data types are currently available:
 - Timestamp
 - TimestampNTZ
 - Null
-- Struct
-- Array
+- Complex Types
+  - Struct
+  - Array
+  - Map
````

docs/source/user-guide/iceberg.md

Lines changed: 15 additions & 21 deletions

````diff
@@ -44,22 +44,13 @@ export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.10.0-SNAPSHOT.ja
 
 ## Build Iceberg
 
-Clone the Iceberg repository.
+Clone the Iceberg repository and apply code changes needed by Comet
 
 ```shell
 git clone [email protected]:apache/iceberg.git
-```
-
-It will be necessary to make some small changes to Iceberg:
-
-- Update Gradle files to change Comet version to `0.10.0-SNAPSHOT`.
-- Replace `import org.apache.comet.shaded.arrow.c.CometSchemaImporter;` with `import org.apache.comet.CometSchemaImporter;`
-- Modify `SparkBatchQueryScan` so that it implements the `SupportsComet` interface
-- Stop shading Parquet by commenting out the following lines in the iceberg-spark build:
-
-```
-// relocate 'org.apache.parquet', 'org.apache.iceberg.shaded.org.apache.parquet'
-// relocate 'shaded.parquet', 'org.apache.iceberg.shaded.org.apache.parquet.shaded'
+cd iceberg
+git checkout apache-iceberg-1.8.1
+git apply ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
 ```
 
 Perform a clean build
@@ -74,7 +65,7 @@ Perform a clean build
 Set `ICEBERG_JAR` environment variable.
 
 ```shell
-export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.10.0-SNAPSHOT.jar
+export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.9.0-SNAPSHOT.jar
 ```
 
 Launch Spark Shell:
@@ -93,7 +84,7 @@ $SPARK_HOME/bin/spark-shell \
   --conf spark.sql.iceberg.parquet.reader-type=COMET \
   --conf spark.comet.explainFallback.enabled=true \
   --conf spark.memory.offHeap.enabled=true \
-  --conf spark.memory.offHeap.size=16g
+  --conf spark.memory.offHeap.size=2g
 ```
 
 Create an Iceberg table. Note that Comet will not accelerate this part.
@@ -113,12 +104,6 @@ This should produce the following output:
 
 ```
 scala> spark.sql(s"SELECT * from t1").show()
-25/04/28 07:29:37 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
-25/04/28 07:29:37 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
-CollectLimit
-+- Project [COMET: toprettystring is not supported]
-   +- CometScanWrapper
-
 +---+---+
 | c0| c1|
 +---+---+
@@ -145,3 +130,12 @@ scala> spark.sql(s"SELECT * from t1").show()
 +---+---+
 only showing top 20 rows
 ```
+
+Confirm that the query was accelerated by Comet:
+
+```
+scala> spark.sql(s"SELECT * from t1").explain()
+== Physical Plan ==
+*(1) CometColumnarToRow
++- CometBatchScan spark_catalog.default.t1[c0#26, c1#27] spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters: []
+```
````
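
The rewritten build steps pin an Iceberg release tag and apply a matching patch shipped in the Comet repo at `dev/diffs/iceberg/<version>.diff`. A small sketch of that naming convention (the helper function is illustrative, not part of Comet):

```shell
# Sketch: derive the Comet patch path from a pinned Iceberg tag, following
# the dev/diffs/iceberg/<version>.diff layout used in the guide above.
iceberg_patch_for_tag() {
  local tag="$1"
  local version="${tag#apache-iceberg-}"   # strip the tag prefix -> 1.8.1
  printf 'dev/diffs/iceberg/%s.diff' "${version}"
}
iceberg_patch_for_tag apache-iceberg-1.8.1
```

For the tag checked out in the guide, this yields the same `dev/diffs/iceberg/1.8.1.diff` path passed to `git apply`.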

docs/source/user-guide/installation.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -51,7 +51,7 @@ use only and should not be used in production yet.
 
 | Spark Version | Java Version | Scala Version | Comet Tests in CI | Spark SQL Tests in CI |
 | -------------- | ------------ | ------------- | ----------------- |-----------------------|
-| 4.0.0-preview1 | 17 | 2.13 | Yes | Yes |
+| 4.0.0 | 17 | 2.13 | Yes | Yes |
 
 Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
 Cloud Service Providers.
````

docs/source/user-guide/overview.md

Lines changed: 0 additions & 8 deletions

````diff
@@ -30,14 +30,6 @@ The following diagram provides an overview of Comet's architecture.
 
 ![Comet Overview](../_static/images/comet-overview.png)
 
-Comet aims to support:
-
-- a native Parquet implementation, including both reader and writer
-- full implementation of Spark operators, including
-  Filter/Project/Aggregation/Join/Exchange etc.
-- full implementation of Spark built-in expressions.
-- a UDF framework for users to migrate their existing UDF to native
-
 ## Architecture
 
 The following diagram shows how Comet integrates with Apache Spark.
````

docs/source/user-guide/source.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -27,7 +27,7 @@ Official source releases can be downloaded from https://dist.apache.org/repos/di
 
 ```console
 # Pick the latest version
-export COMET_VERSION=0.7.0
+export COMET_VERSION=0.9.0
 # Download the tarball
 curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
 # Unpack
````
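
The tarball URL in this snippet is fully determined by `COMET_VERSION`, which is the point of the `# Pick the latest version` line. A sketch of the derivation (variable names are illustrative):

```shell
# Sketch: the release tarball name and URL follow directly from the version,
# so updating COMET_VERSION updates the whole download path.
COMET_VERSION=0.9.0
TARBALL="apache-datafusion-comet-${COMET_VERSION}.tar.gz"
URL="https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-${COMET_VERSION}/${TARBALL}"
echo "${URL}"
```

The echoed URL matches the `curl -O` target in the snippet above once the shell expands `$COMET_VERSION`.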
