
Commit 5ddd921 (parent 5454ed4)

docs: Documentation updates for 0.9.0 release (#1981)

File tree: 11 files changed, +146 additions, −165 deletions


README.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -34,6 +34,8 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
 performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
 Spark ecosystem without requiring any code changes.
 
+Comet also accelerates Apache Iceberg, when performing Parquet scans from Spark.
+
 [Apache DataFusion]: https://datafusion.apache.org
 
 # Benefits of Using Comet
````

docs/source/contributor-guide/benchmarking_aws_ec2.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -179,8 +179,8 @@ $SPARK_HOME/bin/spark-submit \
 Install Comet JAR from Maven:
 
 ```shell
-wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
-export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
+wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.9.0/comet-spark-spark3.5_2.12-0.9.0.jar -P $SPARK_HOME/jars
+export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.9.0.jar
 ```
 
 Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
````
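
The Maven Central URL in this diff is fully determined by the Spark series, Scala series, and Comet version, which is why the release bump touches both the `wget` line and the `COMET_JAR` export. As a sketch (variable and helper names here are illustrative, not from the Comet docs), the URL can be derived in one place:

```shell
# Sketch: build the Maven Central URL for the Comet jar from its version
# components, so a release bump (e.g. 0.7.0 -> 0.9.0) only touches one line.
COMET_VERSION=0.9.0
SPARK_SERIES=3.5
SCALA_SERIES=2.12
ARTIFACT="comet-spark-spark${SPARK_SERIES}_${SCALA_SERIES}"
JAR="${ARTIFACT}-${COMET_VERSION}.jar"
URL="https://repo1.maven.org/maven2/org/apache/datafusion/${ARTIFACT}/${COMET_VERSION}/${JAR}"
echo "${URL}"
```

With these values the echoed URL matches the 0.9.0 jar location added in the diff above.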

docs/source/index.rst

Lines changed: 2 additions & 1 deletion

````diff
@@ -43,7 +43,6 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 Comet Overview <user-guide/overview>
 Installing Comet <user-guide/installation>
 Building From Source <user-guide/source>
-Kubernetes Guide <user-guide/kubernetes>
 Supported Data Sources <user-guide/datasources>
 Supported Data Types <user-guide/datatypes>
 Supported Operators <user-guide/operators>
@@ -52,6 +51,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
 Compatibility Guide <user-guide/compatibility>
 Tuning Guide <user-guide/tuning>
 Metrics Guide <user-guide/metrics>
+Iceberg Guide <user-guide/iceberg>
+Kubernetes Guide <user-guide/kubernetes>
 
 .. _toc.contributor-guide-links:
 .. toctree::
````

docs/source/user-guide/compatibility.md

Lines changed: 112 additions & 116 deletions
Large diffs are not rendered by default.

docs/source/user-guide/datasources.md

Lines changed: 7 additions & 1 deletion

````diff
@@ -28,6 +28,12 @@ in the schema are supported. When this option is not enabled, the scan will fall
 enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into Arrow format, allowing native
 execution to happen after that, but the process may not be efficient.
 
+### Apache Iceberg
+
+Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for more information.
+
+[Iceberg Guide]: iceberg.md
+
 ### CSV
 
 Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
@@ -88,7 +94,7 @@ root
 | |-- lastName: string (nullable = true)
 | |-- ageInYears: integer (nullable = true)
 
-25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.7.0 initialized
+25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
 == Physical Plan ==
 * CometColumnarToRow (2)
 +- CometNativeScan: (1)
````
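
The `spark.comet.convert.parquet.enabled` and `spark.comet.convert.csv.enabled` keys touched in this file are ordinary Spark configs passed at launch. As a sketch (the helper function and the choice to enable both flags are illustrative; only the config keys come from the documentation), they could be assembled like so:

```shell
# Sketch: collect the conversion-related Comet flags into spark-submit
# arguments. Enabling both at once is an illustrative choice, not a
# recommendation from the docs.
comet_convert_confs() {
  printf -- '--conf %s ' \
    "spark.comet.convert.parquet.enabled=true" \
    "spark.comet.convert.csv.enabled=true"
}
comet_convert_confs
# These would be spliced into a launch command, e.g.:
#   $SPARK_HOME/bin/spark-submit $(comet_convert_confs) ...
```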

docs/source/user-guide/datatypes.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -39,5 +39,7 @@ The following Spark data types are currently available:
 - Timestamp
 - TimestampNTZ
 - Null
-- Struct
-- Array
+- Complex Types
+  - Struct
+  - Array
+  - Map
````

docs/source/user-guide/iceberg.md

Lines changed: 15 additions & 21 deletions

````diff
@@ -44,22 +44,13 @@ export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.10.0-SNAPSHOT.ja
 
 ## Build Iceberg
 
-Clone the Iceberg repository.
+Clone the Iceberg repository and apply code changes needed by Comet
 
 ```shell
 git clone [email protected]:apache/iceberg.git
-```
-
-It will be necessary to make some small changes to Iceberg:
-
-- Update Gradle files to change Comet version to `0.10.0-SNAPSHOT`.
-- Replace `import org.apache.comet.shaded.arrow.c.CometSchemaImporter;` with `import org.apache.comet.CometSchemaImporter;`
-- Modify `SparkBatchQueryScan` so that it implements the `SupportsComet` interface
-- Stop shading Parquet by commenting out the following lines in the iceberg-spark build:
-
-```
-// relocate 'org.apache.parquet', 'org.apache.iceberg.shaded.org.apache.parquet'
-// relocate 'shaded.parquet', 'org.apache.iceberg.shaded.org.apache.parquet.shaded'
+cd iceberg
+git checkout apache-iceberg-1.8.1
+git apply ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
 ```
 
 Perform a clean build
@@ -74,7 +65,7 @@ Perform a clean build
 Set `ICEBERG_JAR` environment variable.
 
 ```shell
-export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.10.0-SNAPSHOT.jar
+export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.9.0-SNAPSHOT.jar
 ```
 
 Launch Spark Shell:
@@ -93,7 +84,7 @@ $SPARK_HOME/bin/spark-shell \
   --conf spark.sql.iceberg.parquet.reader-type=COMET \
   --conf spark.comet.explainFallback.enabled=true \
   --conf spark.memory.offHeap.enabled=true \
-  --conf spark.memory.offHeap.size=16g
+  --conf spark.memory.offHeap.size=2g
 ```
 
 Create an Iceberg table. Note that Comet will not accelerate this part.
@@ -113,12 +104,6 @@ This should produce the following output:
 
 ```
 scala> spark.sql(s"SELECT * from t1").show()
-25/04/28 07:29:37 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
-25/04/28 07:29:37 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
-CollectLimit
-+- Project [COMET: toprettystring is not supported]
-   +- CometScanWrapper
-
 +---+---+
 | c0| c1|
 +---+---+
@@ -145,3 +130,12 @@ scala> spark.sql(s"SELECT * from t1").show()
 +---+---+
 only showing top 20 rows
 ```
+
+Confirm that the query was accelerated by Comet:
+
+```
+scala> spark.sql(s"SELECT * from t1").explain()
+== Physical Plan ==
+*(1) CometColumnarToRow
++- CometBatchScan spark_catalog.default.t1[c0#26, c1#27] spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters: []
+```
````
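
The rewritten build steps pin an Iceberg release tag and apply a matching patch shipped in the Comet repo at `dev/diffs/iceberg/<version>.diff`. A small sketch of that naming convention (the helper function is illustrative, not part of Comet):

```shell
# Sketch: derive the Comet patch path from a pinned Iceberg tag, following
# the dev/diffs/iceberg/<version>.diff layout used in the guide above.
iceberg_patch_for_tag() {
  local tag="$1"
  local version="${tag#apache-iceberg-}"   # strip the tag prefix -> 1.8.1
  printf 'dev/diffs/iceberg/%s.diff' "${version}"
}
iceberg_patch_for_tag apache-iceberg-1.8.1
```

For the tag checked out in the guide, this yields the same `dev/diffs/iceberg/1.8.1.diff` path passed to `git apply`.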

docs/source/user-guide/installation.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -51,7 +51,7 @@ use only and should not be used in production yet.
 
 | Spark Version | Java Version | Scala Version | Comet Tests in CI | Spark SQL Tests in CI |
 | -------------- | ------------ | ------------- | ----------------- |-----------------------|
-| 4.0.0-preview1 | 17 | 2.13 | Yes | Yes |
+| 4.0.0 | 17 | 2.13 | Yes | Yes |
 
 Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
 Cloud Service Providers.
````

docs/source/user-guide/overview.md

Lines changed: 0 additions & 8 deletions

````diff
@@ -30,14 +30,6 @@ The following diagram provides an overview of Comet's architecture.
 
 ![Comet Overview](../_static/images/comet-overview.png)
 
-Comet aims to support:
-
-- a native Parquet implementation, including both reader and writer
-- full implementation of Spark operators, including
-  Filter/Project/Aggregation/Join/Exchange etc.
-- full implementation of Spark built-in expressions.
-- a UDF framework for users to migrate their existing UDF to native
-
 ## Architecture
 
 The following diagram shows how Comet integrates with Apache Spark.
````

docs/source/user-guide/source.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -27,7 +27,7 @@ Official source releases can be downloaded from https://dist.apache.org/repos/di
 
 ```console
 # Pick the latest version
-export COMET_VERSION=0.7.0
+export COMET_VERSION=0.9.0
 # Download the tarball
 curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
 # Unpack
````
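
The tarball URL in this snippet is fully determined by `COMET_VERSION`, which is the point of the `# Pick the latest version` line. A sketch of the derivation (variable names are illustrative):

```shell
# Sketch: the release tarball name and URL follow directly from the version,
# so updating COMET_VERSION updates the whole download path.
COMET_VERSION=0.9.0
TARBALL="apache-datafusion-comet-${COMET_VERSION}.tar.gz"
URL="https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-${COMET_VERSION}/${TARBALL}"
echo "${URL}"
```

The echoed URL matches the `curl -O` target in the snippet above once the shell expands `$COMET_VERSION`.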
