Commit 1c111d5

docs: Improve Gluten comparison based on feedback from the community (#2048)
1 parent 2933d16 commit 1c111d5

1 file changed: +17 -34 lines changed

docs/source/user-guide/gluten_comparison.md

Lines changed: 17 additions & 34 deletions
@@ -20,7 +20,7 @@
 # Comparison of Comet and Gluten
 
 This document provides a comparison of the Comet and Gluten projects to help guide users who are looking to choose
-between them. This document is likely biased because it is maintained by the Comet community.
+between them. This document is likely biased because the Comet community maintains it.
 
 We recommend trying out both Comet and Gluten to see which is the best fit for your needs.
 
@@ -29,7 +29,7 @@ This document is based on Comet 0.9.0 and Gluten 1.4.0.
 ## Architecture
 
 Comet and Gluten have very similar architectures. Both are Spark plugins that translate Spark physical plans to
-a serialized representation and pass them to native code for execution.
+a serialized representation and pass the serialized plan to native code for execution.
 
 Gluten serializes the plans using the Substrait format and has an extensible architecture that supports execution
 against multiple engines. Velox and Clickhouse are currently supported, but Velox is more widely used.
@@ -48,8 +48,13 @@ Apache Software Foundation.
 
 Velox and DataFusion are both mature query engines that are growing in popularity.
 
-Comet may be a better choice for users with plans for integrating with other Rust software in the future, and
-Gluten+Velox may be a better choice for users with plans for integrating with other C++ code.
+From the point of view of the usage of these query engines in Gluten and Comet, the most significant difference is
+the choice of implementation language (Rust vs C++) and this may be the main factor that users should consider when
+choosing a solution. For users wishing to implement UDFs in Rust, Comet would likely be a better choice. For users
+wishing to implement UDFs in C++, Gluten would likely be a better choice.
+
+If users are just interested in speeding up their existing Spark jobs and do not need to implement UDFs in native
+code, then we suggest benchmarking with both solutions and choosing the fastest one for your use case.
 
 ![github-stars-datafusion-velox.png](../_static/images/github-stars-datafusion-velox.png)
 
@@ -69,47 +74,25 @@ suite. See the [Gluten Compatibility Guide] for more information.
 ## Performance
 
 When running a benchmark derived from TPC-H on a single node against local Parquet files, we see that both Comet
-and Gluten provide a good speedup when compared to Spark. Comet provides a 2.4x speedup compares to a 2.8x speedup
+and Gluten provide an impressive speedup when compared to Spark. Comet provides a 2.4x speedup compares to a 2.8x speedup
 with Gluten.
 
-Gluten is currently slightly faster than Comet, but we expect to close that gap over time.
+Gluten is currently faster than Comet for this particular benchmark, but we expect to close that gap over time.
+
+Although TPC-H is a good benchmark for operators such as joins and aggregates, it doesn't necessarily represent
+real-world queries, especially for ETL use cases. For example, there are no complex types involved and no string
+manipulation, regular expressions, or other advanced expressions. We recommend running your own benchmarks based
+on your existing Spark jobs.
 
 ![tpch_allqueries_comet_gluten.png](../_static/images//benchmark-results/0.9.0/tpch_spark_comet_gluten.png)
 
 The scripts that were used to generate these results can be found [here](https://github.com/apache/datafusion-comet/tree/main/dev/benchmarks).
 
-## Ease of Development
-
-Comet has a much smaller codebase than Gluten. A fresh clone of the respective repositories shows that Comet has ~41k
-lines of Scala+Java code and ~40k lines of Rust code. Gluten has ~207k lines of Scala+Java code and ~89k lines of C++
-code.
+## Ease of Development & Contributing
 
 Setting up a local development environment with Comet is generally easier than with Gluten due to Rust's package
 management capabilities vs the complexities around installing C++ dependencies.
 
-### Comet Lines of Code
-
-```
--------------------------------------------------------------------------------
-Language files blank comment code
--------------------------------------------------------------------------------
-Rust 159 4870 5388 39989
-Scala 171 4849 6277 32538
-Java 66 1556 2619 8724
-```
-
-### Gluten Lines of Code
-
-```
---------------------------------------------------------------------------------
-Language files blank comment code
---------------------------------------------------------------------------------
-Scala 1312 23264 37534 179664
-C++ 421 9841 10245 64554
-Java 328 5063 6726 26520
-C/C++ Header 304 4875 6255 23527
-```
-
 ## Summary
 
 Comet and Gluten are both good solutions for accelerating Spark jobs. We recommend trying both to see which is the
