Commit 18425d9

docs: Update compatibility docs for new native scans (#1657)

1 parent c04784a

2 files changed: +72 −38 lines

docs/source/user-guide/compatibility.md

Lines changed: 36 additions & 19 deletions

```diff
@@ -29,30 +29,47 @@ Comet aims to provide consistent results with the version of Apache Spark that i
 
 This guide offers information about areas of functionality where there are known differences.
 
-## Parquet Scans
-
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation.
+# Compatibility Guide
 
-| Implementation | Description |
-| ----------------------- | ----------- |
-| `native_comet` | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
-| `native_datafusion` | This implementation delegates to DataFusion's `ParquetExec`. |
-| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+Comet aims to provide consistent results with the version of Apache Spark that is being used.
 
-The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
-provide the following benefits over the `native_comet` implementation:
+This guide offers information about areas of functionality where there are known differences.
 
-- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
-- Provide support for reading complex types (structs, arrays, and maps)
-- Remove the use of reusable mutable buffers in Comet, which is complex to maintain
+## Parquet Scans
 
-These new implementations are not fully implemented. Some of the current limitations are:
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
 
-- Scanning Parquet files containing unsigned 8 or 16-bit integers can produce results that don't match Spark. By default, Comet
-  will fall back to Spark when using these scan implementations to read Parquet files containing 8 or 16-bit integers.
-  This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
-- These implementations do not yet fully support timestamps, decimals, or complex types.
+| Implementation | Description |
+| ----------------------- | ----------- |
+| `native_comet` | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
+| `native_datafusion` | This implementation delegates to DataFusion's `DataSourceExec`. |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans provide the following
+benefits over the `native_comet` implementation:
+
+- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
+- Provides support for reading complex types (structs, arrays, and maps)
+- Removes the use of reusable mutable buffers in Comet, which is complex to maintain
+- Improves performance
+
+The new scans currently have the following limitations:
+
+- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
+  or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
+  logical types. Arrow-based readers, such as DataFusion and Comet, do respect these types and read the data as
+  unsigned rather than signed. By default, Comet will fall back to Spark when scanning Parquet files containing
+  `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
+  `spark.comet.scan.allowIncompatible=true`.
+- Reading legacy INT96 timestamps contained within complex types can produce different results from Spark.
+- There is a known performance issue when pushing filters down to Parquet. See the [Comet Tuning Guide] for more
+  information.
+- There are failures in the Spark SQL test suite when enabling these new scans (tracking issues: [#1542] and [#1545]).
+
+[#1542]: https://github.com/apache/datafusion-comet/issues/1542
+[#1545]: https://github.com/apache/datafusion-comet/issues/1545
+[Comet Tuning Guide]: tuning.md
 
 ## ANSI mode
 
```

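The `UINT_8`/`UINT_16` caveat in the diff above comes down to how the same stored byte is reinterpreted. A minimal Python sketch (plain Python, not Comet or Arrow code) of the two readings:

```python
# The same raw bytes, read two ways:
#  - signed, as Spark's ByteType sees them (Spark has no unsigned types)
#  - unsigned, as an Arrow-based reader honoring the UINT_8 logical type sees them
raw = bytes([0xFF, 0x80, 0x7F])

signed = [b - 256 if b > 0x7F else b for b in raw]  # Spark-style view
unsigned = list(raw)                                # Arrow-style UINT_8 view

print(signed)    # [-1, -128, 127]
print(unsigned)  # [255, 128, 127]
```

Values at or below 0x7F agree under both readings, which is why the divergence only shows up for data that actually uses the upper half of the unsigned range.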
docs/templates/compatibility-template.md

Lines changed: 36 additions & 19 deletions

The same change is applied verbatim to the documentation template; the diff is identical to the one shown
above for `docs/source/user-guide/compatibility.md`.

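For reference, the scan selection and fallback toggle documented in this commit are ordinary Spark configuration keys. A minimal sketch of a `spark-defaults.conf` fragment; only the two `spark.comet.*` keys come from the docs above, and the values shown are one possible choice, not a recommendation:

```properties
# Select the experimental DataFusion-based Parquet scan (default is native_comet).
spark.comet.scan.impl=native_datafusion
# Optionally allow native scans of byte/short columns despite the UINT_8/UINT_16 caveat.
spark.comet.scan.allowIncompatible=true
```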