Comet aims to provide consistent results with the version of Apache Spark that is being used.

This guide offers information about areas of functionality where there are known differences.

## Parquet Scans

Comet currently has three distinct implementations of the Parquet scan operator. The configuration property `spark.comet.scan.impl` is used to select an implementation.

| Implementation | Description |
| --- | --- |
| `native_comet` | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
| `native_datafusion` | This implementation delegates to DataFusion's `DataSourceExec`. |
| `native_iceberg_compat` | This implementation also delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
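
For example, a minimal sketch of selecting the experimental `native_datafusion` scan when building a Spark session (this assumes the Comet jar is already on the classpath and omits other installation settings, such as shuffle configuration):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable Comet and select the experimental DataFusion-based scan.
// Assumes the Comet jar is already on the classpath; other installation settings
// are omitted for brevity.
val spark = SparkSession
  .builder()
  .appName("comet-scan-example") // illustrative application name
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.comet.scan.impl", "native_datafusion")
  .getOrCreate()
```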

The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet` implementation:

- Leverage the DataFusion community's ongoing improvements to `DataSourceExec`
- Provide support for reading complex types (structs, arrays, and maps), as illustrated in the sketch after this list
- Remove the use of reusable mutable buffers in Comet, which are complex to maintain
- Improve performance
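
As a rough sketch of what complex-type support enables (assuming a `SparkSession` named `spark` with Comet enabled and `spark.comet.scan.impl=native_datafusion`; the `/tmp` path is illustrative):

```scala
import org.apache.spark.sql.functions.{array, col, lit, map, struct}

// Minimal sketch: write and read back complex types through the Parquet scan.
// Assumes `spark` is a SparkSession with Comet enabled and
// spark.comet.scan.impl=native_datafusion; the /tmp path is illustrative.
val df = spark.range(10)
  .withColumn("s", struct(col("id").as("a"), lit("x").as("b")))
  .withColumn("arr", array(col("id"), col("id") + 1))
  .withColumn("m", map(lit("key"), col("id")))

df.write.mode("overwrite").parquet("/tmp/complex_types")

// The read path goes through the configured Comet scan implementation, which
// can read the struct/array/map columns natively instead of falling back to Spark.
spark.read.parquet("/tmp/complex_types").show()
```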

The new scans currently have the following limitations:

- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8` or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these logical types. Arrow-based readers, such as DataFusion and Comet, do respect these types and read the data as unsigned rather than signed. By default, Comet will fall back to Spark when scanning Parquet files containing `byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`, as shown in the sketch after this list.
- Reading legacy INT96 timestamps contained within complex types can produce different results than Spark.
- There is a known performance issue when pushing filters down to Parquet. See the [Comet Tuning Guide] for more information.
- There are failures in the Spark SQL test suite when enabling these new scans (tracking issues: [#1542] and [#1545]).
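
For completeness, a minimal sketch of opting out of that fallback (assuming an existing `SparkSession` named `spark` with Comet enabled; the path is illustrative):

```scala
// Minimal sketch: accept potentially Spark-incompatible results so that Comet
// scans Parquet files containing byte/short columns instead of falling back.
// Assumes `spark` is an existing SparkSession with Comet enabled.
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Illustrative path: a file with UINT_8/UINT_16 logical types would now be
// read by Comet as unsigned values rather than handled by Spark.
val df = spark.read.parquet("/tmp/unsigned_ints.parquet")
df.show()
```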