Commit 3d12780

fix: Change default value of COMET_SCAN_ALLOW_INCOMPATIBLE and add documentation (#1398)
* change default value of COMET_SCAN_ALLOW_INCOMPATIBLE and add documentation
* docs
* rename
* address feedback
* rename method based on feedback
1 parent 4d63daf commit 3d12780

11 files changed: +110 -37 lines changed

common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 1 addition & 1 deletion
@@ -614,7 +614,7 @@ object CometConf extends ShimCometConf {
         "Comet is not currently fully compatible with Spark for all datatypes. " +
           s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
       .booleanConf
-      .createWithDefault(true)
+      .createWithDefault(false)

   val COMET_EXPR_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
     conf("spark.comet.expression.allowIncompatible")

docs/source/user-guide/compatibility.md

Lines changed: 31 additions & 0 deletions
@@ -17,12 +17,43 @@ specific language governing permissions and limitations
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/compatibility-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/compatibility.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Compatibility Guide

 Comet aims to provide consistent results with the version of Apache Spark that is being used.

 This guide offers information about areas of functionality where there are known differences.

+## Parquet Scans
+
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
+
+| Implementation          | Description                                                                                                                                                                            |
+| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types.                                                                     |
+| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`.                                                                                                                             |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.   |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
+provide the following benefits over the `native_comet` implementation:
+
+- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
+- Provide support for reading complex types (structs, arrays, and maps)
+- Remove the use of reusable mutable buffers in Comet, which are complex to maintain
+
+These new implementations are not yet complete. Some of the current limitations are:
+
+- Scanning Parquet files containing unsigned 8 or 16-bit integers can produce results that don't match Spark. By default, Comet
+  will fall back to Spark when using these scan implementations to read Parquet files containing 8 or 16-bit integers.
+  This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
+- These implementations do not yet fully support timestamps, decimals, or complex types.
+
 ## ANSI mode

 Comet currently ignores ANSI mode in most cases, and therefore can produce different results than Spark. By default,
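To make the new documentation concrete, here is a minimal sketch of how a user might opt into one of the experimental scans and re-enable the incompatible integer types. The two `spark.comet.scan.*` keys are documented in this commit; the session wiring around them, the app name, and the file path are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a Comet-enabled Spark build.
val spark = SparkSession
  .builder()
  .appName("comet-experimental-scan") // hypothetical app name
  .config("spark.comet.scan.impl", "native_datafusion") // or "native_iceberg_compat"
  .config("spark.comet.scan.allowIncompatible", "true") // opt out of the Spark fallback
  .getOrCreate()

// With allowIncompatible=true, Parquet files containing 8- or 16-bit integers
// are read by the experimental scan even though results may differ from Spark.
val df = spark.read.parquet("/tmp/example.parquet") // hypothetical path
```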

docs/source/user-guide/configs.md

Lines changed: 7 additions & 1 deletion
@@ -17,6 +17,12 @@ specific language governing permissions and limitations
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/configs-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/configs.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Comet Configuration Settings

 Comet provides the following configuration settings.
@@ -76,7 +82,7 @@ Comet provides the following configuration settings.
 | spark.comet.parquet.read.parallel.io.enabled | Whether to enable Comet's parallel reader for Parquet files. The parallel reader reads ranges of consecutive data in a file in parallel. It is faster for large files and row groups but uses more resources. | true |
 | spark.comet.parquet.read.parallel.io.thread-pool.size | The maximum number of parallel threads the parallel reader will use in a single executor. For executors configured with a smaller number of cores, use a smaller number. | 16 |
 | spark.comet.regexp.allowIncompatible | Comet is not currently fully compatible with Spark for all regular expressions. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
-| spark.comet.scan.allowIncompatible | Comet is not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | true |
+| spark.comet.scan.allowIncompatible | Comet is not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the Comet Compatibility Guide (https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
 | spark.comet.scan.enabled | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | true |
 | spark.comet.scan.preFetch.enabled | Whether to enable pre-fetching feature of CometScan. | false |
 | spark.comet.scan.preFetch.threadNum | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |

docs/templates/compatibility-template.md

Lines changed: 33 additions & 2 deletions
@@ -17,12 +17,43 @@
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/compatibility-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/compatibility.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Compatibility Guide

 Comet aims to provide consistent results with the version of Apache Spark that is being used.

 This guide offers information about areas of functionality where there are known differences.

+## Parquet Scans
+
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
+
+| Implementation          | Description                                                                                                                                                                            |
+| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types.                                                                     |
+| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`.                                                                                                                             |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.   |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
+provide the following benefits over the `native_comet` implementation:
+
+- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
+- Provide support for reading complex types (structs, arrays, and maps)
+- Remove the use of reusable mutable buffers in Comet, which are complex to maintain
+
+These new implementations are not yet complete. Some of the current limitations are:
+
+- Scanning Parquet files containing unsigned 8 or 16-bit integers can produce results that don't match Spark. By default, Comet
+  will fall back to Spark when using these scan implementations to read Parquet files containing 8 or 16-bit integers.
+  This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
+- These implementations do not yet fully support timestamps, decimals, or complex types.
+
 ## ANSI mode

 Comet currently ignores ANSI mode in most cases, and therefore can produce different results than Spark. By default,
@@ -47,7 +78,7 @@ will fall back to Spark but can be enabled by setting `spark.comet.expression.al

 ## Array Expressions

-Comet has experimental support for a number of array expressions. These are experimental and currently marked
+Comet has experimental support for a number of array expressions. These are experimental and currently marked
 as incompatible and can be enabled by setting `spark.comet.expression.allowIncompatible=true`.

 ## Regular Expressions
@@ -82,5 +113,5 @@ The following cast operations are not compatible with Spark for all inputs and a

 ### Unsupported Casts

-Any cast not listed in the previous tables is currently unsupported. We are working on adding more. See the
+Any cast not listed in the previous tables is currently unsupported. We are working on adding more. See the
 [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.

docs/templates/configs-template.md

Lines changed: 6 additions & 0 deletions
@@ -17,6 +17,12 @@
 under the License.
 -->

+<!--
+TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+(docs/templates/configs-template.md) AND NOT THE GENERATED FILE
+(docs/source/user-guide/configs.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
 # Comet Configuration Settings

 Comet provides the following configuration settings.

spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala

Lines changed: 1 addition & 5 deletions
@@ -1352,15 +1352,11 @@ object CometSparkSessionExtensions extends Logging {
     org.apache.spark.SPARK_VERSION >= "4.0"
   }

-  def isComplexTypeReaderEnabled(conf: SQLConf): Boolean = {
+  def usingDataFusionParquetExec(conf: SQLConf): Boolean = {
     CometConf.COMET_NATIVE_SCAN_IMPL.get(conf) == CometConf.SCAN_NATIVE_ICEBERG_COMPAT ||
     CometConf.COMET_NATIVE_SCAN_IMPL.get(conf) == CometConf.SCAN_NATIVE_DATAFUSION
   }

-  def usingDataFusionParquetReader(conf: SQLConf): Boolean = {
-    isComplexTypeReaderEnabled(conf) && !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get(conf)
-  }
-
   /** Calculates required memory overhead in MB per executor process for Comet. */
   def getCometMemoryOverheadInMiB(sparkConf: SparkConf): Long = {
     val baseMemoryMiB = if (cometUnifiedMemoryManagerEnabled(sparkConf)) {
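With `usingDataFusionParquetReader` removed, call sites that need the combined check now inline it against the renamed predicate. A sketch of the pattern, using only names that appear in this commit (CometCastSuite below adopts exactly this shape):

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.comet.{CometConf, CometSparkSessionExtensions}

// The check formerly provided by the removed usingDataFusionParquetReader
// helper: "using a DataFusion-backed scan AND incompatible types not allowed".
def usingParquetExecWithIncompatTypes(conf: SQLConf): Boolean =
  CometSparkSessionExtensions.usingDataFusionParquetExec(conf) &&
    !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get(conf)
```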

spark/src/main/scala/org/apache/comet/DataTypeSupport.scala

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ trait DataTypeSupport {

   private def isGloballySupported(dt: DataType): Boolean = dt match {
     case ByteType | ShortType
-        if CometSparkSessionExtensions.isComplexTypeReaderEnabled(SQLConf.get) &&
+        if CometSparkSessionExtensions.usingDataFusionParquetExec(SQLConf.get) &&
           !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() =>
       false
     case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
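The effect of this guard, combined with the new default, is that byte and short columns are reported as unsupported under the experimental scans, so such reads fall back to Spark. An illustrative snippet (the `spark` session and the path are assumptions; the fallback behavior is as described in this commit's compatibility docs):

```scala
// Illustrative only: assumes spark.comet.scan.impl=native_datafusion and the
// new default spark.comet.scan.allowIncompatible=false.
spark.range(10)
  .selectExpr("cast(id as byte) as b", "cast(id as short) as s")
  .write
  .mode("overwrite")
  .parquet("/tmp/bytes_and_shorts") // hypothetical path

// isGloballySupported returns false for ByteType/ShortType here, so this
// read is executed by Spark's own Parquet reader rather than Comet's scan.
val df = spark.read.parquet("/tmp/bytes_and_shorts")
df.show()
```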

spark/src/test/scala/org/apache/comet/CometCastSuite.scala

Lines changed: 23 additions & 22 deletions
@@ -59,8 +59,9 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {

   private val timestampPattern = "0123456789/:T" + whitespaceChars

-  lazy val usingDataFusionParquetReader: Boolean =
-    CometSparkSessionExtensions.usingDataFusionParquetReader(conf)
+  lazy val usingParquetExecWithIncompatTypes: Boolean =
+    CometSparkSessionExtensions.usingDataFusionParquetExec(conf) &&
+      !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get(conf)

   test("all valid cast combinations covered") {
     val names = testNames
@@ -151,71 +152,71 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     castTest(
       generateBytes(),
       DataTypes.BooleanType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to ShortType") {
     castTest(
       generateBytes(),
       DataTypes.ShortType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to IntegerType") {
     castTest(
       generateBytes(),
       DataTypes.IntegerType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to LongType") {
     castTest(
       generateBytes(),
       DataTypes.LongType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to FloatType") {
     castTest(
       generateBytes(),
       DataTypes.FloatType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to DoubleType") {
     castTest(
       generateBytes(),
       DataTypes.DoubleType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to DecimalType(10,2)") {
     castTest(
       generateBytes(),
       DataTypes.createDecimalType(10, 2),
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ByteType to StringType") {
     castTest(
       generateBytes(),
       DataTypes.StringType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   ignore("cast ByteType to BinaryType") {
     castTest(
       generateBytes(),
       DataTypes.BinaryType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   ignore("cast ByteType to TimestampType") {
     // input: -1, expected: 1969-12-31 15:59:59.0, actual: 1969-12-31 15:59:59.999999
     castTest(
       generateBytes(),
       DataTypes.TimestampType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   // CAST from ShortType
@@ -224,72 +225,72 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     castTest(
       generateShorts(),
       DataTypes.BooleanType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to ByteType") {
     // https://github.com/apache/datafusion-comet/issues/311
     castTest(
       generateShorts(),
       DataTypes.ByteType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to IntegerType") {
     castTest(
       generateShorts(),
       DataTypes.IntegerType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to LongType") {
     castTest(
       generateShorts(),
       DataTypes.LongType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to FloatType") {
     castTest(
       generateShorts(),
       DataTypes.FloatType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to DoubleType") {
     castTest(
       generateShorts(),
       DataTypes.DoubleType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to DecimalType(10,2)") {
     castTest(
       generateShorts(),
       DataTypes.createDecimalType(10, 2),
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   test("cast ShortType to StringType") {
     castTest(
       generateShorts(),
       DataTypes.StringType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   ignore("cast ShortType to BinaryType") {
     castTest(
       generateShorts(),
       DataTypes.BinaryType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   ignore("cast ShortType to TimestampType") {
     // input: -1003, expected: 1969-12-31 15:43:17.0, actual: 1969-12-31 15:59:59.998997
     castTest(
       generateShorts(),
       DataTypes.TimestampType,
-      hasIncompatibleType = usingDataFusionParquetReader)
+      hasIncompatibleType = usingParquetExecWithIncompatTypes)
   }

   // CAST from integer

spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
       Byte.MaxValue)
     withParquetTable(path.toString, "tbl") {
       val qry = "select _9 from tbl order by _11"
-      if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
+      if (CometSparkSessionExtensions.usingDataFusionParquetExec(conf)) {
         if (!allowIncompatible) {
           checkSparkAnswer(qry)
         } else {

spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala

Lines changed: 3 additions & 3 deletions
@@ -139,7 +139,7 @@ abstract class ParquetReadSuite extends CometTestBase {
           i.toDouble,
           DateTimeUtils.toJavaDate(i))
       }
-      if (!CometSparkSessionExtensions.isComplexTypeReaderEnabled(
+      if (!CometSparkSessionExtensions.usingDataFusionParquetExec(
           conf) || CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get()) {
         checkParquetScan(data)
       }
@@ -162,7 +162,7 @@ abstract class ParquetReadSuite extends CometTestBase {
           i.toDouble,
           DateTimeUtils.toJavaDate(i))
       }
-      if (!CometSparkSessionExtensions.isComplexTypeReaderEnabled(
+      if (!CometSparkSessionExtensions.usingDataFusionParquetExec(
          conf) || CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get()) {
         checkParquetScan(data)
       }
@@ -184,7 +184,7 @@ abstract class ParquetReadSuite extends CometTestBase {
           DateTimeUtils.toJavaDate(i))
       }
       val filter = (row: Row) => row.getBoolean(0)
-      if (!CometSparkSessionExtensions.isComplexTypeReaderEnabled(
+      if (!CometSparkSessionExtensions.usingDataFusionParquetExec(
          conf) || CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get()) {
         checkParquetScan(data, filter)
       }
