Conversation

@hsiang-c (Contributor) commented Sep 11, 2025

Which issue does this PR close?

Rationale for this change

  • Revive and fix CometToPrettyStringSuite
  • Before Spark 4.0.0, binary values were pretty-printed by the following method:
  /**
   * Returns a pretty string of the byte array which prints each byte as a hex digit and add spaces
   * between them. For example, [1A C0].
   */
  def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
  • With SPARK-47911 (https://issues.apache.org/jira/browse/SPARK-47911), a universal BinaryFormatter with five BinaryOutputStyle values (UTF8, BASIC, BASE64, HEX, HEX_DISCRETE) is used to display binary data.
  • HEX_DISCRETE is the backward-compatible style.
  • BinaryFormatter is configured via SQLConf.BINARY_OUTPUT_STYLE:
    val style = SQLConf.get.getConf(SQLConf.BINARY_OUTPUT_STYLE)
    style.map(BinaryOutputStyle.withName) match {
      case Some(BinaryOutputStyle.UTF8) =>
        (array: Array[Byte]) => UTF8String.fromBytes(array)
      case Some(BinaryOutputStyle.BASIC) =>
        (array: Array[Byte]) => UTF8String.fromString(array.mkString("[", ", ", "]"))
      case Some(BinaryOutputStyle.BASE64) =>
        (array: Array[Byte]) =>
          UTF8String.fromString(java.util.Base64.getEncoder.withoutPadding().encodeToString(array))
      case Some(BinaryOutputStyle.HEX) =>
        (array: Array[Byte]) => Hex.hex(array)
      case _ =>
        (array: Array[Byte]) => UTF8String.fromString(SparkStringUtils.getHexString(array))
    }
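To make the five styles concrete, here is a minimal Rust sketch (not Comet's or Spark's actual code; the enum and function names are illustrative) of what each style produces for the bytes [0x1A, 0xC0]:

```rust
/// Illustrative stand-in for Spark's BinaryOutputStyle.
#[derive(Debug, Clone, Copy)]
enum BinaryOutputStyle {
    Utf8,
    Basic,
    Base64,
    Hex,
    HexDiscrete,
}

fn format_binary(bytes: &[u8], style: BinaryOutputStyle) -> String {
    match style {
        // Interpret the bytes directly as a (lossy) UTF-8 string.
        BinaryOutputStyle::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        // Signed decimal bytes, like Scala's Array[Byte].mkString("[", ", ", "]").
        BinaryOutputStyle::Basic => {
            let parts: Vec<String> = bytes.iter().map(|b| (*b as i8).to_string()).collect();
            format!("[{}]", parts.join(", "))
        }
        // Base64 without padding, like Base64.getEncoder.withoutPadding().
        BinaryOutputStyle::Base64 => base64_no_pad(bytes),
        // Contiguous uppercase hex digits.
        BinaryOutputStyle::Hex => bytes.iter().map(|b| format!("{b:02X}")).collect(),
        // Space-separated hex bytes in brackets: the pre-4.0 getHexString format.
        BinaryOutputStyle::HexDiscrete => {
            let parts: Vec<String> = bytes.iter().map(|b| format!("{b:02X}")).collect();
            format!("[{}]", parts.join(" "))
        }
    }
}

// Minimal unpadded base64 encoder (standard alphabet), for this sketch only.
fn base64_no_pad(bytes: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        let mut buf = [0u8; 3];
        buf[..chunk.len()].copy_from_slice(chunk);
        let n = (u32::from(buf[0]) << 16) | (u32::from(buf[1]) << 8) | u32::from(buf[2]);
        // 1 input byte -> 2 output chars, 2 -> 3, 3 -> 4.
        for i in 0..chunk.len() + 1 {
            out.push(TABLE[((n >> (18 - 6 * i)) & 0x3F) as usize] as char);
        }
    }
    out
}

fn main() {
    let bytes = [0x1Au8, 0xC0];
    for style in [
        BinaryOutputStyle::Basic,
        BinaryOutputStyle::Base64,
        BinaryOutputStyle::Hex,
        BinaryOutputStyle::HexDiscrete,
    ] {
        println!("{:?} -> {}", style, format_binary(&bytes, style));
    }
}
```

HEX_DISCRETE yields "[1A C0]", matching the pre-4.0 getHexString output shown above, which is why it is the backward-compatible default.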

What changes are included in this PR?

  • Made Binary a Compatible type in CometCast.scala
  • Defined BinaryOutputStyle in expr.proto so that SQLConf.BINARY_OUTPUT_STYLE can be passed from QueryPlanSerde to planner.rs as part of spark_cast_options. For Spark 3.4 and 3.5, HEX_DISCRETE is always used for backward compatibility.
  • Defined a corresponding BinaryOutputStyle enum in spark-expr/lib.rs and a mapping from the Protocol Buffers enum to the Rust enum in planner.rs, so that the spark-expr crate does not have to depend on the proto crate.
  • In cast.rs, supported one additional case, (Binary, Utf8), mimicking Spark 4.0's BinaryFormatter.
  • The other path to the binary_to_string function is CometCast; there, binary_output_style is left as None so that the CometCast-specific binary_to_string logic is used, which performs an unsafe from_utf8_unchecked conversion when the input is an invalid UTF-8 string.
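The decoupling described above (spark-expr defines its own enum; only the crate that sees both performs the conversion) can be sketched as follows. The module and function names here are illustrative stand-ins, not Comet's actual definitions:

```rust
// Stand-in for the generated protobuf enum (datafusion-comet-proto).
mod proto {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum BinaryOutputStyle { Utf8, Basic, Base64, Hex, HexDiscrete }
}

// Stand-in for the independent enum in datafusion-comet-spark-expr:
// no dependency on the proto crate.
mod spark_expr {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum BinaryOutputStyle { Utf8, Basic, Base64, Hex, HexDiscrete }
}

// The mapping lives in the planner crate, the only place that sees both enums.
fn to_spark_expr_style(s: proto::BinaryOutputStyle) -> spark_expr::BinaryOutputStyle {
    use proto::BinaryOutputStyle as P;
    use spark_expr::BinaryOutputStyle as E;
    match s {
        P::Utf8 => E::Utf8,
        P::Basic => E::Basic,
        P::Base64 => E::Base64,
        P::Hex => E::Hex,
        P::HexDiscrete => E::HexDiscrete,
    }
}

fn main() {
    let style = to_spark_expr_style(proto::BinaryOutputStyle::HexDiscrete);
    println!("{:?}", style);
}
```

The exhaustive match means a new style added to the proto enum fails to compile until the mapping is updated, which keeps the two enums in sync.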

How are these changes tested?

  • In Spark 4.0, tested all five BinaryOutputStyle values and compared the results with and without Comet.
  • Removed CometToPrettyStringSuite from dev/ci/check-suites.py.
  • In Spark's catalyst tests, ToPrettyStringSuite passed.

@codecov-commenter commented Sep 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.44%. Comparing base (f09f8af) to head (374d113).
⚠️ Report is 497 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2384      +/-   ##
============================================
+ Coverage     56.12%   57.44%   +1.31%     
- Complexity      976     1297     +321     
============================================
  Files           119      147      +28     
  Lines         11743    13419    +1676     
  Branches       2251     2349      +98     
============================================
+ Hits           6591     7708    +1117     
- Misses         4012     4450     +438     
- Partials       1140     1261     +121     


futures = { workspace = true }
twox-hash = "2.1.2"
rand = { workspace = true }
datafusion-comet-proto = { workspace = true }
Member:

We currently publish the datafusion-comet-spark-expr crate to crates.io, so adding a dependency on datafusion-comet-proto means that we either need to stop publishing datafusion-comet-spark-expr (which was the eventual plan anyway, see #2405) or we need to start publishing datafusion-comet-proto as well.

Member:

Another option is to add a new BinaryOutputStyle enum in datafusion-comet-spark-expr and then map from proto to that enum.

Contributor Author:

@andygrove Thank you for your review, let me think about it.

Contributor Author:

In 308af77, I moved the BinaryOutputStyle proto-to-enum mapping from spark-expr to core so that I don't need to keep the proto dependency in spark-expr. The code is cleaner as well.

Thank you for your suggestions!

Comment on lines 830 to 831
test("cast BinaryType to StringType") {
// https://github.com/apache/datafusion-comet/issues/377
Member:

Can we remove the link to the issue now that the test is enabled?

import java.text.SimpleDateFormat
import scala.util.Random

class CometToPrettyStringSuite extends CometTestBase {
Member:

This test could now extend CometFuzzTestBase and then it would not need to implement beforeAll to generate the input data.

@hsiang-c (Contributor Author) commented Sep 16, 2025:

Good idea!

@andygrove (Member) left a comment:

This is great. Thanks @hsiang-c!

@mbutrovich mbutrovich merged commit 34daa54 into apache:main Sep 20, 2025
60 checks passed
@hsiang-c hsiang-c deleted the pretty_string branch September 21, 2025 09:38
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
* Introduce BinaryOutputStyle from Spark 4.0

* Allow casting from binary to string

* Pass binaryOutputStyle to query plan serde

* Take binaryOutputStyle in planner

* Implement Spark-style ToPrettyString

* Match file name w/ test name

* Test all 5 BinaryOutputStyle in Spark 4.0

* Fix package: 'org.apache.sql' -> 'org.apache.spark.sql'

* Add CometToPrettyStringSuite back to CI

* Specify binaryOutputStyle for Spark 3.4

* Let Comet deal with non pretty string casting

* Enable binary to string casting test

* Attempt to fix the build; ToPrettyString is Spark 3.5+

* Removed resolved issues

* Type casting only function

* Extract test setup logic to CometFuzzTestBase

* Move binary_output_style proto <-> enum mapping to core

* Move BinaryOutputStyle from cast.rs to lib.rs

* Remove incorrect comments

Development

Successfully merging this pull request may close these issues.

  • Fix regressions in CometToPrettyStringSuite
  • Implement Spark-compatible cast to/from binary type

4 participants