Conversation

@hsiang-c (Contributor) commented Sep 11, 2025

Which issue does this PR close?

Rationale for this change

  • Revive and fix CometToPrettyStringSuite
  • Before Spark 4.0.0, binary values were pretty-printed by the following method:
  /**
   * Returns a pretty string of the byte array which prints each byte as a hex digit and add spaces
   * between them. For example, [1A C0].
   */
  def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
  • With SPARK-47911 (https://issues.apache.org/jira/browse/SPARK-47911), a universal BinaryFormatter with five BinaryOutputStyle values (UTF8, BASIC, BASE64, HEX, HEX_DISCRETE) is used to display binary data.
  • HEX_DISCRETE is the backward-compatible style.
  • BinaryFormatter is configured via SQLConf.BINARY_OUTPUT_STYLE:
    val style = SQLConf.get.getConf(SQLConf.BINARY_OUTPUT_STYLE)
    style.map(BinaryOutputStyle.withName) match {
      case Some(BinaryOutputStyle.UTF8) =>
        (array: Array[Byte]) => UTF8String.fromBytes(array)
      case Some(BinaryOutputStyle.BASIC) =>
        (array: Array[Byte]) => UTF8String.fromString(array.mkString("[", ", ", "]"))
      case Some(BinaryOutputStyle.BASE64) =>
        (array: Array[Byte]) =>
          UTF8String.fromString(java.util.Base64.getEncoder.withoutPadding().encodeToString(array))
      case Some(BinaryOutputStyle.HEX) =>
        (array: Array[Byte]) => Hex.hex(array)
      case _ =>
        (array: Array[Byte]) => UTF8String.fromString(SparkStringUtils.getHexString(array))
    }
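To make the five styles concrete, here is a minimal Rust sketch (not Comet's or Spark's actual code; the enum and function names are illustrative) of what each style produces for the bytes [0x1A, 0xC0]:

```rust
/// Illustrative stand-in for Spark's BinaryOutputStyle.
#[derive(Debug, Clone, Copy)]
enum BinaryOutputStyle {
    Utf8,
    Basic,
    Base64,
    Hex,
    HexDiscrete,
}

fn format_binary(bytes: &[u8], style: BinaryOutputStyle) -> String {
    match style {
        // Interpret the bytes directly as a (lossy) UTF-8 string.
        BinaryOutputStyle::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        // Signed decimal bytes, like Scala's Array[Byte].mkString("[", ", ", "]").
        BinaryOutputStyle::Basic => {
            let parts: Vec<String> = bytes.iter().map(|b| (*b as i8).to_string()).collect();
            format!("[{}]", parts.join(", "))
        }
        // Base64 without padding, like Base64.getEncoder.withoutPadding().
        BinaryOutputStyle::Base64 => base64_no_pad(bytes),
        // Contiguous uppercase hex digits.
        BinaryOutputStyle::Hex => bytes.iter().map(|b| format!("{b:02X}")).collect(),
        // Space-separated hex bytes in brackets: the pre-4.0 getHexString format.
        BinaryOutputStyle::HexDiscrete => {
            let parts: Vec<String> = bytes.iter().map(|b| format!("{b:02X}")).collect();
            format!("[{}]", parts.join(" "))
        }
    }
}

// Minimal unpadded base64 encoder (standard alphabet), for this sketch only.
fn base64_no_pad(bytes: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        let mut buf = [0u8; 3];
        buf[..chunk.len()].copy_from_slice(chunk);
        let n = (u32::from(buf[0]) << 16) | (u32::from(buf[1]) << 8) | u32::from(buf[2]);
        // 1 input byte -> 2 output chars, 2 -> 3, 3 -> 4.
        for i in 0..chunk.len() + 1 {
            out.push(TABLE[((n >> (18 - 6 * i)) & 0x3F) as usize] as char);
        }
    }
    out
}

fn main() {
    let bytes = [0x1Au8, 0xC0];
    for style in [
        BinaryOutputStyle::Basic,
        BinaryOutputStyle::Base64,
        BinaryOutputStyle::Hex,
        BinaryOutputStyle::HexDiscrete,
    ] {
        println!("{:?} -> {}", style, format_binary(&bytes, style));
    }
}
```

HEX_DISCRETE yields "[1A C0]", matching the pre-4.0 getHexString output shown above, which is why it is the backward-compatible default.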

What changes are included in this PR?

  • Made Binary a Compatible type in CometCast.scala
  • Defined BinaryOutputStyle in expr.proto so that SQLConf.BINARY_OUTPUT_STYLE can be passed from QueryPlanSerde to planner.rs as part of spark_cast_options. For Spark 3.4 and 3.5, HEX_DISCRETE is always used for backward compatibility.
  • Defined a corresponding BinaryOutputStyle enum in spark-expr/lib.rs and a mapping from the Protocol Buffers enum to the Rust enum in planner.rs, so that the spark-expr crate does not have to depend on the proto crate.
  • In cast.rs, supported one additional case, (Binary, Utf8), mimicking Spark 4.0's BinaryFormatter.
  • The other path to the binary_to_string function is CometCast; there, binary_output_style is left as None so that the CometCast-specific binary_to_string logic is used, which performs an unsafe from_utf8_unchecked conversion when the input is an invalid UTF-8 string.
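The decoupling described above (spark-expr defines its own enum; only the crate that sees both performs the conversion) can be sketched as follows. The module and function names here are illustrative stand-ins, not Comet's actual definitions:

```rust
// Stand-in for the generated protobuf enum (datafusion-comet-proto).
mod proto {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum BinaryOutputStyle { Utf8, Basic, Base64, Hex, HexDiscrete }
}

// Stand-in for the independent enum in datafusion-comet-spark-expr:
// no dependency on the proto crate.
mod spark_expr {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum BinaryOutputStyle { Utf8, Basic, Base64, Hex, HexDiscrete }
}

// The mapping lives in the planner crate, the only place that sees both enums.
fn to_spark_expr_style(s: proto::BinaryOutputStyle) -> spark_expr::BinaryOutputStyle {
    use proto::BinaryOutputStyle as P;
    use spark_expr::BinaryOutputStyle as E;
    match s {
        P::Utf8 => E::Utf8,
        P::Basic => E::Basic,
        P::Base64 => E::Base64,
        P::Hex => E::Hex,
        P::HexDiscrete => E::HexDiscrete,
    }
}

fn main() {
    let style = to_spark_expr_style(proto::BinaryOutputStyle::HexDiscrete);
    println!("{:?}", style);
}
```

The exhaustive match means a new style added to the proto enum fails to compile until the mapping is updated, which keeps the two enums in sync.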

How are these changes tested?

  • In Spark 4.0, tested all five BinaryOutputStyle values and compared the results with and without Comet.
  • Removed CometToPrettyStringSuite from dev/ci/check-suites.py.
  • In Spark's catalyst tests, ToPrettyStringSuite passed.

@codecov-commenter commented Sep 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.44%. Comparing base (f09f8af) to head (374d113).
⚠️ Report is 497 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2384      +/-   ##
============================================
+ Coverage     56.12%   57.44%   +1.31%     
- Complexity      976     1297     +321     
============================================
  Files           119      147      +28     
  Lines         11743    13419    +1676     
  Branches       2251     2349      +98     
============================================
+ Hits           6591     7708    +1117     
- Misses         4012     4450     +438     
- Partials       1140     1261     +121     


futures = { workspace = true }
twox-hash = "2.1.2"
rand = { workspace = true }
datafusion-comet-proto = { workspace = true }
Member:

We currently publish the datafusion-comet-spark-expr crate to crates.io, so adding a dependency on datafusion-comet-proto means that we either need to stop publishing datafusion-comet-spark-expr (which was the eventual plan anyway, see #2405) or we need to start publishing datafusion-comet-proto as well.

Member:

Another option is to add a new BinaryOutputStyle enum in datafusion-comet-spark-expr and then map from proto to that enum.

Contributor Author:

@andygrove Thank you for your review, let me think about it.

Contributor Author:

In 308af77, I moved the BinaryOutputStyle proto-to-enum mapping from spark-expr to core so that I don't need to keep the proto dependency in spark-expr. The code is cleaner as well.

Thank you for your suggestions!

Comment on lines 830 to 831
test("cast BinaryType to StringType") {
// https://github.com/apache/datafusion-comet/issues/377
Member:

Can we remove the link to the issue now that the test is enabled?

import java.text.SimpleDateFormat
import scala.util.Random

class CometToPrettyStringSuite extends CometTestBase {
Member:

This test could now extend CometFuzzTestBase and then it would not need to implement beforeAll to generate the input data.

@hsiang-c (Contributor Author) commented Sep 16, 2025:

Good idea!

@andygrove (Member) left a comment:

This is great. Thanks @hsiang-c!

@mbutrovich mbutrovich merged commit 34daa54 into apache:main Sep 20, 2025
60 checks passed
@hsiang-c hsiang-c deleted the pretty_string branch September 21, 2025 09:38
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
* Introduce BinaryOutputStyle from Spark 4.0

* Allow casting from binary to string

* Pass binaryOutputStyle to query plan serde

* Take binaryOutputStyle in planner

* Implement Spark-style ToPrettyString

* Match file name w/ test name

* Test all 5 BinaryOutputStyle in Spark 4.0

* Fix package: 'org.apache.sql' -> 'org.apache.spark.sql'

* Add CometToPrettyStringSuite back to CI

* Specify binaryOutputStyle for Spark 3.4

* Let Comet deal with non pretty string casting

* Enable binary to string casting test

* Attempt to fix the build; ToPrettyString is Spark 3.5+

* Removed resolved issues

* Type casting only function

* Extract test setup logic to CometFuzzTestBase

* Move binary_output_style proto <-> enum mapping to core

* Move BinaryOutputStyle from cast.rs to lib.rs

* Remove incorrect comments

Development

Successfully merging this pull request may close these issues.

  • Fix regressions in CometToPrettyStringSuite
  • Implement Spark-compatible cast to/from binary type

4 participants