-
Notifications
You must be signed in to change notification settings - Fork 267
fix: regressions in CometToPrettyStringSuite
#2384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2384 +/- ##
============================================
+ Coverage 56.12% 57.44% +1.31%
- Complexity 976 1297 +321
============================================
Files 119 147 +28
Lines 11743 13419 +1676
Branches 2251 2349 +98
============================================
+ Hits 6591 7708 +1117
- Misses 4012 4450 +438
- Partials 1140 1261 +121 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
native/spark-expr/Cargo.toml
Outdated
| futures = { workspace = true } | ||
| twox-hash = "2.1.2" | ||
| rand = { workspace = true } | ||
| datafusion-comet-proto = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently publish the datafusion-comet-spark-expr crate to crates.io, so adding a dependency on datafusion-comet-proto means that we either need to stop publishing datafusion-comet-spark-expr (which was the eventual plan anyway, see #2405) or we need to start publishing datafusion-comet-proto as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to add a new BinaryOutputStyle enum in datafusion-comet-spark-expr and then map from proto to that enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andygrove Thank you for your review, let me think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 308af77, I moved the BinaryOutputStyle proto -> enum mapping from spark-expr to core so that I don't need to keep proto dependency in spark-expr. The code is cleaner as well.
Thank you for your suggestions!
| test("cast BinaryType to StringType") { | ||
| // https://github.com/apache/datafusion-comet/issues/377 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove the link to the issue now that the test is enabled?
| import java.text.SimpleDateFormat | ||
| import scala.util.Random | ||
|
|
||
| class CometToPrettyStringSuite extends CometTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test could now extend CometFuzzTestBase and then it would not need to implement beforeAll to generate the input data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Thanks @hsiang-c!
* Introduce BinaryOutputStyle from Spark 4.0 * Allow casting from binary to string * Pass binaryOutputStyle to query plan serde * Take binaryOutputStyle in planner * Implement Spark-style ToPrettyString * Match file name w/ test name * Test all 5 BinaryOutputStyle in Spark 4.0 * Fix package: 'org.apache.sql' -> 'org.apache.spark.sql' * Add CometToPrettyStringSuite back to CI * Specify binaryOutputStyle for Spark 3.4 * Let Comet deal with non pretty string casting * Enable binary to string casting test * Attempt to fix the build; ToPrettyString is Spark 3.5+ * Removed resolved issues * Type casting only function * Extract test setup logic to CometFuzzTestBase * Move binary_output_style proto <-> enum mapping to core * Move BinaryOutputStyle from cast.rs to lib.rs * Remove incorrect comments
Which issue does this PR close?
CometToPrettyStringSuite#2307Rationale for this change
CometToPrettyStringSuite4.0.0, the default format of pretty printingbinaryis by the following method:BinaryFormatterwith 5BinaryOutputStyle:UTF8,BASIC,BASE64,HEX,HEX_DISCRETEis used to display binary data.HEX_DISCRETEis the backward compatible style.BinaryFormatteris configurable bySQLConf.BINARY_OUTPUT_STYLE:getHexStringmethod was defined in StringUtils (Spark 3.5) and will be merged to SparkStringUtils (Spark 4.1.0)What changes are included in this PR?
BinaryaCompatibletype in CometCast.scalaBinaryOutputStyleinexpr.protoso that we can passSQLConf.BINARY_OUTPUT_STYLEfromQueryPlanSerdetoplanner.rsas part ofspark_cast_options. For Spark3.4and3.5, always useHEX_DISCRETEfor backward compatibility.BinaryOutputStyleenum inspark-expr/lib.rsand a mapping from Protocol Buffer enum to Rust enum inplanner.rsso thatspark-exprcrate doesn't have to depend onprotocrate.cast.rs, supported one additional case(Binary, Utf8)and mimic Spark 4.0'sBinaryFormatter.binary_to_string functionisCometCastand I letbinary_output_styleto beNoneto useCometCast-specificbinary_to_stringlogic, which performs unsafefrom_utf8_uncheckedconversion when the input is an INVALID UTF8 string.How are these changes tested?
spark-4.0, tested all 5BinaryOutputStyleand compared the result w/ and w/o CometCometToPrettyStringSuitefromdev/ci/check-suites.pyToPrettyStringSuitepassed.