@dheerajsngh
Collaborator

@dheerajsngh dheerajsngh commented Dec 22, 2025

This PR enables server-side data integrity validation for GCS uploads. It computes the CRC32C checksum of the uploaded data on the client and sends it as a trailing checksum header (x-goog-hash) with the final upload request, so GCS can validate the data before finalizing the object.
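
As a rough illustration of the checksum step described above, the sketch below keeps a running CRC32C over the uploaded chunks and formats it the way GCS expects in x-goog-hash: the prefix `crc32c=` followed by the base64 encoding of the 4-byte big-endian checksum. The class and method names are hypothetical and are not taken from this PR; the JDK's `java.util.zip.CRC32C` (Java 9+) stands in for whatever hasher the connector actually uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.CRC32C;

/** Hypothetical helper for illustration; not the connector's actual class. */
final class TrailingChecksumSketch {
  private final CRC32C crc = new CRC32C();

  /** Folds one uploaded chunk into the running whole-object checksum. */
  void update(byte[] chunk) {
    crc.update(chunk, 0, chunk.length);
  }

  /** Returns the x-goog-hash value: "crc32c=" + base64(big-endian 4-byte CRC32C). */
  String headerValue() {
    // ByteBuffer is big-endian by default, which matches the byte order GCS expects.
    byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
    return "crc32c=" + Base64.getEncoder().encodeToString(bigEndian);
  }

  public static void main(String[] args) {
    TrailingChecksumSketch sketch = new TrailingChecksumSketch();
    // Two chunks of the standard CRC check string "123456789".
    sketch.update("1234".getBytes(StandardCharsets.US_ASCII));
    sketch.update("56789".getBytes(StandardCharsets.US_ASCII));
    System.out.println(sketch.headerValue()); // prints crc32c=4waSgw==
  }
}
```

Because the hasher is updated chunk by chunk, the value is only complete once the last chunk has been written, which is why the header can only travel with the final request.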

Key Changes

  • New Interceptor: Added ChecksumHeaderInterceptor to detect the final upload chunk and inject the x-goog-hash header.
  • Configuration: Added fs.gs.write.trailing.checksum.enable (default: false) to control the feature.

Verification

  • Unit tests and integration tests

More details in the doc: go/enable-crc32c-in-gcs-connector

@dheerajsngh
Collaborator Author

/gcbrun

@gemini-code-assist

Summary of Changes

Hello @dheerajsngh, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Google Cloud Storage connector by implementing a trailing checksum mechanism for data uploads. This feature allows for server-side validation of data integrity by including a CRC32C checksum in the x-goog-hash header of the final upload request. A new configuration property fs.gs.write.trailing.checksum.enable provides control over this functionality, ensuring that data written to GCS maintains its integrity throughout the upload process.

Highlights

  • New Configuration for Trailing Checksum: Introduced a new configuration property, fs.gs.write.trailing.checksum.enable, allowing users to enable or disable the trailing checksum feature for Google Cloud Storage uploads. This property is integrated into the AsyncWriteChannelOptions.
  • Server-Side Data Integrity Validation: The connector now sends a CRC32C checksum in the x-goog-hash header with the final PUT request for GCS uploads. This enables Google Cloud Storage to perform server-side data integrity validation, ensuring data consistency.
  • Checksum Calculation Integration: The checksum calculation logic within AbstractGoogleAsyncWriteChannel has been updated to compute the CRC32C for the entire object when the trailing checksum feature is enabled, ensuring the full data stream is covered.
  • HTTP Request Interception: A new ChecksumHeaderInterceptor has been added to the HTTP request pipeline. This interceptor dynamically injects the calculated CRC32C checksum into the x-goog-hash header of the final PUT request, facilitating the server-side validation.
  • Comprehensive Testing: New integration tests have been added to verify the end-to-end functionality of writing data with and without the trailing checksums. Additionally, new unit tests cover the behavior of the ChecksumHeaderInterceptor and error handling for checksum mismatches.
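
The interception step in the highlights above could look roughly like the sketch below. It is an assumption-laden stand-in, not the PR's ChecksumHeaderInterceptor: headers are modeled as a plain `Map`, the class and method names are hypothetical, and the final chunk is recognized from the resumable-upload `Content-Range` header, where intermediate chunks of a streamed upload end in `/*` and the last chunk states the total size with its end offset equal to `total - 1`. The real interceptor may detect the final request differently.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of final-chunk detection and header injection. */
final class FinalChunkHeaderSketch {
  // Matches "bytes <start>-<end>/<total>"; ranges with an unknown total ("/*") do not match.
  private static final Pattern KNOWN_TOTAL_RANGE =
      Pattern.compile("bytes (\\d+)-(\\d+)/(\\d+)");

  /** True when the range ends at the last byte of an object whose total size is stated. */
  static boolean isFinalChunk(String contentRange) {
    if (contentRange == null) {
      return false;
    }
    Matcher m = KNOWN_TOTAL_RANGE.matcher(contentRange.trim());
    if (!m.matches()) {
      return false;
    }
    long end = Long.parseLong(m.group(2));
    long total = Long.parseLong(m.group(3));
    return end + 1 == total;
  }

  /** Adds x-goog-hash only to the final chunk, and only when a checksum is available. */
  static void intercept(Map<String, String> headers, String hashHeaderValue) {
    if (hashHeaderValue != null && isFinalChunk(headers.get("Content-Range"))) {
      headers.put("x-goog-hash", hashHeaderValue);
    }
  }
}
```

Keeping the detection in a small pure function makes the final-chunk rule easy to unit test independently of the HTTP client.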

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature for ensuring data integrity by implementing trailing checksum validation for uploads. The changes are well-structured, introducing a new configuration option, updating the write channel logic to compute checksums, and using an HTTP interceptor to add the checksum header to the final upload request. The use of a ThreadLocal to pass the hasher context is a clean solution, and the inclusion of comprehensive unit and integration tests is commendable. I've found one minor issue in a test file that I've commented on.
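
For readers unfamiliar with the ThreadLocal pattern mentioned in the review, here is a minimal sketch of how a hasher might be handed from the write channel to an interceptor running on the same thread. Everything in it is hypothetical: the class name, the use of `java.util.zip.CRC32C` as the stored type, and the `withHasher` helper are assumptions, and the PR's actual ChecksumContext may look quite different.

```java
import java.util.function.Supplier;
import java.util.zip.CRC32C;

/** Hypothetical thread-scoped holder for the hasher of the upload in progress. */
final class ChecksumContextSketch {
  private static final ThreadLocal<CRC32C> CURRENT = new ThreadLocal<>();

  /** Returns the hasher bound to the calling thread, or null if none is bound. */
  static CRC32C current() {
    return CURRENT.get();
  }

  /**
   * Binds the hasher for the duration of one request and always unbinds it afterwards,
   * so a pooled thread never leaks a stale hasher into an unrelated upload.
   */
  static <T> T withHasher(CRC32C hasher, Supplier<T> request) {
    CURRENT.set(hasher);
    try {
      return request.get();
    } finally {
      CURRENT.remove();
    }
  }
}
```

The trade-off of this pattern is that the value is visible only on the thread that set it, so it works only if the interceptor runs on the same thread that issues the upload request.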

config.set("fs.gs.write.parallel.composite.upload.part.file.cleanup.type", "NEVER");
config.set("fs.gs.write.parallel.composite.upload.part.file.name.prefix", "baz");
config.setBoolean("fs.gs.write.rolling.checksum.enable", true);
config.setBoolean("fs.gs.write.trailing.checksum", false);

medium

There's a typo in the configuration key. It should be fs.gs.write.trailing.checksum.enable instead of fs.gs.write.trailing.checksum. The test currently passes because the default value is false, which matches the assertion. However, it's not correctly testing the configuration property. Using the constant GCS_WRITE_TRAILING_CHECKSUM_ENABLE.getKey() would be a safer way to avoid such typos.

Suggested change
- config.setBoolean("fs.gs.write.trailing.checksum", false);
+ config.setBoolean("fs.gs.write.trailing.checksum.enable", false);
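
To show why the mistyped key goes unnoticed, the sketch below models the lookup with a plain `Map` instead of Hadoop's `Configuration`. The class, constant, and method names are hypothetical stand-ins; only the key string and its default of false come from this PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a config lookup with a default value. */
final class ConfigKeyTypoSketch {
  static final String TRAILING_CHECKSUM_KEY = "fs.gs.write.trailing.checksum.enable";
  static final boolean TRAILING_CHECKSUM_DEFAULT = false;

  /** Reads the flag, falling back to the default when the exact key is absent. */
  static boolean trailingChecksumEnabled(Map<String, String> config) {
    String raw = config.get(TRAILING_CHECKSUM_KEY);
    return raw == null ? TRAILING_CHECKSUM_DEFAULT : Boolean.parseBoolean(raw);
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    // Mistyped key: the value is stored but never read, so the default wins.
    config.put("fs.gs.write.trailing.checksum", "true");
    System.out.println(trailingChecksumEnabled(config)); // prints false

    // Using the shared constant removes the chance of a typo.
    config.put(TRAILING_CHECKSUM_KEY, "true");
    System.out.println(trailingChecksumEnabled(config)); // prints true
  }
}
```

A test that sets the mistyped key to false and asserts false passes for the wrong reason, because it would pass with any value under that key.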

@codecov

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 84.84848% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.91%. Comparing base (24b6408) to head (f53e7fb).

Files with missing lines                                 Patch %   Lines
...d/hadoop/util/AbstractGoogleAsyncWriteChannel.java    33.33%    0 Missing and 2 partials ⚠️
...e/cloud/hadoop/util/ChecksumHeaderInterceptor.java    88.88%    0 Missing and 2 partials ⚠️
.../com/google/cloud/hadoop/util/ChecksumContext.java    50.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1611      +/-   ##
============================================
+ Coverage     81.90%   81.91%   +0.01%     
- Complexity     2425     2432       +7     
============================================
  Files           128      130       +2     
  Lines         10819    10849      +30     
  Branches       1302     1307       +5     
============================================
+ Hits           8861     8887      +26     
- Misses         1415     1416       +1     
- Partials        543      546       +3     
Flag              Coverage Δ
integrationtest   67.06% <84.84%> (+0.10%) ⬆️
unittest          72.43% <84.84%> (+0.08%) ⬆️

@dheerajsngh dheerajsngh changed the title from "Implement trailing checksum in connector" to "feat: Implement trailing checksum in connector" Dec 22, 2025
@dheerajsngh
Collaborator Author

/gcbrun

@dheerajsngh
Collaborator Author

/gcbrun

@dheerajsngh
Collaborator Author

/gcbrun

@dheerajsngh
Collaborator Author

/gcbrun
