
Conversation

@thiyaguk09
Owner

@thiyaguk09 thiyaguk09 commented Dec 4, 2025

Description

This pull request significantly enhances data integrity for object uploads by introducing comprehensive client-side checksum validation. It ensures that data uploaded through the JSON API path remains consistent by verifying both client-provided and client-calculated CRC32C and MD5 hashes against the server's reported hashes. This mechanism proactively detects and prevents silent data corruption, providing a more reliable upload experience.

Impact

The primary impact is a substantial improvement in data integrity and reliability for object uploads, particularly those utilizing the JSON API path (e.g., small single-chunk uploads).

  • Prevents Silent Data Corruption: By validating client-side hashes (calculated or user-provided) against the server's reported hashes, the SDK proactively detects and stops data corruption that might otherwise go unnoticed.
  • Enhanced Reliability: The stream is immediately destroyed upon hash mismatch, preventing the client from assuming a corrupted file upload was successful.
  • Feature Parity: Enables the use of the X-Goog-Hash header, which is essential for ensuring the integrity of the data payload sent to the server.
  • API Usage: Introduces new configuration options (clientCrc32c, clientMd5Hash, crc32c, md5) that allow users fine-grained control over checksum validation behavior.
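Since the PR text names the options but not a full call site, here is a minimal sketch of the calculated-versus-provided precedence these options imply. `ChecksumConfig` and `resolveExpectedHashes` are illustrative names for this sketch, not the library's API; only the option names (`crc32c`, `md5`, `clientCrc32c`, `clientMd5Hash`) come from the PR.

```typescript
// Hypothetical shape of the checksum-related config fields described in
// this PR; the surrounding interface is an assumption for illustration.
interface ChecksumConfig {
  crc32c?: boolean;        // enable client-side CRC32C calculation
  md5?: boolean;           // enable client-side MD5 calculation
  clientCrc32c?: string;   // pre-calculated CRC32C (base64)
  clientMd5Hash?: string;  // pre-calculated MD5 (base64)
}

// Pick the hash values to validate against the server: calculated hashes
// (when enabled) take priority over client-provided ones.
function resolveExpectedHashes(
  config: ChecksumConfig,
  calculated: {crc32c?: string; md5?: string},
): {crc32c?: string; md5?: string} {
  return {
    crc32c: config.crc32c ? calculated.crc32c : config.clientCrc32c,
    md5: config.md5 ? calculated.md5 : config.clientMd5Hash,
  };
}

const expected = resolveExpectedHashes(
  {crc32c: true, clientMd5Hash: 'okwtmQ=='},
  {crc32c: 'AAAAAA=='},
);
// expected.crc32c comes from the client-side calculation;
// expected.md5 falls back to the user-provided value.
```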

Testing

  • Unit and Integration Tests Added? Yes. Extensive test coverage was added and expanded in test/resumable-upload.ts. This includes:

    • Verification of X-Goog-Hash header injection with calculated and client-provided hashes in single and multi-chunk uploads.
    • Validation tests to confirm that the stream is correctly destroyed on both CRC32C and MD5 checksum mismatches between the client and server.
    • A dedicated `describe` block, "Validation of Client Checksums Against Server Response", covering various success and failure scenarios.
  • Were any tests changed? Yes. Tests were expanded and refactored (e.g., checksum application logic tests).

  • Are any breaking changes necessary? No. This change introduces new features and configuration options but does not appear to break existing functionality. The changes are largely additive, with internal refactoring of hash handling and validation logic.

Additional Information

  • Refactoring for Robustness: The HashStreamValidator was improved by adding the md5Digest getter, which centralizes the MD5 calculation and caching logic. The related fix in _flush prevents a potential race condition where calling digest() multiple times would cause a runtime error.
  • Internal Consistency: Private helper methods (#validateChecksum, #applyChecksumHeaders) were introduced in resumable-upload.ts to simplify and clarify the hash handling logic, especially ensuring checksum headers are applied correctly across different upload paths.
  • Dependency on Configuration: Proper client-side validation requires users to explicitly enable calculation (crc32c: true, md5: true) or provide the pre-calculated hashes.
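The `md5Digest` caching pattern described above can be sketched as follows. `Md5Sketch` is an illustrative stand-in, not the library's `HashStreamValidator`; the point it demonstrates is real, though: Node's `crypto.Hash` throws if `digest()` is called a second time, so the getter memoizes the first result and `_flush` can safely reuse it.

```typescript
import {createHash, Hash} from 'node:crypto';

// Minimal sketch of the caching the PR adds: digest() may only be called
// once on a Node crypto Hash, so the getter caches the result.
class Md5Sketch {
  #md5: Hash = createHash('md5');
  #md5Digest?: string;

  update(chunk: Buffer | string) {
    this.#md5.update(chunk);
  }

  get md5Digest(): string {
    // Cache on first access so later calls (e.g. from _flush) do not
    // trigger a second digest() and a runtime error.
    this.#md5Digest ??= this.#md5.digest('base64');
    return this.#md5Digest;
  }
}

const v = new Md5Sketch();
v.update('hello world');
const first = v.md5Digest;
const second = v.md5Digest; // safe: returns the cached digest
```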

Checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease
  • Appropriate docs were updated
  • Appropriate comments were added, particularly in complex areas or places that require background
  • No new warnings or issues will be generated from this change

Fixes #

Adds validation for client-provided (pre-calculated) and
client-calculated CRC32C and MD5 hashes when the final upload request is
made via the JSON API path (status 200).

This ensures consistency checks are performed even when the `Upload`
stream is finalized, preventing silent data corruption if the
server-reported hash (in the response body) mismatches the client's
expected hash.
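The comparison this commit describes could look like the sketch below. The function name, error text, and exact shapes are illustrative, not the library's private `#validateChecksum`; the server-side field names (`crc32c`, `md5Hash` in the JSON response body) follow the commit's description.

```typescript
// Illustrative stand-in for the PR's #validateChecksum comparison.
interface ServerMetadata {
  crc32c?: string;
  md5Hash?: string;
}

function findChecksumMismatch(
  expected: {crc32c?: string; md5?: string},
  server: ServerMetadata,
): string | undefined {
  if (expected.crc32c && server.crc32c && expected.crc32c !== server.crc32c) {
    return `CRC32C mismatch: expected ${expected.crc32c}, server reported ${server.crc32c}`;
  }
  if (expected.md5 && server.md5Hash && expected.md5 !== server.md5Hash) {
    return `MD5 mismatch: expected ${expected.md5}, server reported ${server.md5Hash}`;
  }
  return undefined; // hashes agree (or were not supplied)
}

// On mismatch the Upload stream would be destroyed with this error, so
// the caller never sees a corrupted upload reported as successful.
const mismatch = findChecksumMismatch(
  {crc32c: 'AAAAAA=='},
  {crc32c: 'BBBBBB==', md5Hash: 'okwtmQ=='},
);
```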
@gemini-code-assist

Summary of Changes

Hello @thiyaguk09, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the reliability of object uploads by integrating comprehensive client-side checksum validation. It ensures that data uploaded through the JSON API path maintains integrity by verifying CRC32C and MD5 hashes, either provided by the client or calculated during the upload process, against the hashes reported by the server. This proactive approach aims to eliminate silent data corruption and provide immediate feedback on data consistency.

Highlights

  • Enhanced Data Integrity: Introduces robust client-side validation for CRC32C and MD5 hashes during object uploads via the JSON API path, preventing silent data corruption.
  • Configurable Checksum Options: Allows users to provide pre-calculated CRC32C and MD5 hashes or enable automatic client-side calculation for these checksums.
  • Checksum Mismatch Detection: Implements a mechanism to compare client-side hashes with server-reported hashes upon upload completion, destroying the upload stream if a mismatch is detected.
  • X-Goog-Hash Header Support: Adds the X-Goog-Hash header to the final upload request, including the calculated or provided CRC32C and MD5 hashes for server-side verification.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds checksum validation for uploads using the JSON API path. The changes introduce logic to calculate CRC32C and MD5 hashes on the client side and validate them against the server's response. My review has identified one critical issue and one area for improvement. There is a critical bug in hash-stream-validator.ts where hash.digest() could be called multiple times on the same hash object, which will cause a runtime error. Additionally, there is some code duplication in resumable-upload.ts for handling checksum headers in single-chunk and multi-chunk uploads, which could be refactored for better maintainability. Overall, the changes are well-structured and address the feature requirement, but the critical issue should be fixed before merging.

@thiyaguk09
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds checksum validation for client-side calculated CRC32C and MD5 hashes on the final upload request. The changes are logical and well-implemented, introducing new configuration options and updating the upload process to include the X-Goog-Hash header.

I've identified a critical issue in src/hash-stream-validator.ts that could lead to a runtime error due to a race condition when calculating the MD5 digest. I've also pointed out an opportunity for refactoring in src/resumable-upload.ts to reduce code duplication and improve maintainability.

Once these points are addressed, the PR should be in good shape.

@thiyaguk09
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces checksum validation for resumable uploads, a valuable addition for data integrity. The implementation is well-structured, and the new tests are thorough. However, I've identified a critical logic issue in how single-chunk uploads are handled when the content length is unknown, which could result in incorrect hash headers being sent. My review includes a suggested fix for this issue. With that addressed, the changes look solid.

Adds the 'Validation of Client Checksums Against Server Response' test
suite. Fixes test failures in client-provided hash scenarios by updating
mock responses to ensure server-reported checksums match the client's
expected values.
@thiyaguk09
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for ensuring data integrity by adding checksum validation for uploads. The changes allow for both client-calculated and pre-calculated CRC32C and MD5 hashes to be sent and validated against the server's response. The implementation is mostly solid, with good test coverage for the new functionality.

I've identified one critical bug in the logic for constructing the X-Goog-Hash header, which could lead to pre-calculated hashes being silently ignored in certain scenarios. I've provided a suggested fix for this. I also made a minor suggestion to refactor a small part of the code for improved clarity.

Once the critical issue is addressed, this will be a great addition to the library.

Refactors four duplicate test cases (CRC32C/MD5 success and failure)
into a single, parameterized test block within the 'Validation of Client
Checksums Against Server Response' suite.

This improves test clarity and reduces code duplication by dynamically
generating test scenarios for post-upload hash validation.
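The parameterized pattern this commit describes can be illustrated with a scenario table driving one shared assertion body. The PR's tests use a mocha `describe`/`it` loop; this standalone version (field names and values are illustrative) shows the same idea without the test framework.

```typescript
// Table of scenarios replacing four near-duplicate test cases: the shared
// body asserts the stream is destroyed exactly when hashes differ.
interface Scenario {
  name: string;
  clientHash: string;
  serverHash: string;
  expectDestroyed: boolean;
}

const scenarios: Scenario[] = [
  {name: 'CRC32C match', clientHash: 'AAAAAA==', serverHash: 'AAAAAA==', expectDestroyed: false},
  {name: 'CRC32C mismatch', clientHash: 'AAAAAA==', serverHash: 'BBBBBB==', expectDestroyed: true},
  {name: 'MD5 match', clientHash: 'okwtmQ==', serverHash: 'okwtmQ==', expectDestroyed: false},
  {name: 'MD5 mismatch', clientHash: 'okwtmQ==', serverHash: 'xxxxxx==', expectDestroyed: true},
];

// In mocha each entry would become one dynamically generated `it(...)`.
const results = scenarios.map(s => ({
  name: s.name,
  destroyed: s.clientHash !== s.serverHash,
  passed: (s.clientHash !== s.serverHash) === s.expectDestroyed,
}));
```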
@thiyaguk09
Owner Author

/gemini review

@thiyaguk09 thiyaguk09 marked this pull request as ready for review December 4, 2025 13:26

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for ensuring data integrity during uploads by adding client-side checksum validation for CRC32C and MD5. The implementation is well-structured, with clear separation of concerns for hash calculation and application. The test coverage is extensive and covers many scenarios, including single-chunk, multi-chunk, and pre-calculated hashes.

I've found one critical issue related to streaming uploads that could cause crashes, a high-severity issue where validation may not run for streams, and one medium-severity suggestion to improve code maintainability by reducing duplication. Please see the detailed comments.

@thiyaguk09
Owner Author

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request significantly enhances data integrity for object uploads by introducing comprehensive client-side checksum validation. It ensures that data uploaded through the JSON API path remains consistent by verifying both client-provided and client-calculated CRC32C and MD5 hashes against the server's reported hashes. This mechanism proactively detects and prevents silent data corruption, providing a more reliable upload experience.

Highlights

  • Checksum Validation: Implemented client-side validation for CRC32C and MD5 hashes during object uploads via the JSON API path.
  • Configurable Hashes: Allows users to provide pre-calculated clientCrc32c and clientMd5Hash or enable automatic client-side calculation (crc32c, md5).
  • Data Integrity: Prevents silent data corruption by comparing client-side hashes with server-reported hashes upon successful upload, destroying the stream on mismatch.
  • X-Goog-Hash Header: Automatically adds the X-Goog-Hash header to the final upload request, containing calculated or provided CRC32C and MD5 hashes.
  • Refactored Hash Handling: Improved the HashStreamValidator with a new md5Digest getter and refactored checksum application logic in resumable-upload.ts for clarity and correctness.
Changelog
  • src/hash-stream-validator.ts
    • Added a new md5Digest getter to provide the calculated MD5 hash.
    • Modified the _flush method to prevent re-calculating the MD5 digest if it has already been accessed via the getter, resolving a potential race condition.
  • src/resumable-upload.ts
    • Introduced new configuration options (clientCrc32c, clientMd5Hash, crc32c, md5) to control checksum behavior.
    • Integrated HashStreamValidator to calculate hashes during the upload stream.
    • Added a private helper method #validateChecksum to compare client and server hashes and handle mismatches.
    • Implemented a private helper method #applyChecksumHeaders to construct and attach the X-Goog-Hash header to the final upload request.
    • Ensured HashStreamValidator is finalized and checksum headers are applied correctly for both single and multi-chunk uploads.
    • Enhanced the responseHandler to perform checksum validation against server responses for successful uploads.
  • test/resumable-upload.ts
    • Expanded test coverage to include scenarios for X-Goog-Hash header injection, verifying correct application of CRC32C and MD5 hashes (both calculated and client-provided) in single and multi-chunk uploads.
    • Added tests to confirm that checksum mismatches between client and server correctly destroy the upload stream.
    • Added a new describe block for Validation of Client Checksums Against Server Response with various scenarios.
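The header construction the changelog attributes to `#applyChecksumHeaders` could be sketched as below. The real helper is private; `buildXGoogHash` is an illustrative standalone. The `crc32c=<base64>,md5=<base64>` value format matches Cloud Storage's documented `X-Goog-Hash` header syntax.

```typescript
// Build the X-Goog-Hash value from whichever hashes are available;
// callers pass calculated hashes first, falling back to client-provided
// ones, as described in the changelog.
function buildXGoogHash(hashes: {crc32c?: string; md5?: string}): string | undefined {
  const parts: string[] = [];
  if (hashes.crc32c) parts.push(`crc32c=${hashes.crc32c}`);
  if (hashes.md5) parts.push(`md5=${hashes.md5}`);
  return parts.length ? parts.join(',') : undefined;
}

const header = buildXGoogHash({crc32c: 'AAAAAA==', md5: 'okwtmQ=='});
// e.g. attached to the final request as headers['X-Goog-Hash'] = header
```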
Activity
  • gemini-code-assist[bot] provided an initial summary of the PR.
  • thiyaguk09 repeatedly requested reviews and summaries from gemini-code-assist[bot].
  • gemini-code-assist[bot] identified several critical and medium priority issues and provided specific code suggestions, many of which were subsequently addressed in the pull request:
  • A race condition in HashStreamValidator's md5Digest getter and _flush method was fixed by adding a check to prevent re-calculation.
  • Duplicated logic for applying X-Goog-Hash headers in resumable-upload.ts was refactored into a new private helper method, #applyChecksumHeaders.
  • The condition for determining final single-chunk uploads was corrected to typeof this.contentLength === 'number'.
  • The logic within #applyChecksumHeaders was improved to correctly handle both calculated and client-provided hashes.
  • The isSingleFinalUpload blocks were combined for better readability.
  • HashStreamValidator is now finalized before validation in responseHandler to ensure MD5 hashes are calculated.
  • A suggestion to simplify MD5 digest calculation in _flush by using the getter was noted, though the current implementation retains an explicit check.
  • A critical concern regarding a potential race condition when calling this.#hashValidator.end() for streaming single-chunk uploads (comment 2589079996) does not appear to be explicitly addressed in the provided diffs.
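The corrected finality check mentioned in the activity above can be sketched as follows. Only the `typeof contentLength === 'number'` part is taken from the PR; the surrounding condition (treating `chunkSize === 0` as chunking disabled) and the function name are assumptions for illustration.

```typescript
// Sketch: a single final upload only when chunking is disabled and the
// total length is known up front. contentLength may be a string sentinel
// when unknown, so a strict typeof check is required (the PR's fix).
function isSingleFinalUpload(
  contentLength: number | string, // assumption: string when length unknown
  chunkSize: number,              // assumption: 0 means chunking disabled
): boolean {
  return chunkSize === 0 && typeof contentLength === 'number';
}
```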

@thiyaguk09
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces client-side checksum validation for uploads, a great feature for ensuring data integrity. The implementation is comprehensive, adding new configuration options, hash calculation, and validation against server responses. The test coverage is extensive, with both unit and integration tests for various success and failure scenarios.

However, I've identified a critical issue in the handling of single-chunk uploads where the content is streamed in multiple parts. The hash is finalized prematurely, leading to incorrect checksums. I've provided a suggestion to fix this by buffering the content before hash calculation, aligning it with the logic for multi-chunk uploads.

@thiyaguk09
Owner Author

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request significantly enhances the data integrity of object uploads by implementing robust client-side checksum validation for both CRC32C and MD5 hashes. It introduces new configuration options for users to either provide pre-calculated hashes or enable automatic calculation, ensuring that data remains consistent throughout the upload process. Any detected mismatch between client and server hashes will proactively terminate the upload, preventing silent data corruption and improving overall reliability.

Highlights

  • Enhanced Data Integrity: Introduces comprehensive client-side checksum validation for both CRC32C and MD5 hashes during object uploads via the JSON API path, proactively preventing silent data corruption.
  • Configurable Checksum Options: New UploadConfig options (clientCrc32c, clientMd5Hash, crc32c, md5) allow users to provide pre-calculated hashes or enable automatic client-side calculation.
  • Checksum Mismatch Detection: The upload stream is immediately destroyed upon detection of a CRC32C or MD5 mismatch between client-calculated/provided hashes and server-reported hashes.
  • X-Goog-Hash Header Support: The X-Goog-Hash header is now included in the final upload request, containing the calculated or provided CRC32C and MD5 hashes for server-side verification.
  • Refactored Hash Handling: The HashStreamValidator was improved with an md5Digest getter, and checksum application logic in resumable-upload.ts was centralized into private helper methods for clarity and robustness.
  • Comprehensive Testing: Extensive unit and integration tests have been added to cover X-Goog-Hash header injection, checksum calculation, and mismatch detection for both single and multi-chunk uploads.
Changelog
  • src/hash-stream-validator.ts
    • Added a new md5Digest getter to provide the calculated MD5 hash.
    • Modified the _flush method to prevent re-calculating the MD5 digest if it has already been accessed via the getter, resolving a potential race condition.
  • src/resumable-upload.ts
    • Imported HashStreamValidator and FileExceptionMessages.
    • Added new UploadConfig properties: clientCrc32c, clientMd5Hash, crc32c, md5 for checksum control.
    • Introduced private class properties #hashValidator, #clientCrc32c, #clientMd5Hash.
    • Initialized #hashValidator in the constructor based on configuration.
    • Modified _write to pipe chunks through #hashValidator if enabled, destroying the stream on error.
    • Added private helper method #validateChecksum to compare client and server hashes and destroy the stream on mismatch.
    • Added private helper method #applyChecksumHeaders to construct and apply the X-Goog-Hash header to request options, prioritizing calculated hashes over client-provided ones.
    • Ensured HashStreamValidator.end() is called for the final chunk in multi-chunk uploads.
    • Applied X-Goog-Hash header only on the final chunk for multi-chunk uploads and for single-chunk uploads with known contentLength.
    • Updated responseHandler to finalize HashStreamValidator and perform checksum validation against server responses for successful uploads.
    • Added an uploadFinished emit for cases where shouldContinueUploadInAnotherRequest is true but the response is not successful (e.g., 308).
  • system-test/kitchen.ts
    • Added a new describe block for "Validation of Client Checksums Against Server Response (Integration)".
    • Included tests for successful upload with CRC32C calculation enabled.
    • Added tests for stream destruction on client-provided CRC32C checksum mismatch.
  • test/resumable-upload.ts
    • Imported UploadConfig and FileExceptionMessages.
    • Defined constants for correct/incorrect CRC32C and MD5 hashes for testing.
    • Added clientCrc32c and clientMd5Hash to default Upload configuration.
    • Added a new describe block for "X-Goog-Hash header injection" with helper functions and tests for single and multi-chunk uploads, covering calculated and client-provided hashes.
    • Added tests within #responseHandler describe block for destroying stream on CRC32C and MD5 checksum mismatches.
    • Added a new describe block for "Validation of Client Checksums Against Server Response" with various match/mismatch scenarios for CRC32C and MD5.
Activity
  • gemini-code-assist[bot] provided an initial summary of the PR.
  • thiyaguk09 repeatedly requested reviews and summaries from gemini-code-assist[bot].
  • Critical issues identified by the bot were addressed, including a race condition in HashStreamValidator's md5Digest getter and _flush method, refactoring duplicated X-Goog-Hash header logic, correcting the condition for determining final single-chunk uploads, improving logic within #applyChecksumHeaders for mixed hash types, combining isSingleFinalUpload blocks, and ensuring HashStreamValidator is finalized before validation in responseHandler.
  • A suggestion to simplify MD5 digest calculation in _flush by using the getter was noted, but the current implementation retains an explicit check.
  • A critical concern regarding a potential race condition when calling this.#hashValidator.end() for streaming single-chunk uploads (comment 2589079996) was raised by the bot, and thiyaguk09 responded that "Validation may be undefined if we apply this." This specific issue does not appear to be fully resolved in the provided diffs.
  • Another critical issue regarding buffering the entire content for single-chunk uploads with known contentLength before finalizing the hash was raised by the bot (comment 2597880034), but no direct resolution is visible in the provided diffs.

@thiyaguk09
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces client-side checksum validation for resumable uploads, supporting both CRC32C and MD5 hashes. It adds new UploadConfig options (clientCrc32c, clientMd5Hash, crc32c, md5) to allow users to provide pre-calculated hashes or enable client-side calculation. A new HashStreamValidator class is integrated into the Upload stream to compute hashes as data is written.

The X-Goog-Hash header is now conditionally applied to upload requests, containing either the calculated or provided hashes: on the final chunk for multi-chunk uploads, or on the single request for non-resumable uploads. After a successful upload response, the client-side calculated or provided hashes are compared against the server-returned hashes, and the upload stream is destroyed with an UPLOAD_MISMATCH error if a discrepancy is found.

New unit and system tests cover various scenarios, including successful hash validation, checksum mismatches, and correct X-Goog-Hash header injection for both single and multi-chunk uploads.

This commit introduces several stability fixes for the ResumableUpload
class:

1.  **Fixes Timeouts in Unit Tests:** Updates `makeRequestStream` mocks
to fully drain the request body stream, resolving stream consumption
deadlocks and timeouts in `#startUploading` unit tests.
2.  **Fixes Multi-Part Hang:** Correctly finalizes the `pipeline` for
partial chunks (`isPartialUpload=true`) by calling `pipelineCallback()`
immediately after successful chunk upload, preventing indefinite hangs
in multi-session tests.
3.  **Fixes Missing Checksum Header in Single-Chunk Uploads:** Applies the
`X-Goog-Hash` header unconditionally in single-chunk mode if a validator
is present, ensuring checksum validation is active even when
`contentLength` is unknown.
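The first fix above, draining the mocked request body, can be sketched with a plain `PassThrough`. `drainSync` is an illustrative helper; the real mocks replace `makeRequestStream` in test/resumable-upload.ts. The point is that the mock must consume the upload body, otherwise backpressure stalls the writer and the unit test times out.

```typescript
import {PassThrough} from 'node:stream';

// Consume everything currently buffered in the body stream so upstream
// writes resolve instead of deadlocking on backpressure.
function drainSync(body: PassThrough): Buffer {
  const chunks: Buffer[] = [];
  let chunk: Buffer | null;
  // read() pulls synchronously from the internal buffer until it is empty
  while ((chunk = body.read()) !== null) {
    chunks.push(chunk);
  }
  return Buffer.concat(chunks);
}

const body = new PassThrough();
body.end(Buffer.from('0123456789'));
const drained = drainSync(body);
```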
Simplify `HashStreamValidator._flush` by utilizing `md5Digest` getter.