@BiteTheDDDDt
Contributor

@BiteTheDDDDt BiteTheDDDDt commented Dec 15, 2025

What problem does this PR solve?

add Crc32CHashPartitioner
[image]
This pull request refactors the codebase to standardize the usage of the CRC32C checksum library by replacing the custom util/crc32c.h header and its functions with the upstream crc32c library (<crc32c/crc32c.h>) and its API. It also updates function calls to use the correct data types expected by the new library and ensures consistent checksum calculation across multiple modules related to file I/O, compression, and storage.

Migration to Upstream CRC32C Library

  • Replaced all includes of "util/crc32c.h" with <crc32c/crc32c.h> and removed the custom header from all relevant files. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
  • Updated all function calls from crc32c::Value(...) to crc32c::Crc32c(...) for computing CRC32C checksums. [1] [2] [3] [4] [5] [6] [7] [8] [9]
  • Updated all function calls from crc32c::Extend(...) to use the new function signature, casting data pointers to const uint8_t* as required by the upstream library. [1] [2] [3] [4] [5]
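The call-site change described in the bullets above can be sketched roughly as follows. The `Extend` and `Crc32c` signatures mirror the ones declared in the upstream `<crc32c/crc32c.h>` header, but the `crc32c_sketch` namespace is a bitwise stand-in written here only so the example builds without the library, and `checksum_of` is a hypothetical caller, not a Doris function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the upstream library so this sketch compiles on its own.
// Real code would include <crc32c/crc32c.h> and call crc32c::Crc32c / crc32c::Extend.
namespace crc32c_sketch {

// Extends a finalized CRC32C value with more bytes
// (Castagnoli polynomial in reflected form, processed bit by bit).
inline uint32_t Extend(uint32_t crc, const uint8_t* data, size_t count) {
    assert(data != nullptr || count == 0);
    crc = ~crc;
    for (size_t i = 0; i < count; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// One-shot checksum: same as extending an initial value of 0.
inline uint32_t Crc32c(const uint8_t* data, size_t count) {
    return Extend(0, data, count);
}

}  // namespace crc32c_sketch

// Hypothetical caller. Before the migration a call like this presumably went
// through crc32c::Value(...) with a char pointer; the byte-pointer overload
// now needs an explicit cast.
inline uint32_t checksum_of(const std::string& buf) {
    return crc32c_sketch::Crc32c(reinterpret_cast<const uint8_t*>(buf.data()), buf.size());
}
```

With this stand-in, `checksum_of("123456789")` yields `0xE3069283`, the standard CRC32C check value, which is a convenient sanity test when swapping implementations.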

Checksum Calculation Logic

  • Modified checksum calculation for multi-slice data by iteratively using crc32c::Extend over each slice, ensuring correct cumulative checksum computation.
  • Updated checksum verification logic to use the new API and data types, improving reliability and consistency across modules. [1] [2] [3] [4] [5] [6] [7] [8] [9]
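A minimal sketch of the multi-slice case, assuming slices are plain byte strings: each slice is folded into a running value with an `Extend`-style call, so the result matches the checksum of the concatenated bytes. `extend_crc32c` is a bitwise stand-in with the same parameter order as the upstream `crc32c::Extend`, and `checksum_slices` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in with the same parameter order as upstream crc32c::Extend,
// included only so the sketch is self-contained.
inline uint32_t extend_crc32c(uint32_t crc, const uint8_t* data, size_t count) {
    assert(data != nullptr || count == 0);
    crc = ~crc;
    for (size_t i = 0; i < count; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical helper: folds every slice into one cumulative checksum.
// Starting from 0 and extending slice by slice gives the same value as
// checksumming all bytes in a single call.
inline uint32_t checksum_slices(const std::vector<std::string>& slices) {
    uint32_t crc = 0;
    for (const std::string& slice : slices) {
        crc = extend_crc32c(crc, reinterpret_cast<const uint8_t*>(slice.data()), slice.size());
    }
    return crc;
}
```

Empty slices leave the running value unchanged, so callers do not need to filter them out first.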

Code Clean-up and Consistency

  • Removed all redundant or obsolete includes of the custom crc32c.h header. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
  • Ensured all modules that require CRC32C now directly depend on the upstream library, reducing maintenance overhead and potential for bugs. (all above references)

These changes collectively improve code maintainability, reliability, and alignment with upstream best practices for CRC32C checksum operations.
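The PR title names a `Crc32CHashPartitioner`, but the description does not show it. Purely as an illustration of the general idea (hash the partition key with CRC32C, then map the hash to a destination), here is a hypothetical sketch. `Crc32cPartitionerSketch`, `partition_of`, and `extend_crc32c` are names invented for this example and are not the Doris implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC32C stand-in, included only to keep the sketch self-contained.
inline uint32_t extend_crc32c(uint32_t crc, const uint8_t* data, size_t count) {
    assert(data != nullptr || count == 0);
    crc = ~crc;
    for (size_t i = 0; i < count; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical partitioner: chains CRC32C over a row's key columns and maps
// the result to one of `partition_count` destinations by modulo.
struct Crc32cPartitionerSketch {
    uint32_t partition_count;

    uint32_t partition_of(const std::vector<std::string>& key_columns) const {
        assert(partition_count > 0);
        uint32_t hash = 0;
        for (const std::string& column : key_columns) {
            hash = extend_crc32c(hash, reinterpret_cast<const uint8_t*>(column.data()),
                                 column.size());
        }
        return hash % partition_count;
    }
};
```

Rows with equal keys always land in the same partition, which is the property a shuffle needs. Note that this simple chaining does not separate column boundaries, so a real partitioner would likely hash typed column values rather than concatenated strings.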

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Dec 15, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (77/77) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.23% (25316/34572)
Line Coverage 60.45% (268550/444243)
Region Coverage 56.02% (226136/403667)
Branch Coverage 57.30% (96170/167842)

@BiteTheDDDDt
Contributor Author

run buildall

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 91.74% (100/109) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.24% (24968/34563)
Line Coverage 58.99% (262096/444273)
Region Coverage 53.80% (217554/404357)
Branch Coverage 55.34% (93242/168476)

@BiteTheDDDDt
Contributor Author

run buildall

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.82% (1752/2195)
Line Coverage 65.74% (30773/46809)
Region Coverage 66.52% (15374/23112)
Branch Coverage 56.87% (8174/14374)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@BiteTheDDDDt BiteTheDDDDt changed the title from "add Crc32CHashPartitioner" to "[Improvement](shuffle) add Crc32CHashPartitioner" Dec 18, 2025
@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.00% (138/150) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.26% (24998/34595)
Line Coverage 58.98% (262270/444713)
Region Coverage 53.81% (217861/404906)
Branch Coverage 55.33% (93357/168720)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.72% (1757/2204)
Line Coverage 65.36% (30885/47256)
Region Coverage 66.04% (15410/23333)
Branch Coverage 56.50% (8187/14490)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 11.59% (40/345) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.41% (18917/35417)
Line Coverage 39.29% (175363/446370)
Region Coverage 33.78% (135446/400953)
Branch Coverage 34.74% (58479/168319)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.32% (284/345) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.88% (25577/34619)
Line Coverage 61.27% (272691/445034)
Region Coverage 56.16% (227508/405097)
Branch Coverage 58.08% (98082/168877)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.67% (1760/2209)
Line Coverage 65.35% (30941/47346)
Region Coverage 66.03% (15430/23368)
Branch Coverage 56.45% (8198/14522)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.32% (284/345) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.88% (25577/34619)
Line Coverage 61.28% (272696/445034)
Region Coverage 56.17% (227525/405097)
Branch Coverage 58.08% (98081/168877)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.32% (284/345) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.31% (25033/34619)
Line Coverage 59.06% (262828/445034)
Region Coverage 53.80% (217950/405097)
Branch Coverage 55.46% (93653/168877)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.32% (284/345) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.32% (25036/34619)
Line Coverage 59.07% (262895/445034)
Region Coverage 53.84% (218099/405097)
Branch Coverage 55.48% (93693/168877)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

Contributor

@HappenLee HappenLee left a comment

LGTM

@github-actions github-actions bot added the "approved" label (indicates a PR has been approved by one committer) Dec 22, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@zclllyybb zclllyybb left a comment

LGTM

@BiteTheDDDDt BiteTheDDDDt merged commit 72a6c12 into apache:master Dec 22, 2025
28 of 30 checks passed
BiteTheDDDDt added a commit to BiteTheDDDDt/incubator-doris that referenced this pull request Dec 30, 2025
yiguolei pushed a commit that referenced this pull request Dec 31, 2025
BiteTheDDDDt added a commit that referenced this pull request Jan 8, 2026
BiteTheDDDDt added a commit to BiteTheDDDDt/incubator-doris that referenced this pull request Jan 8, 2026
This reverts commit 5c69a29.

Revert "[Chore](thirdparty) add crc32c-1.1.2 to thirdparty (apache#58462)"

This reverts commit 066b69e.

[Chore](thirdparty) add crc32c-1.1.2 to thirdparty (apache#58462)

Doris currently uses the crc32c implementation taken from RocksDB, but it
performs worse than google/crc32c.

66663538 rows int
crc32c-rocksdb 684.879ms
crc32c-google 206.360ms

66663538 rows varchar
crc32c-rocksdb 1368ms
crc32c-google 391.290ms

This pull request adds support for the `crc32c` third-party dependency
to the build environment. The changes include updating the changelog,
adding build logic, and configuring the necessary variables to download
and build `crc32c`.

**Third-party dependency integration:**

* Added `crc32c-1.1.2` to the list of third-party dependencies in the
changelog (`thirdparty/CHANGELOG.md`).
* Added `crc32c` to the default package build list in
`build-thirdparty.sh` to ensure it is built by default.
* Implemented the `build_crc32c()` function in `build-thirdparty.sh` to
handle the build and installation process for `crc32c`.

**Build configuration updates:**

* Defined download URL, archive name, source directory, and MD5 checksum
for `crc32c` in `vars.sh`.
* Added `CRC32C` to the `TP_ARCHIVES` array in `vars.sh` so it is
included in the set of managed third-party archives.

[Chore](hash) use google/crc32c to instead rocksdb/crc32c and crc_hash (apache#58557)

Doris currently uses the crc32c implementation taken from RocksDB, but it
performs worse than google/crc32c.

66663538 rows int
crc32c-rocksdb 684.879ms
crc32c-google 206.360ms

66663538 rows varchar
crc32c-rocksdb 1368ms
crc32c-google 391.290ms

We already have unit tests for rocksdb/crc32c
([be/test/util/crc32c_test.cpp](https://github.com/apache/doris/blob/master/be/test/util/crc32c_test.cpp)),
so they can be used to check that this change preserves the checksum values.

This pull request updates the codebase to use the more efficient and
modern CRC32C hashing algorithm in place of the older CRC32
implementation. The changes include switching hash functions throughout
the code, updating the CRC32C utility implementation to use an external
library, and adding the required third-party dependency. This improves
hash performance and consistency, and prepares the codebase for future
compatibility.

**Hashing algorithm migration:**

* Replaced all usages of `HashUtil::crc_hash` with
`HashUtil::crc32c_hash` in `block_bloom_filter.hpp`,
`column_dictionary.h`, and `function_string.h` to utilize CRC32C for
better performance and reliability.
[[1]](diffhunk://#diff-635476edd1321096d1d32eb6453bed4624e8f23d0580750d515aaad9dfe5404eL79-R79)
[[2]](diffhunk://#diff-635476edd1321096d1d32eb6453bed4624e8f23d0580750d515aaad9dfe5404eL108-R108)
[[3]](diffhunk://#diff-bf8bb38b6a6eae6cccd7ed62ff64b1a77fbd273a614348b096330abea8331b4dL348-R348)
[[4]](diffhunk://#diff-9cc694af32a330f9ffd947df039bdfc12be67b2107c9e612d7861b17c5018176L4601-R4601)

* Added the new `crc32c_hash` method to `HashUtil` and marked the old
`crc_hash` as deprecated, retaining it only for backward compatibility
with historical data.
[[1]](diffhunk://#diff-92d951e58f5e0b824254f5eb0d931b604518e4bfbe666b665cd56ed9435667bbL52-R58)
[[2]](diffhunk://#diff-92d951e58f5e0b824254f5eb0d931b604518e4bfbe666b665cd56ed9435667bbR68-R69)
[[3]](diffhunk://#diff-92d951e58f5e0b824254f5eb0d931b604518e4bfbe666b665cd56ed9435667bbL120-L124)
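The helper pair described above might look roughly like the sketch below. This is an assumption about the shape only: `HashUtilSketch`, the parameter lists, and the `extend_crc32c` stand-in are invented for illustration and are not the actual `HashUtil` code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC32C stand-in with the same parameter order as upstream
// crc32c::Extend, included only so the sketch is self-contained.
inline uint32_t extend_crc32c(uint32_t crc, const uint8_t* data, size_t count) {
    assert(data != nullptr || count == 0);
    crc = ~crc;
    for (size_t i = 0; i < count; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical shape of the helper pair; names and signatures are assumptions.
struct HashUtilSketch {
    // New path: seedable CRC32C hash that delegates to the Extend-style call.
    static uint32_t crc32c_hash(const void* data, size_t bytes, uint32_t seed) {
        return extend_crc32c(seed, static_cast<const uint8_t*>(data), bytes);
    }

    // Old path: declared only, to show the deprecation marker. It stays
    // available so values derived from historical data remain reproducible.
    [[deprecated("use crc32c_hash for new code")]]
    static uint32_t crc_hash(const void* data, size_t bytes, uint32_t seed);
};
```

Marking the old function with `[[deprecated]]` makes any remaining call sites show up as compiler warnings, which helps confine it to the backward-compatibility paths.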

**CRC32C utility refactor and dependency management:**

* Refactored `crc32c.cpp` and `crc32c.h` to use the external `crc32c`
library, removing the previous custom implementation and lookup tables.
Added new utility functions for CRC32C operations.
[[1]](diffhunk://#diff-1a21d70259827997bdfd54da21acd6db2ae0a29465873b53dbf8c7e9c6a7e265L18-R38)
[[2]](diffhunk://#diff-72d5c6ec3fe2da095fe1413472778c1d56027242035bdb83c62339ccfcca6ed6L18-R33)

* Added the `crc32c` third-party dependency in the build configuration
to support the new CRC32C utility.

**Build and header updates:**

* Updated includes in `hash_util.hpp` to reference the new CRC32C
utility.

[Improvement](shuffle) add Crc32CHashPartitioner (apache#59052)

Labels

approved (indicates a PR has been approved by one committer), dev/4.0.3-merged, reviewed

7 participants