feat: add parallel chunking example #525
base: main
Conversation
assafvayner left a comment:
Have you tested this for "correctness", i.e. that it produces the same chunks as the non-parallel chunker?

The most basic of tests: chunking the file https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv produces https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks

Ideally we would also create some special data that can trip this up. I think some of it exists in the chunker tests, but we would want to grow it to a larger size, e.g. the last 64 bytes of some chunk repeated over GBs, with different places where you set the initial boundaries.
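To make that suggestion concrete, the input for such a test could be generated roughly like the sketch below. This is illustrative only (pure std Rust, not code from this PR): the 64-byte tail is a placeholder for bytes actually observed at the end of a reference chunk, and the serial/parallel comparison itself is left as a comment.

```rust
// Illustrative generator for the adversarial input described above: a 64-byte
// chunk tail repeated over a large buffer, with a variable-length prefix so the
// parallel chunker's initial split points land at different offsets.
fn adversarial_buffer(chunk_tail: &[u8; 64], prefix_len: usize, total_len: usize) -> Vec<u8> {
    let mut data = Vec::with_capacity(total_len);
    // Low-entropy prefix; varying its length shifts every later boundary.
    data.extend(std::iter::repeat(0u8).take(prefix_len));
    // Repeat the tail until the buffer reaches the requested size.
    while data.len() < total_len {
        let take = (total_len - data.len()).min(64);
        data.extend_from_slice(&chunk_tail[..take]);
    }
    data
}

fn main() {
    // Placeholder tail; a real test would capture the last 64 bytes of a chunk
    // produced by the reference chunker (and the buffer would grow to GB scale).
    let tail = [0xABu8; 64];
    for prefix_len in [0usize, 1, 63, 64, 65, 4096] {
        let data = adversarial_buffer(&tail, prefix_len, 64 * 1024 * 1024);
        // Feed `data` to both the existing Chunker and the parallel chunker and
        // assert that the resulting (hash, length) sequences are identical.
        let _ = data;
    }
}
```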
@assafvayner thanks for taking a look! Yes, this has been compared against the existing chunk implementation and produces the same results. I also just tested the reference file from the Hub with both the `chunk` and `parallel-chunk` binaries. Finding adversarial examples is a great idea; I'll take a look at the existing chunker tests and update this PR. Also, if you have any existing examples of difficult-to-chunk files, please feel free to share!

Update: Please see the latest commit for 7 correctness tests that compare against the `Chunker`. I've attempted to add a wide range of inputs and specific patterns to verify edge cases. Please let me know if I should add more!

```
cargo test --package deduplication --features parallel-chunking parallel_chunking
```

output:

```
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.08s
Running unittests src/lib.rs (target/debug/deps/deduplication-2b8b660622f1a151)
running 10 tests
test parallel_chunking::tests::test_parallel_chunking_small_file ... ok
test parallel_chunking::tests::test_parallel_chunking_config ... ok
test parallel_chunking::tests::test_parallel_chunking_large_file ... ok
test parallel_chunking::tests::test_parallel_matches_reference ... ok
test parallel_chunking::tests::test_parallel_matches_reference_repeating_pattern ... ok
test parallel_chunking::tests::test_parallel_matches_reference_mixed_entropy ... ok
test parallel_chunking::tests::test_parallel_matches_reference_random ... ok
test parallel_chunking::tests::test_parallel_matches_reference_hash_window_pattern ... ok
test parallel_chunking::tests::test_parallel_matches_reference_asymmetric_density ... ok
test parallel_chunking::tests::test_parallel_matches_reference_boundary_sensitivity ... ok
test result: ok. 10 passed; 0 failed; 0 ignored; 0 measured; 10 filtered out; finished in 9.35s
```
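For reference, the check these tests perform reduces to asserting that both implementations emit identical (hash, length) sequences. A generic harness for that comparison might look like the following sketch; the closure-based signature and the 32-byte hash type are illustrative assumptions, not the PR's actual test code.

```rust
/// Illustrative comparison harness: `serial` and `parallel` are closures the
/// caller wraps around the existing Chunker and the new parallel path, each
/// returning the produced (hash, length) pairs. The sequences must match exactly.
fn assert_same_chunks<F, G>(data: &[u8], serial: F, parallel: G)
where
    F: Fn(&[u8]) -> Vec<([u8; 32], usize)>,
    G: Fn(&[u8]) -> Vec<([u8; 32], usize)>,
{
    let a = serial(data);
    let b = parallel(data);
    assert_eq!(a.len(), b.len(), "chunk counts differ: {} vs {}", a.len(), b.len());
    for (i, (x, y)) in a.iter().zip(b.iter()).enumerate() {
        assert_eq!(x, y, "hash or length mismatch at chunk {i}");
    }
}
```

Each of the input patterns listed above (repeating pattern, mixed entropy, random, and so on) would then just be a different way of filling `data` before calling such a check.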
This PR adds an experimental `parallel-chunk` example binary that is feature gated with a `parallel-chunking` flag. The changes include:

- adds `chunk_file_parallel` to `Chunker`
- adds `parallel_chunking.rs`

Run
- build the examples
- chunk a file
- the output should be identical to the original `chunk` binary output
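For orientation, here is a rough sketch of the shape such an example binary takes. This is not the PR's code: `parallel_chunks` is a hypothetical stand-in for the `chunk_file_parallel` call, and the real example reads its path from an `--input` flag rather than a positional argument.

```rust
// Hypothetical skeleton of a parallel-chunk example: read the whole input file,
// chunk it, and print one "<hex hash> <length>" line per chunk (the same format
// as the existing `chunk` example's output shown in the benchmarks below).
use std::io::Read;

// Stand-in for the PR's Chunker::chunk_file_parallel; signature assumed.
fn parallel_chunks(_data: &[u8]) -> Vec<([u8; 32], usize)> {
    Vec::new()
}

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: parallel-chunk <file>");
    let mut data = Vec::new();
    std::fs::File::open(&path)?.read_to_end(&mut data)?;

    for (hash, len) in parallel_chunks(&data) {
        let hex: String = hash.iter().map(|b| format!("{b:02x}")).collect();
        println!("{hex} {len}");
    }
    Ok(())
}
```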
Benching
Below is a small benchmark script that compares `chunk` and `parallel-chunk` runtimes on a generated 1GB input file using hyperfine.

```
./bench.sh 1  # the number indicates the random file gb size
```

```
Building examples...
   Compiling deduplication v0.14.5 (/Users/drbh/Projects/xet-core-tmp/deduplication)
   Compiling cas_object v0.1.0 (/Users/drbh/Projects/xet-core-tmp/cas_object)
   Compiling cas_client v0.14.5 (/Users/drbh/Projects/xet-core-tmp/cas_client)
   Compiling hub_client v0.1.0 (/Users/drbh/Projects/xet-core-tmp/hub_client)
   Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
    Finished `release` profile [optimized + debuginfo] target(s) in 10.88s
   Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
    Finished `release` profile [optimized + debuginfo] target(s) in 5.47s

Reference version... (last 10 lines)
fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004

Parallel version... (last 10 lines)
fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004

Running reference benchmarks...
Benchmark 1: target/release/examples/chunk --input /tmp/random_1.0gb.bin
  Time (mean ± σ):      1.169 s ±  0.017 s    [User: 1.015 s, System: 0.151 s]
  Range (min … max):    1.149 s …  1.205 s    10 runs

Running parallel benchmarks...
Benchmark 1: target/release/examples/parallel-chunk --input /tmp/random_1.0gb.bin
  Time (mean ± σ):     217.4 ms ±  10.4 ms    [User: 1317.2 ms, System: 264.6 ms]
  Range (min … max):   201.7 ms …  234.3 ms    14 runs
```

Run on a MacBook M3 Max.
As shown in the benchmarks above, the user time (roughly the compute time) is about the same in both cases, but the wall-clock time is more than 5x faster (1.169 s vs 217.4 ms, about 5.4x) when the work can be distributed across multiple threads.
PS: the chunk boundaries and hash values appear to exactly match the reference in all cases when testing locally; however, more tests to ensure correctness would increase confidence.
Opening this PR as a draft since it adds a new file, new dependencies, and a feature flag, which may need to be refactored or changed. Looking forward to feedback; please let me know what needs to be updated or changed to complete this PR.