feat: add parallel chunking example #525
base: main
Conversation
assafvayner left a comment:
Have you tested this for "correctness", i.e. that it produces the same chunks as the non-parallel chunker?

The most basic of tests: chunking the file https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv produces https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks

Ideally we would also create some special data that can trip this up. I think some of it exists in the chunker tests, but we would want to grow it to a larger size, e.g. the last 64 bytes of some chunk repeated over GBs, with different places where you set the initial boundaries.
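To make that suggestion concrete, the input for such a test could be generated roughly like the sketch below. This is illustrative only (pure std Rust, not code from this PR): the 64-byte tail is a placeholder for bytes actually observed at the end of a reference chunk, and the serial/parallel comparison itself is left as a comment.

```rust
// Illustrative generator for the adversarial input described above: a 64-byte
// chunk tail repeated over a large buffer, with a variable-length prefix so the
// parallel chunker's initial split points land at different offsets.
fn adversarial_buffer(chunk_tail: &[u8; 64], prefix_len: usize, total_len: usize) -> Vec<u8> {
    let mut data = Vec::with_capacity(total_len);
    // Low-entropy prefix; varying its length shifts every later boundary.
    data.extend(std::iter::repeat(0u8).take(prefix_len));
    // Repeat the tail until the buffer reaches the requested size.
    while data.len() < total_len {
        let take = (total_len - data.len()).min(64);
        data.extend_from_slice(&chunk_tail[..take]);
    }
    data
}

fn main() {
    // Placeholder tail; a real test would capture the last 64 bytes of a chunk
    // produced by the reference chunker (and the buffer would grow to GB scale).
    let tail = [0xABu8; 64];
    for prefix_len in [0usize, 1, 63, 64, 65, 4096] {
        let data = adversarial_buffer(&tail, prefix_len, 64 * 1024 * 1024);
        // Feed `data` to both the existing Chunker and the parallel chunker and
        // assert that the resulting (hash, length) sequences are identical.
        let _ = data;
    }
}
```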
@assafvayner thanks for taking a look! Yes, this has been compared against the existing chunk implementation and produces the same results. I also just tested the reference file from the Hub with both the `chunk` and `parallel-chunk` binaries. Finding adversarial examples is a great idea; I'll take a look at the existing chunker tests and update this PR. Also, if you have any existing examples of difficult-to-chunk files, please feel free to share!

Update: Please see the latest commit for 7 correctness tests that compare against the `Chunker`. I've attempted to add a wide range of inputs and specific patterns to verify edge cases. Please let me know if I should add more!

```
cargo test --package deduplication --features parallel-chunking parallel_chunking
```

output:

```
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.08s
Running unittests src/lib.rs (target/debug/deps/deduplication-2b8b660622f1a151)
running 10 tests
test parallel_chunking::tests::test_parallel_chunking_small_file ... ok
test parallel_chunking::tests::test_parallel_chunking_config ... ok
test parallel_chunking::tests::test_parallel_chunking_large_file ... ok
test parallel_chunking::tests::test_parallel_matches_reference ... ok
test parallel_chunking::tests::test_parallel_matches_reference_repeating_pattern ... ok
test parallel_chunking::tests::test_parallel_matches_reference_mixed_entropy ... ok
test parallel_chunking::tests::test_parallel_matches_reference_random ... ok
test parallel_chunking::tests::test_parallel_matches_reference_hash_window_pattern ... ok
test parallel_chunking::tests::test_parallel_matches_reference_asymmetric_density ... ok
test parallel_chunking::tests::test_parallel_matches_reference_boundary_sensitivity ... ok
test result: ok. 10 passed; 0 failed; 0 ignored; 0 measured; 10 filtered out; finished in 9.35s
```
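For reference, the check these tests perform reduces to asserting that both implementations emit identical (hash, length) sequences. A generic harness for that comparison might look like the following sketch; the closure-based signature and the 32-byte hash type are illustrative assumptions, not the PR's actual test code.

```rust
/// Illustrative comparison harness: `serial` and `parallel` are closures the
/// caller wraps around the existing Chunker and the new parallel path, each
/// returning the produced (hash, length) pairs. The sequences must match exactly.
fn assert_same_chunks<F, G>(data: &[u8], serial: F, parallel: G)
where
    F: Fn(&[u8]) -> Vec<([u8; 32], usize)>,
    G: Fn(&[u8]) -> Vec<([u8; 32], usize)>,
{
    let a = serial(data);
    let b = parallel(data);
    assert_eq!(a.len(), b.len(), "chunk counts differ: {} vs {}", a.len(), b.len());
    for (i, (x, y)) in a.iter().zip(b.iter()).enumerate() {
        assert_eq!(x, y, "hash or length mismatch at chunk {i}");
    }
}
```

Each of the input patterns listed above (repeating pattern, mixed entropy, random, and so on) would then just be a different way of filling `data` before calling such a check.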
This PR adds an experimental `parallel-chunk` example binary that is feature gated with a `parallel-chunking` flag. The changes include:

- adds `chunk_file_parallel` to `Chunker`
- adds `parallel_chunking.rs`

Run
- build the examples
- chunk a file
- the output should be identical to the original `chunk` binary output
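For orientation, here is a rough sketch of the shape such an example binary takes. This is not the PR's code: `parallel_chunks` is a hypothetical stand-in for the `chunk_file_parallel` call, and the real example reads its path from an `--input` flag rather than a positional argument.

```rust
// Hypothetical skeleton of a parallel-chunk example: read the whole input file,
// chunk it, and print one "<hex hash> <length>" line per chunk (the same format
// as the existing `chunk` example's output shown in the benchmarks below).
use std::io::Read;

// Stand-in for the PR's Chunker::chunk_file_parallel; signature assumed.
fn parallel_chunks(_data: &[u8]) -> Vec<([u8; 32], usize)> {
    Vec::new()
}

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: parallel-chunk <file>");
    let mut data = Vec::new();
    std::fs::File::open(&path)?.read_to_end(&mut data)?;

    for (hash, len) in parallel_chunks(&data) {
        let hex: String = hash.iter().map(|b| format!("{b:02x}")).collect();
        println!("{hex} {len}");
    }
    Ok(())
}
```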
Benching
Below is a small benchmark script that compares `chunk` and `parallel-chunk` runtimes on a generated 1GB input file using hyperfine.

```
./bench.sh 1  # the number indicates the random file gb size
```

```
Building examples...
   Compiling deduplication v0.14.5 (/Users/drbh/Projects/xet-core-tmp/deduplication)
   Compiling cas_object v0.1.0 (/Users/drbh/Projects/xet-core-tmp/cas_object)
   Compiling cas_client v0.14.5 (/Users/drbh/Projects/xet-core-tmp/cas_client)
   Compiling hub_client v0.1.0 (/Users/drbh/Projects/xet-core-tmp/hub_client)
   Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
    Finished `release` profile [optimized + debuginfo] target(s) in 10.88s
   Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
    Finished `release` profile [optimized + debuginfo] target(s) in 5.47s

Reference version... (last 10 lines)
fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004

Parallel version... (last 10 lines)
fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004

Running reference benchmarks...
Benchmark 1: target/release/examples/chunk --input /tmp/random_1.0gb.bin
  Time (mean ± σ):      1.169 s ±  0.017 s    [User: 1.015 s, System: 0.151 s]
  Range (min … max):    1.149 s …  1.205 s    10 runs

Running parallel benchmarks...
Benchmark 1: target/release/examples/parallel-chunk --input /tmp/random_1.0gb.bin
  Time (mean ± σ):     217.4 ms ±  10.4 ms    [User: 1317.2 ms, System: 264.6 ms]
  Range (min … max):   201.7 ms …  234.3 ms    14 runs
```

Run on a MacBook M3 Max.
As shown in the benchmarks above, the user time (roughly the compute time) is about the same in both cases, but the wall-clock time is more than 5x faster (1.169 s vs 217.4 ms, about 5.4x) when the work can be distributed across multiple threads.
PS: the chunk boundaries and hash values appear to exactly match the reference in all cases when testing locally; however, more tests to ensure correctness would increase confidence.
Opening this PR as a draft since it adds a new file, new dependencies, and a feature flag, which may need to be refactored or changed. Looking forward to feedback; please let me know what needs to be updated or changed to complete this PR.