* Update xet-core constant links, pin to version tag
The current URLs pointed to files/lines that have since been refactored elsewhere in the repo, in xet-core commit 5afd1eeef1b91d59d735a7d5d2c7712c1ad2ab46.
I believe I've traced them to the modern equivalents, though it's possible IDEAL_CAS_BLOCK_SIZE should really point to MAX_XORB_BYTES in deduplication/src/constants.rs
* Update developers.google.com link to avoid redirect
* Update from-chunks-to-blocks.md
* arg, fixing again
---------
Co-authored-by: Pedro Cuenca <[email protected]>
from-chunks-to-blocks.md (3 additions, 3 deletions)
@@ -24,11 +24,11 @@ On Hugging Face's [Xet team](https://huggingface.co/xet-team), we're bringing CD
Imagine uploading a 200GB repository to the Hub. Today, [there are a number of ways to do this](https://huggingface.co/docs/huggingface_hub/en/guides/upload), but all use a file-centric approach. To bring faster file transfers to the Hub, we've open-sourced [xet-core](https://github.com/huggingface/xet-core) and `hf_xet`, an integration with [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) which uses a chunk-based approach written in Rust.
-If you consider a 200GB repository with unique chunks, that's 3 million entries (at [~64KB per chunk](https://github.com/huggingface/xet-core/blob/main/merkledb/src/constants.rs#L5)) in the content-addressed store (CAS) backing all repositories. If a new version of a model is uploaded or a branch in the repository is created with different data, more unique chunks are added, driving up the entries in the CAS.
+If you consider a 200GB repository with unique chunks, that's 3 million entries (at [~64KB per chunk](https://github.com/huggingface/xet-core/blob/v1.1.7/deduplication/src/constants.rs#L4)) in the content-addressed store (CAS) backing all repositories. If a new version of a model is uploaded or a branch in the repository is created with different data, more unique chunks are added, driving up the entries in the CAS.
With nearly 45PB across 2 million model, dataset, and space repositories on the Hub, a purely chunk-based approach could incur **690 billion chunks**. Managing this volume of content using only chunks is simply not viable due to:
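A quick back-of-the-envelope check of the figures above (a sketch, not part of the PR; the ~64 KB target chunk size is the constant the updated links point at, and the 200 GB / 45 PB figures are quoted from the post):

```rust
// Sanity-check the entry counts quoted in the post, assuming a
// ~64 KB target chunk size (the constant the diff's links reference).
fn main() {
    const CHUNK_SIZE: u64 = 64 * 1024; // ~64 KB per chunk

    // A 200 GB repository of unique chunks -> ~3 million CAS entries.
    let repo_bytes: u64 = 200_000_000_000;
    let repo_chunks = repo_bytes / CHUNK_SIZE;
    println!("200 GB repo -> ~{:.1}M chunks", repo_chunks as f64 / 1e6);

    // ~45 PB across the Hub -> ~690 billion chunks if stored individually.
    let hub_bytes: u64 = 45_000_000_000_000_000;
    let hub_chunks = hub_bytes / CHUNK_SIZE;
    println!("45 PB hub -> ~{:.0}B chunks", hub_chunks as f64 / 1e9);
}
```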
-- **Network Overheads**: If each chunk is downloaded or uploaded individually, millions of requests are generated on each upload and download, overwhelming both client and server. Even [batching queries](https://developers.google.com/classroom/best-practices/batch) simply shifts the problem to the storage layer.
+- **Network Overheads**: If each chunk is downloaded or uploaded individually, millions of requests are generated on each upload and download, overwhelming both client and server. Even [batching queries](https://developers.google.com/workspace/classroom/best-practices/batch) simply shifts the problem to the storage layer.
- **Infrastructure Overheads**: A naive CAS that tracks chunks individually would require billions of entries, leading to steep monthly bills on services like [DynamoDB](https://aws.amazon.com/pm/dynamodb/) or [S3](https://aws.amazon.com/s3/). At Hugging Face’s scale, this quickly adds up.
In short, network requests balloon, databases struggle to manage the metadata, and the cost of orchestrating each chunk skyrockets, all while you wait for your files to transfer.
@@ -51,7 +51,7 @@ What does this mean? We scale with **aggregation.**
Aggregation takes chunks and groups them, referencing them intelligently in ways that provide clever (and practical) benefits:
-- **Blocks**: Instead of transferring and storing chunks, we bundle data together in blocks of [up to 64MB](https://github.com/huggingface/xet-core/blob/main/merkledb/src/constants.rs#L6) after deduplication. Blocks are still content-addressed, but this reduces CAS entries by a factor of 1,000.
+- **Blocks**: Instead of transferring and storing chunks, we bundle data together in blocks of [up to 64MB](https://github.com/huggingface/xet-core/blob/v1.1.7/cas_object/src/constants.rs#L12) after deduplication. Blocks are still content-addressed, but this reduces CAS entries by a factor of 1,000.
- **Shards**: Shards provide the mapping between files and chunks (referencing blocks as they do so). This allows us to identify which parts of a file have changed, referencing shards generated from past uploads. When chunks are already known to exist in the CAS, they’re skipped, slashing unnecessary transfers and queries.
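The aggregation step the **Blocks** bullet describes can be sketched roughly as follows. This is a simplified illustration, not xet-core's actual implementation; only the 64 MB cap comes from the linked constant, and `pack_into_blocks` is a hypothetical helper:

```rust
// Hypothetical sketch: pack deduplicated chunks into blocks capped at
// 64 MB, so the CAS stores one entry per block rather than per chunk.
const MAX_BLOCK_BYTES: usize = 64 * 1024 * 1024; // 64 MB cap, per the linked constant

fn pack_into_blocks(chunk_sizes: &[usize]) -> Vec<Vec<usize>> {
    let mut blocks: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_bytes = 0usize;

    for &size in chunk_sizes {
        // Start a new block once adding this chunk would exceed the cap.
        if current_bytes + size > MAX_BLOCK_BYTES && !current.is_empty() {
            blocks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(size);
        current_bytes += size;
    }
    if !current.is_empty() {
        blocks.push(current);
    }
    blocks
}

fn main() {
    // ~3 million 64 KB chunks collapse into ~3,000 blocks: roughly the
    // 1,000x reduction in CAS entries quoted above.
    let chunks = vec![64 * 1024; 3_000_000];
    let blocks = pack_into_blocks(&chunks);
    println!("{} chunks -> {} blocks", chunks.len(), blocks.len());
}
```

In practice each block would then be hashed and addressed by its content, just as individual chunks were, which is why the post notes that blocks remain content-addressed.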
Together, blocks and shards unlock significant benefits. However, when someone uploads a new file, how do we know if a chunk has been uploaded before so we can eliminate an unnecessary request? Performing a network query for every chunk is not scalable and goes against the “no 1:1” principle we mentioned above.