Skip to content

Commit e8ee3fd

Browse files
jsulzjulien-chanouticelina
authored
Apply suggestions from code review
Co-authored-by: Julien Chaumond <[email protected]> Co-authored-by: Célina <[email protected]>
1 parent c5c8248 commit e8ee3fd

File tree

3 files changed

+6
-6
lines changed

3 files changed

+6
-6
lines changed

docs/hub/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ The Hugging Face Hub is a platform with over 900k models, 200k datasets, and 300
1414
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./notifications">Notifications</a>
1515
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./collections">Collections</a>
1616
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./webhooks">Webhooks</a>
17-
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./repositories-storage">Storage Backends</a>
17+
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./storage-backends">Storage Backends</a>
1818
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./repositories-next-steps">Next Steps</a>
1919
<a class="transform no-underline! transition-colors hover:translate-x-px hover:text-gray-700" href="./repositories-licenses">Licenses</a>
2020
</div>

docs/hub/repositories-storage.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ Meanwhile, a Git LFS pointer file provide metadata to locate the actual file con
3030
- **Pointer size**: The size of the pointer file stored in the Git repository.
3131
- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations.
3232

33-
A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage.
33+
A Xet pointer includes all of this information by design. Refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs) with the addition of a `Xet backed hash` field for referencing the file in Xet storage.
3434

3535
<div class="flex justify-center">
3636
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/pointer-file-light.png"/>
@@ -91,19 +91,19 @@ While Xet brings fine-grained deduplication and enhanced performance to Git-base
9191

9292
### Deduplication
9393

94-
Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-enabled client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded.
94+
Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded.
9595

9696
To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash.
9797

98-
The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.
98+
The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 20GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.
9999

100100
For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face.
101101

102102
### Backward Compatibility with LFS
103103

104104
Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification.
105105

106-
This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content.
106+
This symmetry allows non-Xet-aware clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content.
107107

108108
While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version.
109109

docs/hub/repositories.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,4 +16,4 @@ In these pages, you will go over the basics of getting started with Git and inte
1616
- [Repository storage limits](./storage-limits)
1717
- [Next Steps](./repositories-next-steps)
1818
- [Licenses](./repositories-licenses)
19-
- [Storage](./repositories-storage)
19+
- [Storage Backends](./storage-backends)

0 commit comments

Comments
 (0)