WIP: Initial Xet docs (incomplete) #1622
# Storage

Repositories on the Hugging Face Hub are unlike those on typical software development platforms. They contain files that are:

- Large - in the range of GB and above
- Binary - not in a human-readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet))

While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code.

Storing these files directly in a Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need.

Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient.

Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS.

## Xet

[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub.

Like Git LFS, a Xet-backed repository uses S3 as the remote storage, with a `.gitattributes` file at the repository root identifying which files should be stored remotely.

<div class="flex justify-center">
    <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gitattributes-light.png"/>
    <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gitattributes-dark.png"/>
</div>

Meanwhile, the pointer files provide metadata to locate the actual file contents in remote storage:

- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents.
- **Pointer size**: The size of the pointer file stored in the Git repository.
- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations.

A Xet pointer includes all of this information by design (refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs)), with the addition of a `Xet backed hash` field for referencing the file in Xet storage.

<div class="flex justify-center">
    <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/pointer-file-light.png"/>
    <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/pointer-file-dark.png"/>
</div>

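Concretely, a pointer file following the Git LFS specification is a small text file like the one below (the hash and size here are illustrative placeholders); a Xet-backed file's pointer carries these same fields, with the Xet hash surfaced as additional metadata:

```
version https://git-lfs.github.com/spec/v1
oid sha256:<sha256-of-the-file-contents>
size 1048576
```
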
Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly reducing network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for you and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below.

### Deduplication

Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of bytes, using variable-sized chunks of roughly 64KB. Chunk boundaries are determined by a rolling hash over the actual file contents, making chunking resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks; only chunks not already present in Xet storage are kept, and everything else is discarded.

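To make the mechanics concrete, here is a minimal, hypothetical sketch of content-defined chunking and chunk-level deduplication. The rolling hash and the tiny chunk sizes are simplifications for illustration only; the production implementation lives in `hf_xet`/`xet-core` and targets ~64KB chunks:

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0xFF, min_size: int = 64, max_size: int = 1024):
    """Toy content-defined chunker: cut a chunk when the low bits of a
    rolling hash match `mask`, so boundaries depend on content rather
    than position. Sizes are kept tiny to make the demo fast."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF  # simplistic rolling hash
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def upload(data: bytes, store: set) -> int:
    """Chunk `data` and keep only unseen chunks (keyed by SHA-256).
    Returns how many new chunks actually had to be 'uploaded'."""
    new = 0
    for chunk in cdc_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store.add(key)
            new += 1
    return new
```

Uploading the same bytes twice adds nothing the second time, and inserting a few bytes into the middle of a file re-uploads only the chunks around the edit, because content-defined boundaries downstream of the change re-synchronize with the previous version.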
To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash.

The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.

For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face.

### Architecture Overview

Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in between). There are four primary components to the Xet architecture:

1. [Client](#client)
2. [Hugging Face Hub](#hugging-face-hub)
3. [Content addressed store (CAS)](#content-addressed-store-cas)
4. [Amazon S3](#aws-s3)

![IMAGE OF XET ARCHITECTURE]

#### Client

The client represents whatever machine is uploading or downloading a file. Current support is limited to [the Python package, `hf_xet`](https://pypi.org/project/hf-xet/), which provides an integration between `huggingface_hub` and Xet-backed repositories.

When uploading files to the Hub, `hf_xet` breaks the files into immutable content-defined chunks and deduplicates, ignoring previously seen chunks and only uploading new ones.

On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks.

#### Hugging Face Hub

The Hub backend manages the Git repository, authentication & authorization, and metadata about both the files and repository. The Hub communicates with the client and CAS.

#### Content Addressed Store (CAS)

The content addressed store (CAS) is more than just a store - it is a set of services exposing APIs that support uploading and downloading Xet-backed files, with a key-value store (DynamoDB) mapping hashed content and metadata to its location in S3.

The primary APIs are used for:

1. Uploading blocks: Verifies the contents of the uploaded blocks, then writes them to the appropriate S3 bucket.
2. Uploading shards: Verifies the contents of the uploaded shards, writes them to the appropriate S3 bucket, and registers the shards in CAS.
3. Downloading file reconstruction information: Given the `Xet backed hash` field from a pointer file, assembles the manifest necessary to rebuild the file and returns it to the client, which downloads the relevant blocks directly from S3 using presigned URLs.
4. Checking storage location: Given the `LFS SHA256 hash`, returns whether Xet or Git LFS manages the content. This is a critical part of migration and compatibility with the legacy Git LFS storage system.
5. Git LFS Bridge: Allows repositories using Xet storage to be accessed by legacy, non-Xet-aware clients. The bridge mimics a Git LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL, so you can use tools like `curl` or the web interface of the Hub to download files.

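The reconstruction flow (item 3 above) can be sketched with a toy in-memory stand-in. All type and method names below are hypothetical; the real services resolve hashes to S3 locations through a key-value store and hand back presigned URLs rather than raw bytes:

```python
import hashlib

class InMemoryCAS:
    """Toy stand-in for the content-addressed store: blocks are keyed by
    their SHA-256, and a manifest maps a file hash to the (block, range)
    terms needed to rebuild that file."""

    def __init__(self):
        self.blocks = {}     # block hash -> block bytes
        self.manifests = {}  # file hash  -> [(block hash, start, end), ...]

    def upload_block(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        self.blocks[key] = block  # stored once, keyed by content
        return key

    def register_file(self, file_hash: str, terms) -> None:
        self.manifests[file_hash] = terms

    def reconstruction(self, file_hash: str):
        """What a client fetches on download: the file's manifest."""
        return self.manifests[file_hash]

def download(cas: InMemoryCAS, file_hash: str) -> bytes:
    """Rebuild a file by fetching each referenced byte range of each block."""
    parts = [cas.blocks[h][start:end]
             for h, start, end in cas.reconstruction(file_hash)]
    return b"".join(parts)
```

Note how a single stored block can serve many files: any manifest may reference any byte range of any block, which is what makes cross-file and cross-repository deduplication possible.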
#### AWS S3

S3 stores the blocks and shards. It provides resiliency, availability, and fast access, leveraging [CloudFront](https://aws.amazon.com/cloudfront/) as a CDN.

#### Upload Sequence Diagram

<div class="flex justify-center">
    <img class="block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/writes.png"/>
</div>

#### Download Sequence Diagram

<div class="flex justify-center">
    <img class="block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/reads.png"/>
</div>

### Backward Compatibility with LFS

Xet storage provides a seamless transition for existing Hub repositories; it isn't necessary to know whether the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. This means existing and newly created repos will look no different if you do a bare clone of them: each large (or binary) file continues to have a pointer file that matches the Git LFS pointer file specification.

This symmetry allows non-Xet-enabled clients (e.g., older versions of `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, a mixture of Git LFS and Xet-backed files is supported within a single repository. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content.

On download, a Xet-aware client receives file reconstruction information from CAS to rebuild a Xet-backed file locally, while a legacy client gets an S3 URL from the Git LFS bridge. On upload, a Xet-aware client updating a Xet-backed file runs CDC deduplication and uploads through CAS, while a non-Xet-aware client uploads through Git LFS and a background process converts the file revision to a Xet-backed version.

### Security Model

### Using Xet Storage

The simplest way to start using Xet Storage is to install the `hf_xet` Python package alongside `huggingface_hub`:

```bash
pip install huggingface_hub[hf_xet]
```

If you use the `transformers` or `datasets` libraries instead of making requests through `huggingface_hub`, then simply install `hf_xet` directly:

```bash
pip install hf-xet
```

If your Python environment has a `hf_xet`-aware version of `huggingface_hub`, then your uploads and downloads will automatically use Xet.

That's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using older `huggingface_hub` versions will still be able to upload and download repositories through the backwards compatibility provided by the LFS bridge.

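If you want to confirm which path your environment will take, one minimal (assumed, standard-library-only) check is simply whether `hf_xet` is importable:

```python
import importlib.util

def xet_available() -> bool:
    """True if the `hf_xet` package is installed in this environment,
    which is what lets `huggingface_hub` use Xet-backed transfers."""
    return importlib.util.find_spec("hf_xet") is not None

if __name__ == "__main__":
    print("hf_xet installed:", xet_available())
```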
#### Recommendations

Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage:

- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
- **Be specific in `.gitattributes`**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.

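For instance, an illustrative `.gitattributes` snippet that routes only weight files through large-file storage uses the same attribute settings the Hub writes for tracked extensions:

```
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
```
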
#### Current Limitations

While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository:

- **64-bit systems only**: The `hf_xet` client currently requires a 64-bit architecture; 32-bit systems are not supported.
- **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-backed repositories; additional coverage is planned in future releases.
- **Full web support currently unavailable**: Full support for chunked uploads via the Hub web interface remains under development.

## Legacy Storage: Git LFS

The legacy storage system on the Hub, Git LFS, utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3, using the SHA-256 hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store files for millions of Model, Dataset, and Space repositories (45PB total as of this writing).

The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, no matter how large or small, causes the entire file to be versioned again - incurring significant transfer overhead, since the whole file must be uploaded (when committing to a repository) or downloaded (when pulling the latest version to your machine). This leads to a worse developer experience along with a proliferation of redundant storage.
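As a hypothetical back-of-the-envelope comparison (illustrative numbers only, ignoring chunking overhead and metadata), consider ten revisions of a 20GB file where each revision changes about 1% of its content:

```python
def lfs_storage_gb(file_gb: float, revisions: int) -> float:
    """Git LFS stores the complete file for every revision."""
    return file_gb * revisions

def chunked_storage_gb(file_gb: float, revisions: int, changed: float) -> float:
    """Chunk-level dedup stores the file once, then only the changed
    chunks of each subsequent revision."""
    return file_gb + (revisions - 1) * file_gb * changed

print(lfs_storage_gb(20, 10))            # 200 GB stored under Git LFS
print(chunked_storage_gb(20, 10, 0.01))  # roughly 21.8 GB with chunk-level dedup
```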