Merged
Changes from 1 commit
Commits
25 commits
e2eb575
Initial Xet docs (incomplete)
rajatarya Mar 4, 2025
48a4ac2
reformat and move LFS to bottom
jsulz Mar 7, 2025
25952d6
first pass at repositioning Xet first, LFS last
jsulz Mar 7, 2025
383d1a9
grammar and flow nits
jsulz Mar 7, 2025
966fb1f
Add to index.md
rajatarya Mar 8, 2025
b5efe6c
working deduplication section in and fixing some grammar nits
jsulz Mar 10, 2025
45f7251
refining 'using xet storage' section
jsulz Mar 10, 2025
5e400e7
worked on 'recommendations' section
jsulz Mar 10, 2025
6c84ac0
pass through for flow and verbiage
jsulz Mar 10, 2025
82960a9
images uploaded and formatted
jsulz Mar 10, 2025
1b547b4
dropping architecture overview; will move to xet-core
jsulz Mar 13, 2025
5b02fea
updating link placement
jsulz Mar 13, 2025
641eee9
incorporating feedback
jsulz Mar 13, 2025
86e967f
adding callout to join the waitlist and links to huggingface_hub docs
jsulz Mar 13, 2025
b02362c
minor flow nit
jsulz Mar 13, 2025
779df1c
TOC and index consistency with page title
jsulz Mar 13, 2025
c5c8248
Update docs/hub/repositories-storage.md
julien-c Mar 14, 2025
e8ee3fd
Apply suggestions from code review
jsulz Mar 14, 2025
307e5b7
Apply suggestions from code review
jsulz Mar 14, 2025
aee5d69
Update docs/hub/repositories-storage.md
jsulz Mar 14, 2025
248bfb9
rename file
hanouticelina Mar 14, 2025
1e2e36d
align repositories index page with toctree
jsulz Mar 14, 2025
e0ce581
Apply suggestions from code review
jsulz Mar 14, 2025
bc6d50e
Added a brief paragraph about security
ylow Mar 14, 2025
3aeae1e
updated xet cache link
jsulz Mar 14, 2025
2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -31,6 +31,8 @@
title: Next Steps
- local: repositories-licenses
title: Licenses
- local: repositories-storage
title: Storage
- title: Models
local: models
isExpanded: true
118 changes: 118 additions & 0 deletions docs/hub/repositories-storage.md
@@ -0,0 +1,118 @@
# Storage

## Intro

Repositories on the Hugging Face Hub differ from those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files that are considerably different from those used to build traditional software.

They are:

- Large - in the range of GB or TB
- Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet))

Managing these files in a Git repository has traditionally meant using [Git LFS](https://git-lfs.com/), a Git extension.

## Git LFS

Git LFS is used when working with files larger than 10MB or whose extensions are present in a `.gitattributes` file:

![ ADD IMAGE OF .gitattributes here ]
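
As a rough illustration of the rule above, the following sketch decides whether a file would be handled by Git LFS. The `.gitattributes` content, the `lfs_patterns` and `uses_lfs` helpers, and the reading of "10MB" as decimal megabytes are assumptions made for this example. Real Git attribute matching has more rules than `fnmatch` covers.

```python
from fnmatch import fnmatch

# Example .gitattributes content (illustrative, not copied from a specific repository).
GITATTRIBUTES = """\
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
"""

# "10MB" from the text; treating it as decimal megabytes is an assumption.
SIZE_THRESHOLD = 10 * 1000 * 1000


def lfs_patterns(gitattributes: str) -> list[str]:
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def uses_lfs(filename: str, size_bytes: int, gitattributes: str = GITATTRIBUTES) -> bool:
    """Hypothetical helper: True if the file is large or matches an LFS pattern."""
    if size_bytes > SIZE_THRESHOLD:
        return True
    return any(fnmatch(filename, pattern) for pattern in lfs_patterns(gitattributes))
```

With this sketch, a small `model.safetensors` is routed to LFS because of its extension, while a small `README.md` stays in regular Git storage.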

Instead of storing these files alongside the rest of the content in the repository, Git LFS routes their content to remote storage designed for large objects.

Git LFS then creates a "pointer file" which is stored in the repository for the given revision:

![Example from a Hub repository](attachment:75d5c684-4245-47a1-bf3d-9b980ae26043:Screenshot_2025-02-24_at_8.38.24_AM.png)

Example from a Hub repository

The fields in a pointer file that you will see on the Hub are:

- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents.
- **Pointer size**: The size of the pointer file stored in the Git repository.
- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations.

As you can see, the pointer file is much smaller than the remote file, allowing the repository itself to remain small. This is especially important when cloning or pulling a repository with Git, as only the remote files for the specific commit are transferred instead of every revision of each remote file.
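
To make the size difference concrete, here is a minimal sketch that builds a pointer in the layout defined by the Git LFS pointer specification (`version`, `oid sha256:...`, `size`). The `lfs_pointer` helper name is an assumption for this example. The Hub's "SHA256" and "Size of the remote file" fields correspond to the `oid` and `size` lines.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build Git LFS pointer text for the given file contents."""
    oid = hashlib.sha256(content).hexdigest()  # SHA-256 of the real file
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"  # size of the real file in bytes
    )


# A 1MB stand-in for a large binary file.
content = b"\0" * 1_000_000
pointer = lfs_pointer(content)
print(pointer)
print(f"pointer: {len(pointer)} bytes, remote file: {len(content)} bytes")
```

The pointer stays well under a kilobyte no matter how large the remote file is, because it holds only a hash and a size.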

The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store the files of millions of model, dataset, and Space repositories (45PB total as of this writing).

The main limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overhead in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine).

This leads to a worse developer experience along with a proliferation of additional storage.

## Xet

[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace LFS on the Hub.

Like LFS, a Xet-backed repository utilizes S3 as the remote storage and stores pointer files in the repository.

![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png)

Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.

Unlike LFS, Xet-enabled repositories use [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of chunks (~64KB of data) within the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded.
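
The idea can be sketched in a few lines of Python. This is not the Xet implementation: the gear-style rolling hash, the scaled-down size parameters, the in-memory `store`, and the `chunk` and `upload` names are all assumptions chosen to keep the example small. It only shows how chunk boundaries can be derived from content and how previously seen chunks can be skipped.

```python
import hashlib
import random

# Hypothetical parameters, scaled down from the ~64KB target so the demo runs quickly.
MIN_SIZE = 256           # never cut a chunk shorter than this
MAX_SIZE = 4096          # always cut once a chunk reaches this size
MASK = (1 << 10) - 1     # cut when the low 10 hash bits are zero (~1KB average)

# Fixed table of random 64-bit values, one per byte value (gear-hash style).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: the low bits depend only on the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(data: bytes, store: dict[str, bytes]) -> int:
    """Add unseen chunks to the store and return how many were new."""
    new = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in store:  # previously seen chunks are skipped
            store[key] = c
            new += 1
    return new
```

Because boundaries depend on content rather than on byte offsets, uploading a second version of a file with a small edit will typically add only the chunks around the edit. Uploading identical bytes a second time adds nothing.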

The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories.
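
The arithmetic behind that figure, and a rough comparison of upload sizes, can be checked with a short sketch. It assumes decimal units, a fixed 64KB chunk size (real chunks are variable-sized), and ignores block aggregation and metadata. `upload_bytes` is a hypothetical helper.

```python
CHUNK_SIZE = 64 * 1000       # 64KB, decimal units to match the figure in the text
FILE_SIZE = 20 * 1000**3     # 20GB

total_chunks = FILE_SIZE // CHUNK_SIZE
print(total_chunks)


def upload_bytes(changed_chunks: int) -> dict[str, int]:
    """Bytes uploaded for a new revision: whole file for LFS, changed chunks for Xet."""
    return {"lfs": FILE_SIZE, "xet": changed_chunks * CHUNK_SIZE}


print(upload_bytes(changed_chunks=10))
```

Under these assumptions, a revision that touches ten chunks uploads 640KB of chunk data instead of the full 20GB file.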

Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in between). There are four primary components to the Xet architecture:

1. Client
2. Hugging Face Hub
3. Content addressed store (CAS)
4. Amazon S3

![IMAGE OF XET ARCHITECTURE]

### Client

The client represents whatever machine is uploading or downloading a file. Current support is limited to [the Python package, `hf_xet`](https://pypi.org/project/hf-xet/), which integrates `huggingface_hub` with Xet-backed repositories.

When uploading files to the Hub, `hf_xet` chunks the files into immutable content-defined chunks and deduplicates - ignoring previously seen chunks and only uploading new ones.

On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks.
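
The download path described above can be sketched as follows. This is not the `hf_xet` API: `download_file`, the list of chunk hashes standing in for the reconstruction information, and the `fetch` callback are hypothetical names used only to show the cache comparison.

```python
from typing import Callable


def download_file(
    chunk_hashes: list[str],
    cache: dict[str, bytes],
    fetch: Callable[[list[str]], dict[str, bytes]],
) -> bytes:
    """Rebuild a file from its ordered chunk hashes, fetching only uncached chunks."""
    # Deduplicate while preserving order, then drop anything already cached.
    missing = [h for h in dict.fromkeys(chunk_hashes) if h not in cache]
    if missing:
        cache.update(fetch(missing))  # a single request for the uncached chunks
    return b"".join(cache[h] for h in chunk_hashes)
```

A chunk that appears several times in a file, or that is already in the local cache from an earlier download, is never requested again.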

### Hugging Face Hub

The Hub backend manages the Git repository, authentication & authorization, and metadata about both the files and repository. The Hub communicates with the client and CAS.

### Content Addressed Store (CAS)

The content addressed store (CAS) is more than just a store - it is a set of services that expose APIs for uploading and downloading Xet-backed files, backed by a key-value store (DynamoDB) that maps hashed content and metadata to its location in S3.

The primary APIs are used for:

1. Uploading blocks: Verifies the contents of the uploaded blocks, and then writes them to the appropriate S3 bucket.
2. Uploading shards: Verifies the contents of the uploaded shards, writes them to the appropriate S3 bucket, and registers the shard in CAS.
3. Downloading file reconstruction information: Given the `Xet backed hash` field from a pointer file, assembles the manifest necessary to rebuild the file and returns it to the client, which downloads the relevant blocks directly from S3 using presigned URLs.
4. Check storage location: Given the `LFS SHA256 hash`, returns whether Xet or LFS manages the content. This is a critical part of migration & compatibility with the legacy LFS storage system.
5. LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics an LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` or the web interface of the Hub to download files).
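
A toy, in-memory model can make the relationship between these APIs easier to follow. Everything here is an assumption for illustration: the `ToyCAS` class, its method names, the `(block id, start, end)` manifest layout, and `register_file` as a simplified stand-in for shard upload. Content verification and presigned URLs are omitted.

```python
from dataclasses import dataclass, field

# One manifest entry: (block id, start offset, end offset) within that block.
Range = tuple[str, int, int]


@dataclass
class ToyCAS:
    """In-memory stand-in for the key-value store and S3 (illustrative only)."""
    blocks: dict[str, bytes] = field(default_factory=dict)       # block id -> bytes ("S3")
    files: dict[str, list[Range]] = field(default_factory=dict)  # Xet hash -> manifest
    xet_by_sha256: dict[str, str] = field(default_factory=dict)  # LFS SHA256 -> Xet hash

    def upload_block(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def register_file(self, xet_hash: str, sha256: str, manifest: list[Range]) -> None:
        self.files[xet_hash] = manifest
        self.xet_by_sha256[sha256] = xet_hash

    def reconstruction(self, xet_hash: str) -> list[Range]:
        """Return the manifest needed to rebuild a file."""
        return self.files[xet_hash]

    def storage_location(self, sha256: str) -> str:
        """Report which system manages the content for an LFS SHA256 hash."""
        return "xet" if sha256 in self.xet_by_sha256 else "lfs"


def rebuild(cas: ToyCAS, xet_hash: str) -> bytes:
    """Client-side reconstruction: concatenate the byte ranges named in the manifest."""
    return b"".join(cas.blocks[b][start:end] for b, start, end in cas.reconstruction(xet_hash))
```

In this model a file is just an ordered list of ranges into stored blocks, so two files that share content can point at the same block without storing it twice.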

### AWS S3

S3 stores the blocks and shards. It provides resiliency, availability, and fast access, with [CloudFront](https://aws.amazon.com/cloudfront/) serving as a CDN.

### Upload Sequence Diagram

![new-writes.png](attachment:006a81c4-8ec6-4c78-a1a1-d47c3e4dd543:new-writes.png)

### Download Sequence Diagram

![new-reads.png](attachment:337bb67d-bad4-4e27-a9c5-179d6ae746aa:new-reads.png)

### Backward Compatibility with LFS

Xet Storage provides a seamless transition for existing Hub repositories. You do not need to know whether the Xet backend is involved at all. Xet-backed repositories continue to use the LFS pointer file format, with only the addition of the `Xet backed hash` field. As a result, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each large (or binary) file continues to have a pointer file that matches the Git LFS pointer file specification.

This symmetry allows non-Xet-enabled clients (e.g., older versions of `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, a mixture of LFS-backed and Xet-backed files is supported within a single repository. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in LFS or Xet storage, allowing downstream services (LFS or the LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content.

While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file locally, a legacy client will get an S3 URL from the LFS bridge. Similarly, when uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS, while a non-Xet-aware client will upload through LFS and a background process will convert the file revision to a Xet-backed version.
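
The routing described in this section can be summarized as a small decision function. The function names and the returned strings are hypothetical labels for the paths named above, not identifiers from the Hub or `hf_xet`.

```python
def download_route(client_is_xet_aware: bool, stored_in: str) -> str:
    """Pick the download path for a file, given where its content lives ("lfs" or "xet")."""
    if stored_in == "lfs":
        return "S3 URL from LFS"
    if client_is_xet_aware:
        return "reconstruction info from CAS"
    return "S3 URL from the LFS bridge"


def upload_route(client_is_xet_aware: bool) -> str:
    """Pick the upload path for an update to a Xet-backed file."""
    if client_is_xet_aware:
        return "CDC deduplication, upload through CAS"
    return "upload through LFS, converted to Xet in the background"
```

Either way the client ends up with the same file contents. Only a Xet-aware client benefits from chunk-level deduplication during the transfer itself.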

### Deduplication

### Security Model

### Recommendations

#### Current Limitations
#### Best Practices

### Using Xet Storage
1 change: 1 addition & 0 deletions docs/hub/repositories.md
@@ -16,3 +16,4 @@ In these pages, you will go over the basics of getting started with Git and inte
- [Repository storage limits](./storage-limits)
- [Next Steps](./repositories-next-steps)
- [Licenses](./repositories-licenses)
- [Storage](./repositories-storage)