From e2eb575bb006d48dc4d862722c20cd6f62b274f0 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Mon, 3 Mar 2025 17:07:07 -0800 Subject: [PATCH 01/25] Initial Xet docs (incomplete) --- docs/hub/_toctree.yml | 2 + docs/hub/repositories-storage.md | 118 +++++++++++++++++++++++++++++++ docs/hub/repositories.md | 1 + 3 files changed, 121 insertions(+) create mode 100644 docs/hub/repositories-storage.md diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 611d832f0..f234c6bb9 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -31,6 +31,8 @@ title: Next Steps - local: repositories-licenses title: Licenses + - local: repositories-storage + title: Storage - title: Models local: models isExpanded: true diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md new file mode 100644 index 000000000..3720ffff1 --- /dev/null +++ b/docs/hub/repositories-storage.md @@ -0,0 +1,118 @@ +# Storage + +## Intro + +Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files considerably different files from those used to build traditional software. + +They are: + +- Large - in the range of GB or TB +- Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) + +To manage these in a Git repository has traditionally meant using [Git LFS](https://git-lfs.com/), a Git extension. + +## Git LFS + +Git LFS is utilized when working files larger than 10MB or whose extensions are present in a `.gitattributes` file: + +![ ADD IMAGE OF .gitattributes here ] + +Instead of storing these alongside the rest of the content in the repository, Git LFS routes the content in a remote storage designed for large objects. 
+ +Git LFS then creates a "pointer file" which is stored in the repository for the given revision: + +![Example from a Hub repository](attachment:75d5c684-4245-47a1-bf3d-9b980ae26043:Screenshot_2025-02-24_at_8.38.24_AM.png) + +Example from a Hub repository + +The fields in a pointer file that you will see on the Hub are: + +- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. +- **Pointer size**: The size of the pointer file stored in the Git repository. +- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. + +As you can see, the pointer file size is much smaller than the remote file, allowing the repository itself to remain small. This is especially important when working with a repository using Git, as only remote files at the specific commit are transferred instead of each revision of the remote file. + +The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). + +The main limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). + +This leads to a worse developer experience along with a proliferation of additional storage. 
+ +## Xet + +[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage started based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace LFS on the Hub. + +Like LFS, a Xet-backed repository utilizes S3 as the remote storage and stores pointer files in the repository. + +![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) + +Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage. + +Unlike LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, it's contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. + +The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. 
+ +Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in-between). There are 4 primary components to the Xet architecture: + +1. Client +2. Hugging Face Hub +3. Content addressed store (CAS) +4. Amazon S3 + +![IMAGE OF XET ARCHITECTURE] + +### Client + +The client represents whatever machine is uploading or downloading a file. Current support is limited to [the Python package, `hf_xet`](https://pypi.org/project/hf-xet/), which provides an integration with the `huggingface_hub` and Xet-backed repositories. + +When uploading files to Hub, `hf_xet` chunks the files into immutable content-defined chunks and deduplicates - ignoring previously seen chunks and only uploading new ones. + +On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks. + +### Hugging Face Hub + +The Hub backend manages the Git repository, authentication & authorization, and metadata about both the files and repository. The Hub communicates with the client and CAS. + +### Content Addressed Store (CAS) + +The content addressed store (CAS) is more than just a store - it is set of services that exposes APIs for supporting uploading and downloading Xet-backed files with a key-value store (DynamoDb) mapping hashed content and metadata to its location in S3. + +The primary APIs are used for: + +1. Uploading blocks: Verifies the contents of the uploaded blocks, and then writes them to the appropriate S3 bucket. +2. Uploading shards: Verifies the contents of the uploaded shards, writes them to the appropriate S3 bucket, and registers the shard in CAS +3. Downloading file reconstruction information: Given the `Xet backed hash` field from a pointer file organize the manifest necessary to rebuild the file. 
Return the manifest to the client for direct download from S3 using presigned URLs for the relevant blocks to download. +4. Check storage location: Given the `LFS SHA256 hash` this returns if Xet or LFS manages the content. This is a critical part of migration & compatibility with the legacy LFS storage system. +5. LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics an LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` of the web interface of the Hub to download files). + +### AWS S3 + +S3 stores the blocks and shards. It provides resiliency, availability, and fast access leveraging [Cloudfront](https://aws.amazon.com/cloudfront/) as a CDN. + +### Upload Sequence Diagram + +![new-writes.png](attachment:006a81c4-8ec6-4c78-a1a1-d47c3e4dd543:new-writes.png) + +### Download Sequence Diagram + +![new-reads.png](attachment:337bb67d-bad4-4e27-a9c5-179d6ae746aa:new-reads.png) + +### Backward Compatibility with LFS + +Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. + +This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of LFS and Xet backed files are supported. 
As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in LFS or Xet storage, allowing downstream services (LFS or the LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. + +While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through LFS and a background process will convert the file revision to a Xet-backed version. + +### Deduplication + +### Security Model + +### Recommendations + +#### Current Limitations +#### Best Practices + +### Using Xet Storage diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 125129968..969711b85 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -16,3 +16,4 @@ In these pages, you will go over the basics of getting started with Git and inte - [Repository storage limits](./storage-limits) - [Next Steps](./repositories-next-steps) - [Licenses](./repositories-licenses) +- [Storage](./repositories-storage) From 48a4ac29a37f9dc2866c34c9bbcc3e145b17954a Mon Sep 17 00:00:00 2001 From: jsulz Date: Fri, 7 Mar 2025 11:57:49 -0800 Subject: [PATCH 02/25] reformat and move LFS to bottom --- docs/hub/repositories-storage.md | 71 ++++++++++++++++---------------- 1 file changed, 36 insertions(+), 35 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 3720ffff1..6c263339b 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -2,54 +2,26 @@ ## Intro -Repositories on the Hugging Face Hub are unique to those on software development platforms. 
While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files considerably different files from those used to build traditional software. +Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files considerably different files from those used to build traditional software. They are: - Large - in the range of GB or TB - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) -To manage these in a Git repository has traditionally meant using [Git LFS](https://git-lfs.com/), a Git extension. - -## Git LFS - -Git LFS is utilized when working files larger than 10MB or whose extensions are present in a `.gitattributes` file: - -![ ADD IMAGE OF .gitattributes here ] - -Instead of storing these alongside the rest of the content in the repository, Git LFS routes the content in a remote storage designed for large objects. - -Git LFS then creates a "pointer file" which is stored in the repository for the given revision: - -![Example from a Hub repository](attachment:75d5c684-4245-47a1-bf3d-9b980ae26043:Screenshot_2025-02-24_at_8.38.24_AM.png) - -Example from a Hub repository - -The fields in a pointer file that you will see on the Hub are: - -- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. -- **Pointer size**: The size of the pointer file stored in the Git repository. -- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. 
- -As you can see, the pointer file size is much smaller than the remote file, allowing the repository itself to remain small. This is especially important when working with a repository using Git, as only remote files at the specific commit are transferred instead of each revision of the remote file. - -The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). - -The main limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). - -This leads to a worse developer experience along with a proliferation of additional storage. +To manage these in a Git repository has traditionally meant using [Git LFS](https://git-lfs.com/), a Git extension. While Git LFS is still in use on the Hub (see the [Legacy section below](#legacy-storage-git-lfs)), repositories are transitioning to a new custom storage system to enable faster file transfers. ## Xet [In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage started based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace LFS on the Hub. -Like LFS, a Xet-backed repository utilizes S3 as the remote storage and stores pointer files in the repository. +Like LFS, a Xet-backed repository utilizes S3 as the remote storage and stores pointer files in the repository. 
![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage. -Unlike LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, it's contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. +Unlike LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, it's contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. 
By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. @@ -68,7 +40,7 @@ The client represents whatever machine is uploading or downloading a file. Curre When uploading files to Hub, `hf_xet` chunks the files into immutable content-defined chunks and deduplicates - ignoring previously seen chunks and only uploading new ones. -On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks. +On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks. ### Hugging Face Hub @@ -100,11 +72,11 @@ S3 stores the blocks and shards. It provides resiliency, availability, and fast ### Backward Compatibility with LFS -Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. +Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. 
Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in LFS or Xet storage, allowing downstream services (LFS or the LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. -While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through LFS and a background process will convert the file revision to a Xet-backed version. +While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through LFS and a background process will convert the file revision to a Xet-backed version. 
### Deduplication @@ -113,6 +85,35 @@ While a Xet-aware client will receive file reconstruction information from CAS t ### Recommendations #### Current Limitations + #### Best Practices ### Using Xet Storage + +## Legacy Storage: Git LFS + +Git LFS is utilized when working files larger than 10MB or whose extensions are present in a `.gitattributes` file: + +![ ADD IMAGE OF .gitattributes here ] + +Instead of storing these alongside the rest of the content in the repository, Git LFS routes the content in a remote storage designed for large objects. + +Git LFS then creates a "pointer file" which is stored in the repository for the given revision: + +![Example from a Hub repository](attachment:75d5c684-4245-47a1-bf3d-9b980ae26043:Screenshot_2025-02-24_at_8.38.24_AM.png) + +Example from a Hub repository + +The fields in a pointer file that you will see on the Hub are: + +- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. +- **Pointer size**: The size of the pointer file stored in the Git repository. +- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. + +As you can see, the pointer file size is much smaller than the remote file, allowing the repository itself to remain small. This is especially important when working with a repository using Git, as only remote files at the specific commit are transferred instead of each revision of the remote file. + +The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). 
+ +The main limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). + +This leads to a worse developer experience along with a proliferation of additional storage. From 25952d67a6df7cb8e42a31e1cb74fc560941715f Mon Sep 17 00:00:00 2001 From: jsulz Date: Fri, 7 Mar 2025 13:19:58 -0800 Subject: [PATCH 03/25] first pass at repositioning Xet first, LFS last --- docs/hub/repositories-storage.md | 42 ++++++++++++-------------------- 1 file changed, 16 insertions(+), 26 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 6c263339b..1b22dc823 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -9,17 +9,27 @@ They are: - Large - in the range of GB or TB - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) -To manage these in a Git repository has traditionally meant using [Git LFS](https://git-lfs.com/), a Git extension. While Git LFS is still in use on the Hub (see the [Legacy section below](#legacy-storage-git-lfs)), repositories are transitioning to a new custom storage system to enable faster file transfers. +Storing these files directly in a Git repository is impractical. Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions, which can be prohibitively large for massive binaries. 
Instead, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like Amazon S3). As a result, the repository stays small and typical Git workflows remain efficient. + +Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development to enable faster file transfers. ## Xet [In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage started based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace LFS on the Hub. -Like LFS, a Xet-backed repository utilizes S3 as the remote storage and stores pointer files in the repository. +Like LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. -![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) +[!IMAGEHEREOFGITATTRIBUTES] + +Meanwhile, the pointer files provides metadata to locate the actual file contents in remote storage: -Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage. +- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. 
+- **Pointer size**: The size of the pointer file stored in the Git repository. +- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. + +A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. + +![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) Unlike LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, it's contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. @@ -92,28 +102,8 @@ While a Xet-aware client will receive file reconstruction information from CAS t ## Legacy Storage: Git LFS -Git LFS is utilized when working files larger than 10MB or whose extensions are present in a `.gitattributes` file: - -![ ADD IMAGE OF .gitattributes here ] - -Instead of storing these alongside the rest of the content in the repository, Git LFS routes the content in a remote storage designed for large objects. 
- -Git LFS then creates a "pointer file" which is stored in the repository for the given revision: - -![Example from a Hub repository](attachment:75d5c684-4245-47a1-bf3d-9b980ae26043:Screenshot_2025-02-24_at_8.38.24_AM.png) - -Example from a Hub repository - -The fields in a pointer file that you will see on the Hub are: - -- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. -- **Pointer size**: The size of the pointer file stored in the Git repository. -- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. - -As you can see, the pointer file size is much smaller than the remote file, allowing the repository itself to remain small. This is especially important when working with a repository using Git, as only remote files at the specific commit are transferred instead of each revision of the remote file. - -The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). +The legacy storage system on the Hub, Git LFS utilizes many of the same conventions as Xet-backed repositories. The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). 
-The main limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). +The primary limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). This leads to a worse developer experience along with a proliferation of additional storage. From 383d1a993c5db5e05abe7ef4e6015444a84483fb Mon Sep 17 00:00:00 2001 From: jsulz Date: Fri, 7 Mar 2025 15:59:12 -0800 Subject: [PATCH 04/25] grammar and flow nits --- docs/hub/repositories-storage.md | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 1b22dc823..8dc79f757 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -2,24 +2,26 @@ ## Intro -Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files considerably different files from those used to build traditional software. +Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files that are considerably different from those used to build traditional software. 
They are: - Large - in the range of GB or TB - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) -Storing these files directly in a Git repository is impractical. Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions, which can be prohibitively large for massive binaries. Instead, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like Amazon S3). As a result, the repository stays small and typical Git workflows remain efficient. +Storing these files directly in a Git repository is impractical. Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing users to download gigabytes of historic data they may never need. -Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development to enable faster file transfers. +Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like Amazon S3). As a result, the repository stays small and typical Git workflows remain efficient. 
+ +Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. ## Xet -[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace LFS on the Hub. +[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. -Like LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. +Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. -[!IMAGEHEREOFGITATTRIBUTES] +![ ADD IMAGE OF .gitattributes here ] Meanwhile, the pointer files provides metadata to locate the actual file contents in remote storage: @@ -27,11 +29,11 @@ Meanwhile, the pointer files provides metadata to locate the actual file content - **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. - **Pointer size**: The size of the pointer file stored in the Git repository. - **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations.
-A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. +A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. -![Xet pointer files are nearly identical to LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) +![Xet pointer files are nearly identical to Git LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) -Unlike LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, it's contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. +Unlike Git LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. 
New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. @@ -65,8 +67,8 @@ The primary APIs are used for: 1. Uploading blocks: Verifies the contents of the uploaded blocks, and then writes them to the appropriate S3 bucket. 2. Uploading shards: Verifies the contents of the uploaded shards, writes them to the appropriate S3 bucket, and registers the shard in CAS 3. Downloading file reconstruction information: Given the `Xet backed hash` field from a pointer file organize the manifest necessary to rebuild the file. Return the manifest to the client for direct download from S3 using presigned URLs for the relevant blocks to download. -4. Check storage location: Given the `LFS SHA256 hash` this returns if Xet or LFS manages the content. This is a critical part of migration & compatibility with the legacy LFS storage system. -5. LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics an LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` of the web interface of the Hub to download files). +4. Check storage location: Given the `LFS SHA256 hash` this returns if Xet or Git LFS manages the content. 
This is a critical part of migration & compatibility with the legacy Git LFS storage system. +5. Git LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics a Git LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` or the web interface of the Hub to download files). ### AWS S3 @@ -82,11 +84,11 @@ S3 stores the blocks and shards. It provides resiliency, availability, and fast ### Backward Compatibility with LFS -Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. +Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. -This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of LFS and Xet backed files are supported.
As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in LFS or Xet storage, allowing downstream services (LFS or the LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. +This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. -While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file locally, a legacy client will get an S3 URL from the LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through LFS and a background process will convert the file revision to a Xet-backed version. +While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file locally, a legacy client will get an S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version. ### Deduplication @@ -104,6 +106,6 @@ While a Xet-aware client will receive file reconstruction information from CAS t The legacy storage system on the Hub, Git LFS utilizes many of the same conventions as Xet-backed repositories. The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/).
When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). -The primary limitation of LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). +The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). This leads to a worse developer experience along with a proliferation of additional storage. From 966fb1f847bc4a93e255d7d8df7e2fd5d0b81fdf Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 7 Mar 2025 16:20:30 -0800 Subject: [PATCH 05/25] Add to index.md --- docs/hub/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/hub/index.md b/docs/hub/index.md index 7e710923b..605838bd6 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -16,6 +16,7 @@ The Hugging Face Hub is a platform with over 900k models, 200k datasets, and 300 Webhooks Next Steps Licenses +Storage
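The legacy scheme described in the patches above (file contents named by their SHA hash, with a small pointer file kept in Git) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Hub's implementation: `make_lfs_pointer` and `parse_lfs_pointer` are hypothetical helpers, and the three-line layout follows the public Git LFS pointer specification rather than anything in this patch series.

```python
import hashlib

# Version line defined by the public Git LFS pointer specification.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Hypothetical helper: build Git LFS pointer text for some file content."""
    oid = hashlib.sha256(content).hexdigest()  # the hash that names the remote object
    return f"version {LFS_SPEC_URL}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: read the key/value lines of a pointer back into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


payload = b"\x00" * 1_000_000  # stand-in for a large binary file
pointer = make_lfs_pointer(payload)
info = parse_lfs_pointer(pointer)

print(pointer)
print(f"pointer: {len(pointer.encode())} bytes, remote file: {info['size']} bytes")
```

The pointer stays a few hundred bytes at most regardless of the payload size, which is why the Git repository itself remains small. Any change to the payload produces a different `oid`, so the whole file is stored again; that is the file-level limitation the patches describe.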
From b5efe6cfd23cf653e696ab2764370cec7f46eb12 Mon Sep 17 00:00:00 2001 From: jsulz Date: Mon, 10 Mar 2025 11:18:52 -0700 Subject: [PATCH 06/25] working deduplication section in and fixing some grammar nits --- docs/hub/repositories-storage.md | 38 ++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 8dc79f757..0cca7b80d 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -1,7 +1,5 @@ # Storage -## Intro - Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files that are considerably different from those used to build traditional software. They are: @@ -9,7 +7,7 @@ They are: - Large - in the range of GB or TB - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) -Storing these files directly in a Git repository is impractical. Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing users to download gigabytes of historic data they may never need. +Storing these files directly in a Git repository is impractical. Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. 
Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like Amazon S3). As a result, the repository stays small and typical Git workflows remain efficient. @@ -23,7 +21,7 @@ Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a ` ![ ADD IMAGE OF .gitattributes here ] -Meanwhile, the pointer files provides metadata to locate the actual file contents in remote storage: +Meanwhile, the pointer files provide metadata to locate the actual file contents in remote storage: - **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. - **Pointer size**: The size of the pointer file stored in the Git repository. @@ -33,9 +31,19 @@ A Xet pointer includes all of this information (by design; refer to the section ![Xet pointer files are nearly identical to Git LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) -Unlike Git LFS, Xet-enabled repositories utilize [content defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data) for the large binary files found in Model and Dataset repositories. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. New chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded while previously seen chunks are discarded. +Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. 
When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows like incremental updates to model checkpoints or appending or inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. + +### Deduplication + +Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a `chunk`). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, all other chunks are discarded. + +To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash. -The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. 
+The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. + +For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. + +### Architecture Overview Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in-between). There are 4 primary components to the Xet architecture: @@ -46,7 +54,7 @@ Supporting this requires coordination between the storage layer and the local ma ![IMAGE OF XET ARCHITECTURE] -### Client +#### Client The client represents whatever machine is uploading or downloading a file. Current support is limited to [the Python package, `hf_xet`](https://pypi.org/project/hf-xet/), which provides an integration with the `huggingface_hub` and Xet-backed repositories. @@ -54,11 +62,11 @@ When uploading files to Hub, `hf_xet` chunks the files into immutable content-de On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. 
This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks. -### Hugging Face Hub +#### Hugging Face Hub The Hub backend manages the Git repository, authentication & authorization, and metadata about both the files and repository. The Hub communicates with the client and CAS. -### Content Addressed Store (CAS) +#### Content Addressed Store (CAS) The content addressed store (CAS) is more than just a store - it is a set of services that expose APIs for supporting uploading and downloading Xet-backed files with a key-value store (DynamoDB) mapping hashed content and metadata to its location in S3. @@ -70,28 +78,26 @@ The primary APIs are used for: 4. Check storage location: Given the `LFS SHA256 hash` this returns if Xet or Git LFS manages the content. This is a critical part of migration & compatibility with the legacy Git LFS storage system. 5. Git LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics a Git LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` or the web interface of the Hub to download files). -### AWS S3 +#### AWS S3 S3 stores the blocks and shards. It provides resiliency, availability, and fast access leveraging [CloudFront](https://aws.amazon.com/cloudfront/) as a CDN. -### Upload Sequence Diagram +#### Upload Sequence Diagram ![new-writes.png](attachment:006a81c4-8ec6-4c78-a1a1-d47c3e4dd543:new-writes.png) -### Download Sequence Diagram +#### Download Sequence Diagram ![new-reads.png](attachment:337bb67d-bad4-4e27-a9c5-179d6ae746aa:new-reads.png) ### Backward Compatibility with LFS -Xet Storage provides a seamless transition for existing Hub repositories. It isn’t necessary to know if the Xet backend is involved at all.
Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. +Xet Storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file locally, a legacy client will get an S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version.
-### Deduplication - ### Security Model ### Recommendations @@ -104,7 +110,7 @@ While a Xet-aware client will receive file reconstruction information from CAS t ## Legacy Storage: Git LFS -The legacy storage system on the Hub, Git LFS utilizes many of the same conventions as Xet-backed repositories. The Hub’s Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories’ files (45PB total as of this writing). +The legacy storage system on the Hub, Git LFS utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories' files (45PB total as of this writing). The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine).
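The Deduplication section added in the patch above describes content-defined chunking: a rolling hash picks chunk boundaries from the file contents, so an insertion only disturbs nearby chunks. The sketch below illustrates that idea only. The gear-style rolling hash, the SHA-256 chunk identifiers, and the constants `MASK`, `MIN_CHUNK`, and `MAX_CHUNK` are assumptions chosen for a quick pure-Python demo (averaging roughly 1KB per chunk instead of ~64KB); the patch does not specify which hash or parameters Xet actually uses.

```python
import hashlib
import random
from typing import List

# Illustrative parameters, scaled down so the demo runs quickly in pure Python.
MASK = (1 << 10) - 1      # cut when the low 10 bits of the rolling hash are zero
MIN_CHUNK = 256           # never cut before this many bytes
MAX_CHUNK = 8192          # always cut once a chunk reaches this many bytes
U64 = (1 << 64) - 1

# One fixed pseudo-random 64-bit value per byte value (a "gear" table).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data at boundaries chosen from the content itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the low bits, so the low 10 bits of h
        # depend only on the most recent 10 bytes of content.
        h = ((h << 1) + GEAR[byte]) & U64
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def digest(c: bytes) -> str:
    """Identify a chunk by the SHA-256 of its bytes (an assumption for this demo)."""
    return hashlib.sha256(c).hexdigest()


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(200_000))
edited = original[:100_000] + b"a small insertion" + original[100_000:]

stored = {digest(c) for c in chunk(original)}   # chunks already "in storage"
new_chunks = [c for c in chunk(edited) if digest(c) not in stored]
new_bytes = sum(len(c) for c in new_chunks)
print(f"{len(new_chunks)} new chunk(s): {new_bytes} of {len(edited)} bytes to upload")
```

With fixed-size chunks, the 17 inserted bytes would shift every later boundary and make every later chunk look new. Because boundaries here are derived from the content, chunking realigns shortly after the edit and only the chunks around the insertion need to be uploaded.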
From 45f7251fe80aaa208720423c26347b6a7c6460b0 Mon Sep 17 00:00:00 2001 From: jsulz Date: Mon, 10 Mar 2025 12:59:52 -0700 Subject: [PATCH 07/25] refining 'using xet storage' section --- docs/hub/repositories-storage.md | 26 +++++++++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 0cca7b80d..53e6c974b 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -100,13 +100,33 @@ While a Xet-aware client will receive file reconstruction information from CAS t ### Security Model -### Recommendations +### Using Xet Storage -#### Current Limitations +The simplest way to start using Xet Storage is to install the `hf_xet` Python package when installing `huggingface_hub`. + +```bash +pip install huggingface_hub[hf_xet] +``` + +If you use the `transformers` or `datasets` libraries instead of making requests through `huggingface_hub` then simply install `hf_xet` directly. + +```bash +pip install hf-xet +``` + +If your Python environment has a `hf_xet`-aware version of `huggingface_hub` then your uploads and downloads will automatically use Xet. + +That's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using older `huggingface_hub` versions will still be able to upload and download repositories through the backwards compatibility provided by the LFS bridge. #### Best Practices -### Using Xet Storage +#### Current Limitations + +While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: + +- **64-bit systems only**: The hf_xet client currently requires a 64-bit architecture; 32-bit systems are not supported.
+- **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-based repositories; additional coverage is planned in future releases. +- **Full web support currently unavailable**: Full support for chunked uploads via the Hub web interface remains under development. ## Legacy Storage: Git LFS From 5e400e7d8f67dfe9a20e4e96ad7ab888c1ede6f5 Mon Sep 17 00:00:00 2001 From: jsulz Date: Mon, 10 Mar 2025 13:25:03 -0700 Subject: [PATCH 08/25] worked on 'recommendations' section --- docs/hub/repositories-storage.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 53e6c974b..35c2c22b3 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -118,7 +118,14 @@ If your Python environment has a `hf_xet`-aware version of `huggingface_hub` the That's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using older `huggingface_hub` versions will still be able to upload and download repositories through the backwards compatibility provided by the LFS bridge. -#### Best Practices +#### Recommendations + +Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: + +- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. +- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks upload, so frequent commits are both fast and storage-efficient. 
+- **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid routing small, text-based files through large-file storage for optimal performance. +- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. #### Current Limitations From 6c84ac020cd03188644ed1f93143fa02208e60fa Mon Sep 17 00:00:00 2001 From: jsulz Date: Mon, 10 Mar 2025 13:43:32 -0700 Subject: [PATCH 09/25] pass through for flow and verbiage --- docs/hub/repositories-storage.md | 36 ++++++++++++++++---------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 35c2c22b3..9040d96b8 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -1,15 +1,15 @@ # Storage -Repositories on the Hugging Face Hub are unique to those on software development platforms. While both leverage the benefits of modern version control with the support of Git, Hub repositories often contain files that are considerably different from those used to build traditional software. +Repositories on the Hugging Face Hub are unique to those on software development platforms. They contain files that are: -They are: - -- Large - in the range of GB or TB +- Large - in the range of GB and above - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) -Storing these files directly in a Git repository is impractical.
Not only are the storage systems behind Git repositories unsuited for such large files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. +While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. + +Storing these files directly in a Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. -Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like Amazon S3). As a result, the repository stays small and typical Git workflows remain efficient. +Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. 
While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. @@ -31,15 +31,15 @@ A Xet pointer includes all of this information (by design; refer to the section ![Xet pointer files are nearly identical to Git LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) -Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows like incremental updates to model checkpoints or appending or inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. +Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. ### Deduplication -Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a `chunk`). 
Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, all other chunks are discarded. +Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash. -The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to only notice that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few chunks) and securely deduplicates shared blocks across repositories. 
For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. +The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. @@ -47,10 +47,10 @@ For more details, refer to the [From Files to Chunks](https://huggingface.co/blo Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in-between). There are 4 primary components to the Xet architecture: -1. Client -2. Hugging Face Hub -3. Content addressed store (CAS) -4. Amazon S3 +1. [Client](#client) +2. [Hugging Face Hub](#hugging-face-hub) +3. [Content addressed store (CAS)](#content-addressed-store-cas) +4. [Amazon S3](#aws-s3) ![IMAGE OF XET ARCHITECTURE] @@ -92,7 +92,7 @@ S3 stores the blocks and shards. It provides resiliency, availability, and fast ### Backward Compatibility with LFS -Xet Storage provides a seamless transition for existing Hub repositories. 
It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. +Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. @@ -102,13 +102,13 @@ While a Xet-aware client will receive file reconstruction information from CAS t ### Using Xet Storage -To start using Xet Storage, the simplest way to get started is to install the `hf_xet` python package when installing `huggingface_hub`. 
+To start using Xet Storage, the simplest way to get started is to install the `hf_xet` python package when installing `huggingface_hub`: ```bash pip install huggingface_hub[hf_xet] ``` -If you use the `transformers` or `datasets` libraries instead of making requests through `huggingface_hub` then simply install `hf_xet` directly. +If you use the `transformers` or `datasets` libraries instead of making requests through `huggingface_hub` then simply install `hf_xet` directly: ```bash pip install hf-xet @@ -124,7 +124,7 @@ Xet integrates seamlessly with the Hub's current Python-based workflows. However - **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks upload, so frequent commits are both fast and storage-efficient. -- **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., _.safetensors, _.bin) to avoid routing small, text-based files through large-file storage for optimal performance. +- **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. #### Current Limitations @@ -132,7 +132,7 @@ Xet integrates seamlessly with the Hub's current Python-based workflows. 
However While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: - **64-bit systems only**: The hf_xet client currently requires a 64-bit architecture; 32-bit systems are not supported. -- **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-based repositories; additional coverage is planned in future releases. +- **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-backed repositories; additional coverage is planned in future releases. - **Full web support currently unavailable**: Full support for chunked uploads via the Hub web interface remains under development. ## Legacy Storage: Git LFS From 82960a9f5f933d8026647fbf85f042834519add4 Mon Sep 17 00:00:00 2001 From: jsulz Date: Mon, 10 Mar 2025 14:34:12 -0700 Subject: [PATCH 10/25] images uploaded and formatted --- docs/hub/repositories-storage.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 9040d96b8..d30930c45 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -19,7 +19,10 @@ Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) fo Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. -![ ADD IMAGE OF .gitattributes here ] +
+ + +
Meanwhile, the pointer files provide metadata to locate the actual file contents in remote storage: @@ -29,7 +32,10 @@ Meanwhile, the pointer files provide metadata to locate the actual file contents A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. -![Xet pointer files are nearly identical to Git LFS pointer files with the addition of a `Xet backed hash` field that is used for referencing the file in Xet storage.](attachment:9828eb0c-3c93-4a85-bb79-9daacbec3258:Screenshot_2025-02-24_at_9.37.36_AM.png) +
+ + +
Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. @@ -84,11 +90,15 @@ S3 stores the blocks and shards. It provides resiliency, availability, and fast #### Upload Sequence Diagram -![new-writes.png](attachment:006a81c4-8ec6-4c78-a1a1-d47c3e4dd543:new-writes.png) +
+ +
#### Download Sequence Diagram -![new-reads.png](attachment:337bb67d-bad4-4e27-a9c5-179d6ae746aa:new-reads.png) +
+ +
### Backward Compatibility with LFS From 1b547b44573a2ab0071734ff2d97d59abe6fd223 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 11:30:54 -0700 Subject: [PATCH 11/25] dropping architecture overview; will move to xet-core --- docs/hub/repositories-storage.md | 51 -------------------------------- 1 file changed, 51 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index d30930c45..b153fb710 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -49,57 +49,6 @@ The Hub's [current recommendation is to limit files to 20GB](https://huggingface For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. -### Architecture Overview - -Supporting this requires coordination between the storage layer and the local machine interacting with the repository (and all the systems in-between). There are 4 primary components to the Xet architecture: - -1. [Client](#client) -2. [Hugging Face Hub](#hugging-face-hub) -3. [Content addressed store (CAS)](#content-addressed-store-cas) -4. [Amazon S3](#aws-s3) - -![IMAGE OF XET ARCHITECTURE] - -#### Client - -The client represents whatever machine is uploading or downloading a file. Current support is limited to [the Python package, `hf_xet`](https://pypi.org/project/hf-xet/), which provides an integration with the `huggingface_hub` and Xet-backed repositories. - -When uploading files to Hub, `hf_xet` chunks the files into immutable content-defined chunks and deduplicates - ignoring previously seen chunks and only uploading new ones. - -On the download path, `hf_xet` communicates with CAS to get the reconstruction information for a file. 
This information is compared against the local chunk cache so that `hf_xet` only issues requests for uncached chunks. - -#### Hugging Face Hub - -The Hub backend manages the Git repository, authentication & authorization, and metadata about both the files and repository. The Hub communicates with the client and CAS. - -#### Content Addressed Store (CAS) - -The content addressed store (CAS) is more than just a store - it is set of services that exposes APIs for supporting uploading and downloading Xet-backed files with a key-value store (DynamoDb) mapping hashed content and metadata to its location in S3. - -The primary APIs are used for: - -1. Uploading blocks: Verifies the contents of the uploaded blocks, and then writes them to the appropriate S3 bucket. -2. Uploading shards: Verifies the contents of the uploaded shards, writes them to the appropriate S3 bucket, and registers the shard in CAS -3. Downloading file reconstruction information: Given the `Xet backed hash` field from a pointer file organize the manifest necessary to rebuild the file. Return the manifest to the client for direct download from S3 using presigned URLs for the relevant blocks to download. -4. Check storage location: Given the `LFS SHA256 hash` this returns if Xet or Git LFS manages the content. This is a critical part of migration & compatibility with the legacy Git LFS storage system. -5. Git LFS Bridge: Allows repositories using Xet storage to be accessed by legacy non-Xet-aware clients. The Bridge mimics an Git LFS server but does the work of reconstructing the requested file and returning it to the client. This allows downloading files through a single URL (so you can use tools like `curl` of the web interface of the Hub to download files). - -#### AWS S3 - -S3 stores the blocks and shards. It provides resiliency, availability, and fast access leveraging [Cloudfront](https://aws.amazon.com/cloudfront/) as a CDN. - -#### Upload Sequence Diagram - -
- -
- -#### Download Sequence Diagram - -
- -
- ### Backward Compatibility with LFS Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. From 5b02feae3f3ca730e4ae61a4fc3ca6205c47fe98 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 13:02:07 -0700 Subject: [PATCH 12/25] updating link placement --- docs/hub/_toctree.yml | 4 ++-- docs/hub/index.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index f234c6bb9..e1e67276c 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -25,14 +25,14 @@ title: "How-to: Create automatic metadata quality reports" - local: notebooks title: Notebooks + - local: repositories-storage + title: Storage Backends - local: storage-limits title: Storage Limits - local: repositories-next-steps title: Next Steps - local: repositories-licenses title: Licenses - - local: repositories-storage - title: Storage - title: Models local: models isExpanded: true diff --git a/docs/hub/index.md b/docs/hub/index.md index 605838bd6..2439bbf2b 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -14,9 +14,9 @@ The Hugging Face Hub is a platform with over 900k models, 200k datasets, and 300 Notifications Collections Webhooks +Storage Backend Next Steps Licenses -Storage
From 641eee9a7dfdf2a4ef57053364244459c7ca4b29 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 13:16:18 -0700 Subject: [PATCH 13/25] incorporating feedback --- docs/hub/repositories-storage.md | 49 ++++++++++++++++---------------- 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index b153fb710..2ffabd839 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -1,8 +1,8 @@ # Storage -Repositories on the Hugging Face Hub are unique to those on software development platforms. They contain files that are: +Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are: -- Large - in the range of GB and above +- Large - model or dataset files are in the range of GB and above. We have a few TB-scale files! - Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. @@ -15,7 +15,7 @@ Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) fo ## Xet -[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage started based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. 
+[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. @@ -39,29 +39,9 @@ A Xet pointer includes all of this information (by design; refer to the section Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. -### Deduplication - -Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. - -To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. 
Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash. - -The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. - -For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. - -### Backward Compatibility with LFS - -Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. - -This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. 
As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. - -While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version. - -### Security Model - ### Using Xet Storage -To start using Xet Storage, the simplest way to get started is to install the `hf_xet` python package when installing `huggingface_hub`: +To start using Xet Storage, you need a Xet-enabled client. Currently, you can do this by using the `hf_xet` python package when installing `huggingface_hub`: ```bash pip install huggingface_hub[hf_xet] @@ -93,6 +73,27 @@ While Xet brings fine-grained deduplication and enhanced performance to Git-base - **64-bit systems only**: The hf_xet client currently requires a 64-bit architecture; 32-bit systems are not supported. - **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-backed repositories; additional coverage is planned in future releases. - **Full web support currently unavailable**: Full support for chunked uploads via the Hub web interface remains under development. +- **Git client integration (git-xet**)**: Planned but remains under development. + +### Deduplication + +Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). 
Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-enabled client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. + +To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash. + +The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. + +For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. + +### Backward Compatibility with LFS + +Xet storage provides a seamless transition for existing Hub repositories. 
It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. As a result, existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file that matches the Git LFS pointer file specification. + +This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet-backed files is supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. + +While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file locally, a legacy client will get an S3 URL from the Git LFS bridge. Meanwhile, when uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS, while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version. 
+ +### Security Model ## Legacy Storage: Git LFS From 86e967ffbf112b750dee488757f1efd3d0172bf0 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 13:35:56 -0700 Subject: [PATCH 14/25] adding callout to join the waitlist and links to huggingface_hub docs --- docs/hub/repositories-storage.md | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 2ffabd839..53bd0497d 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -41,7 +41,16 @@ Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories d ### Using Xet Storage -To start using Xet Storage, you need a Xet-enabled client. Currently, you can do this by using the `hf_xet` python package when installing `huggingface_hub`: +To start using Xet Storage, you need a Xet-enabled repository and client. + + + +To make Xet the default for all your repositories, [join the waitlist](https://huggingface.co/join/xet)! You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all current repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default. + + + + +To access a Xet-enabled client, add the `hf_xet` Python package when installing `huggingface_hub`: ```bash pip install huggingface_hub[hf_xet] @@ -57,6 +66,11 @@ If your Python environment has a `hf_xet`-aware version of `huggingface_hub` the That's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using older `huggingface_hub` versions will still be able to upload and download repositories through the backwards compatibility provided by the LFS bridge. 
+To see more detailed usage docs, refer to the `huggingface_hub` docs for: +- [Upload](https://huggingface.co/docs/huggingface_hub/guides/upload#faster-uploads-with-hf_xet) +- [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hf_xet) +- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#xet-cache) + #### Recommendations Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: From b02362cf9940451215c60d99c3845f8d99d2dd81 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 13:42:00 -0700 Subject: [PATCH 15/25] minor flow nit --- docs/hub/repositories-storage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 53bd0497d..b6e839f93 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -24,7 +24,7 @@ Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `
-Meanwhile, the pointer files provide metadata to locate the actual file contents in remote storage: +Meanwhile, a Git LFS pointer file provide metadata to locate the actual file contents in remote storage: - **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. - **Pointer size**: The size of the pointer file stored in the Git repository. From 779df1c5ab0033ca7f76191ea4762bc655d4cbd4 Mon Sep 17 00:00:00 2001 From: jsulz Date: Thu, 13 Mar 2025 13:57:17 -0700 Subject: [PATCH 16/25] TOC and index consistency with page title --- docs/hub/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/index.md b/docs/hub/index.md index 2439bbf2b..ce6e513bb 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -14,7 +14,7 @@ The Hugging Face Hub is a platform with over 900k models, 200k datasets, and 300 Notifications Collections Webhooks -Storage Backend +Storage Backends Next Steps Licenses From c5c82483f52ca060682eeeb7333204682f0bf29c Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Fri, 14 Mar 2025 13:13:46 +0100 Subject: [PATCH 17/25] Update docs/hub/repositories-storage.md --- docs/hub/repositories-storage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index b6e839f93..eb52146e5 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -87,7 +87,7 @@ While Xet brings fine-grained deduplication and enhanced performance to Git-base - **64-bit systems only**: The hf_xet client currently requires a 64-bit architecture; 32-bit systems are not supported. - **Partial JavaScript library support**: The [huggingface.js](https://huggingface.co/docs/huggingface.js/index) library has limited functionality with Xet-backed repositories; additional coverage is planned in future releases. 
- **Full web support currently unavailable**: Full support for chunked uploads via the Hub web interface remains under development. -- **Git client integration (git-xet**)**: Planned but remains under development. +- **Git client integration (git-xet)**: Planned but remains under development. ### Deduplication From e8ee3fd8b046c197d0c5fda53a173ee1f48c9bec Mon Sep 17 00:00:00 2001 From: Jared Sulzdorf Date: Fri, 14 Mar 2025 07:00:21 -0700 Subject: [PATCH 18/25] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Julien Chaumond Co-authored-by: Célina --- docs/hub/index.md | 2 +- docs/hub/repositories-storage.md | 8 ++++---- docs/hub/repositories.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/hub/index.md b/docs/hub/index.md index ce6e513bb..c0e461e0a 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -14,7 +14,7 @@ The Hugging Face Hub is a platform with over 900k models, 200k datasets, and 300 Notifications Collections Webhooks -Storage Backends +Storage Backends Next Steps Licenses diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index eb52146e5..bd19f1c2b 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -30,7 +30,7 @@ Meanwhile, a Git LFS pointer file provide metadata to locate the actual file con - **Pointer size**: The size of the pointer file stored in the Git repository. - **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. -A Xet pointer includes all of this information (by design; refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs)) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. +A Xet pointer includes all of this information by design. 
It also adds a `Xet backed hash` field for referencing the file in Xet storage. Refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs) for details.
@@ -91,11 +91,11 @@ While Xet brings fine-grained deduplication and enhanced performance to Git-base ### Deduplication -Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-enabled client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. +Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a [content-addressed store (CAS)](#content-addressed-store-cas), keyed by its hash. -The Hub's [current recommendation is to limit files to 20GB](https://huggingface.co/docs/hub/storage-limits#recommendations). At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. 
Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. +The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 20GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. @@ -103,7 +103,7 @@ For more details, refer to the [From Files to Chunks](https://huggingface.co/blo Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format, with only the addition of the `Xet backed hash` field. Meaning, existing repos and newly created repos will not look any different if you do a `bare clone` of them. 
Each of the large files (or binary files) will continue to have a pointer file and matches the Git LFS pointer file specification. -This symmetry allows non-Xet-enabled clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. +This symmetry allows non-Xet-aware clients (e.g., older versions of the `huggingface_hub` that are not Xet-aware) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. As noted in the section describing the CAS APIs, the Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services (Git LFS or the Git LFS bridge) to provide the proper URL to S3, regardless of which storage system holds the content. While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version. 
diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 969711b85..85107c2b1 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -16,4 +16,4 @@ In these pages, you will go over the basics of getting started with Git and inte - [Repository storage limits](./storage-limits) - [Next Steps](./repositories-next-steps) - [Licenses](./repositories-licenses) -- [Storage](./repositories-storage) +- [Storage Backends](./storage-backends) From 307e5b72bf8a2627b57b6abc98be7fbdb9eb391b Mon Sep 17 00:00:00 2001 From: Jared Sulzdorf Date: Fri, 14 Mar 2025 07:01:45 -0700 Subject: [PATCH 19/25] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Julien Chaumond Co-authored-by: Célina --- docs/hub/_toctree.yml | 2 +- docs/hub/repositories-storage.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index e1e67276c..02ff3d6b6 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -25,7 +25,7 @@ title: "How-to: Create automatic metadata quality reports" - local: notebooks title: Notebooks - - local: repositories-storage + - local: storage-backends title: Storage Backends - local: storage-limits title: Storage Limits diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index bd19f1c2b..70c2c6f73 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -50,7 +50,7 @@ To make Xet the default for all your repositories, [join the waitlist](https://h -To access a Xet-enabled client, add the `hf_xet` Python package when installing `huggingface_hub`: +To access a Xet-aware client, add the `hf_xet` Python package when installing `huggingface_hub`: ```bash pip install huggingface_hub[hf_xet] From aee5d69eb82834e2f98c3a869398733c5d6860f0 Mon Sep 17 00:00:00 2001 From: Jared Sulzdorf Date: Fri, 14 Mar 2025 07:03:03 -0700 Subject: [PATCH 
20/25] Update docs/hub/repositories-storage.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Célina --- docs/hub/repositories-storage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/repositories-storage.md b/docs/hub/repositories-storage.md index 70c2c6f73..f6db6d1af 100644 --- a/docs/hub/repositories-storage.md +++ b/docs/hub/repositories-storage.md @@ -41,7 +41,7 @@ Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories d ### Using Xet Storage -To start using Xet Storage, you need a Xet-enabled repository and client. +To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. From 248bfb9a200c8cb307a84b7e367197bfa1a3cfd8 Mon Sep 17 00:00:00 2001 From: Celina Hanouti Date: Fri, 14 Mar 2025 15:07:06 +0100 Subject: [PATCH 21/25] rename file --- docs/hub/{repositories-storage.md => storage-backends.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/hub/{repositories-storage.md => storage-backends.md} (100%) diff --git a/docs/hub/repositories-storage.md b/docs/hub/storage-backends.md similarity index 100% rename from docs/hub/repositories-storage.md rename to docs/hub/storage-backends.md From 1e2e36d9eeb440d31bed4e56d7d840e3514f8254 Mon Sep 17 00:00:00 2001 From: jsulz Date: Fri, 14 Mar 2025 08:42:41 -0700 Subject: [PATCH 22/25] align repositories index page with toctree --- docs/hub/repositories.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 85107c2b1..9e602d974 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -13,7 +13,7 @@ In these pages, you will go over the basics of getting started with Git and inte - [Webhooks](./webhooks) - [Notifications](./notifications) - [Collections](./collections) -- [Repository storage 
limits](./storage-limits) +- [Storage Backends](./storage-backends) +- [Storage Limits](./storage-limits) - [Next Steps](./repositories-next-steps) - [Licenses](./repositories-licenses) -- [Storage Backends](./storage-backends) From e0ce581081d4553e911bf19dddf745654b6f3aa7 Mon Sep 17 00:00:00 2001 From: Jared Sulzdorf Date: Fri, 14 Mar 2025 08:45:10 -0700 Subject: [PATCH 23/25] Apply suggestions from code review Co-authored-by: Lucain Co-authored-by: Julien Chaumond --- docs/hub/storage-backends.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md index f6db6d1af..4abf2109b 100644 --- a/docs/hub/storage-backends.md +++ b/docs/hub/storage-backends.md @@ -53,10 +53,10 @@ To make Xet the default for all your repositories, [join the waitlist](https://h To access a Xet-aware client, add the `hf_xet` Python package when installing `huggingface_hub`: ```bash -pip install huggingface_hub[hf_xet] +pip install -U huggingface_hub[hf_xet] ``` -If you use the `transformers` or `datasets` libraries instead of making requests through `huggingface_hub` then simply install `hf_xet` directly: +If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub` so you can simply install `hf_xet` in the same env: ```bash pip install hf-xet @@ -76,7 +76,7 @@ To see more detailed usage docs, refer to the `huggingface_hub` docs for: Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: - **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. -- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. 
Only changed chunks upload, so frequent commits are both fast and storage-efficient. +- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient. - **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. From bc6d50eb9e4810722b78313936fbe62b4279c6c6 Mon Sep 17 00:00:00 2001 From: Yucheng Low <2740522+ylow@users.noreply.github.com> Date: Fri, 14 Mar 2025 11:16:53 -0700 Subject: [PATCH 24/25] Added a brief paragraph about security --- docs/hub/storage-backends.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md index 4abf2109b..4ce88fec3 100644 --- a/docs/hub/storage-backends.md +++ b/docs/hub/storage-backends.md @@ -108,6 +108,7 @@ This symmetry allows non-Xet-aware clients (e.g., older versions of the `hugging While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed locally, a legacy client will get a S3 URL from the Git LFS bridge. Meanwhile, while uploading an update to a Xet-backed file, a Xet-aware client will run CDC deduplication and upload through CAS while a non-Xet-aware client will upload through Git LFS and a background process will convert the file revision to a Xet-backed version. ### Security Model +Xet storage provides data deduplication over all chunks stored in Hugging Face. 
This is done via cryptographic hashing in a privacy-sensitive way. The contents of chunks are protected and are associated with repository permissions, i.e., you can only read chunks that are required to reproduce files you have access to, and no more. See [xet-core](https://github.com/huggingface/xet-core) for details. ## Legacy Storage: Git LFS From 3aeae1e900f05949d1ceffcb5cc757d8751f407c Mon Sep 17 00:00:00 2001 From: jsulz Date: Fri, 14 Mar 2025 11:19:47 -0700 Subject: [PATCH 25/25] updated xet cache link --- docs/hub/storage-backends.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md index 4ce88fec3..aeab92828 100644 --- a/docs/hub/storage-backends.md +++ b/docs/hub/storage-backends.md @@ -69,7 +69,7 @@ That's it! You now get the benefits of Xet deduplication for both uploads and do To see more detailed usage docs, refer to the `huggingface_hub` docs for: - [Upload](https://huggingface.co/docs/huggingface_hub/guides/upload#faster-uploads-with-hf_xet) - [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hf_xet) -- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#xet-cache) +- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet) #### Recommendations