diff --git a/docs/hub/_redirects.yml b/docs/hub/_redirects.yml index 0831a66db..b0225497a 100644 --- a/docs/hub/_redirects.yml +++ b/docs/hub/_redirects.yml @@ -18,4 +18,5 @@ api-webhook: webhooks adapter-transformers: adapters security-two-fa: security-2fa repositories-recommendations: storage-limits -xet: storage-backends#xet +xet: xet/index +storage-backends: xet/index diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 7b76ba22b..c1505db07 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -47,8 +47,19 @@ title: Repository Settings - local: storage-limits title: Storage Limits - - local: storage-backends - title: Storage Backends + - local: xet/index + title: Storage Backend (Xet) + sections: + - local: xet/overview + title: Xet History & Overview + - local: xet/using-xet-storage + title: Using Xet Storage + - local: xet/security + title: Security + - local: xet/legacy-git-lfs + title: Backwards Compatibility & Legacy + - local: xet/deduplication + title: Deduplication - local: repositories-pull-requests-discussions title: Pull Requests & Discussions - local: notifications diff --git a/docs/hub/index.md b/docs/hub/index.md index 46eea6891..ef5f54cb7 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -27,7 +27,7 @@ The Hugging Face Hub is a platform with over 2M models, 500k datasets, and 1M de Getting Started Repository Settings Storage Limits -Storage Backends +Storage Backend (Xet) Pull requests and Discussions Notifications Collections @@ -122,7 +122,7 @@ On it, you'll be able to upload and discover... - Spaces: _interactive apps for demonstrating ML models directly in your browser_ The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**! -All repositories build on [Xet](https://huggingface.co/join/xet), a new technology to efficiently store Large Files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads. 
+All repositories build on [Xet](./xet/index), a new technology to efficiently store Large Files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads. You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories). diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 8a9e7dbfd..013492557 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -3,7 +3,7 @@ Models, Spaces, and Datasets are hosted on the Hugging Face Hub as [Git repositories](https://git-scm.com/about), which means that version control and collaboration are core elements of the Hub. In a nutshell, a repository (also known as a **repo**) is a place where code and assets can be stored to back up your work, share it with the community, and work in a team. Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files – large binary files, usually in specific file formats like Parquet and Safetensors, and up to [Terabyte-scale sizes](https://huggingface.co/blog/from-files-to-chunks)! -To achieve this, we built [Xet](./storage-backends), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads. +To achieve this, we built [Xet](./xet/index), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads.
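The "splitting files into unique chunks" idea above can be sketched in a few lines. The following is an illustrative toy, not Xet's actual algorithm (the real client uses a windowed rolling hash with ~64KB target chunks, per the Xet protocol): chunk boundaries are derived from a hash of the content itself, so unchanged regions of a file keep producing the same chunks and never need to be re-uploaded.

```python
import hashlib

AVG_MASK = (1 << 8) - 1   # ~256-byte average chunks for the toy; Xet targets ~64KB
MIN_SIZE = 64             # minimum chunk size, to avoid degenerate cuts

def chunk(data: bytes) -> list[bytes]:
    """Split data into content-defined chunks: cut whenever a rolling-style
    hash of the bytes since the last cut matches a fixed bit pattern."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF
        if i - start + 1 >= MIN_SIZE and (h & AVG_MASK) == AVG_MASK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

def upload(store: dict, data: bytes) -> int:
    """'Upload' a file, sending only chunks the store has not seen;
    returns the number of bytes actually transferred."""
    sent = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()  # content-addressed key
        if key not in store:
            store[key] = c
            sent += len(c)
    return sent
```

With this scheme, re-uploading an identical file transfers zero bytes, and appending to a file re-sends only the chunks near the end, which is the behavior that makes incremental model and dataset updates cheap.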
@@ -17,7 +17,7 @@ In these pages, you will go over the basics of getting started with Git and Xet - [Getting Started with Repositories](./repositories-getting-started) - [Settings](./repositories-settings) - [Storage Limits](./storage-limits) -- [Storage Backends](./storage-backends) +- [Storage Backend (Xet)](./xet/index) - [Pull Requests & Discussions](./repositories-pull-requests-discussions) - [Pull Requests advanced usage](./repositories-pull-requests-discussions#pull-requests-advanced-usage) - [Collections](./collections) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md deleted file mode 100644 index cf09be009..000000000 --- a/docs/hub/storage-backends.md +++ /dev/null @@ -1,122 +0,0 @@ -# Storage - -Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are: - -- Large - model or dataset files are in the range of GB and above. We have a few TB-scale files! -- Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) - -While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. - -Storing these files directly in a pure Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. 
- -Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. - -Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. - -## Xet - -[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. - -Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. - -
- - -
- -A Git LFS pointer file provides metadata to locate the actual file contents in remote storage: - -- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. -- **Pointer size**: The size of the pointer file stored in the Git repository. -- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. - -A Xet pointer includes all of this information by design. Refer to the section on [backwards compatibility with Git LFS](#backward-compatibility-with-lfs) with the addition of a `Xet backed hash` field for referencing the file in Xet storage. - -
- - -
- -Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to the [Deduplication](#deduplication) section below. - -### Using Xet Storage - -To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users). - -> [!TIP] -> For user and organization profiles created before May 23rd, 2025, you can make Xet the default for all your repositories by [signing up here](https://huggingface.co/join/xet). You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all existing repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default. -> -> PRO users and Team or Enterprise organizations will be fast-tracked for access. - -To access a Xet-aware version of the `huggingface_hub`, simply install the latest version: - -```bash -pip install -U huggingface_hub -``` - -As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend. - -If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub`. 
So long as the version of `huggingface_hub` >= 0.32.0, no further action needs to be taken. - -Where versions of `huggingface_hub` >= 0.30.0 and < 0.32.0 are installed, `hf_xet` must be installed explicitly: - -```bash -pip install -U hf-xet -``` - -And that's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using a version of `huggingface_hub` < 0.30.0 will still be able to upload and download repositories through the [backwards compatibility provided by the LFS bridge](#backward-compatibility-with-lfs). - -To see more detailed usage docs, refer to the `huggingface_hub` docs for: - -- [Upload](https://huggingface.co/docs/huggingface_hub/guides/upload#faster-uploads) -- [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hfxet) -- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet) - -#### Recommendations - -Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: - -- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. -- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads. -- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient. 
-- **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. -- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. - -#### Current Limitations - -While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: - -- **64-bit systems only**: The `hf_xet` client currently requires a 64-bit architecture; 32-bit systems are not supported. -- **Git client integration (git-xet)**: Planned but remains under development. - -### Deduplication - -Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate on the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking, everything else is discarded. - -To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a content-addressed store (CAS), keyed by its hash. 
- -The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 20GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. - -For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. - -### Backward Compatibility with LFS - -Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format; the addition of the `Xet backed hash` is only added to the web interface as a convenience. Practically, this means existing repos and newly created repos will not look any different if you do a `bare clone` of them. Each of the large files (or binary files) will continue to have a pointer file that matches the Git LFS pointer file specification. - -This symmetry allows non-Xet-aware clients (e.g., older versions of the `huggingface_hub`) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet backed files are supported. 
The Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services to request the proper URL(s) from S3, regardless of which storage system holds the content. - -Within the Xet architecture, backward compatibility for downloads is achieved by the Git LFS bridge. While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file, a legacy client will get a single URL from the bridge which does the work of reconstructing the request file and returning the URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`. - -Meanwhile, uploads from non‑Xet‑aware clients still follow the standard Git LFS path, even if the file is already Xet-backed. Once the file is uploaded to LFS, a background process automatically migrates the content, turning it into a Xet-backed revision. Coupled with the Git LFS bridge, this lets repository maintainers and the rest of the Hub adopt Xet at their own pace without disruption. - -### Security Model - -Xet storage provides data deduplication over all chunks stored in Hugging Face. This is done via cryptographic hashing in a privacy sensitive way. The contents of chunks are protected and are associated with repository permissions. i.e. you can only read chunks which are required to reproduce files you have access to, and no more. See [xet-core](https://github.com/huggingface/xet-core) for details. - -## Legacy Storage: Git LFS - -The legacy storage system on the Hub, Git LFS utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA hash to name the file for future access. 
This storage architecture is relatively simple and has allowed Hub to store millions of models, datasets, and spaces repositories' files (45PB total as of this writing). - -The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large of small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). - -This leads to a worse developer experience along with a proliferation of additional storage. diff --git a/docs/hub/xet/deduplication.md b/docs/hub/xet/deduplication.md new file mode 100644 index 000000000..0d2855e5e --- /dev/null +++ b/docs/hub/xet/deduplication.md @@ -0,0 +1,10 @@ +## Deduplication + +Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking; everything else is discarded. + +To avoid the overhead of communicating and managing at the level of chunks, new chunks are grouped together in [64MB blocks](https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation) and uploaded. Each block is stored once in a content-addressed store (CAS), keyed by its hash. + +The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 20GB.
At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times. + +For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to being acquired by Hugging Face. + diff --git a/docs/hub/xet/index.md b/docs/hub/xet/index.md new file mode 100644 index 000000000..3f53caff3 --- /dev/null +++ b/docs/hub/xet/index.md @@ -0,0 +1,29 @@ +# Xet: our Storage Backend + +Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are: + +- Large - model or dataset files are in the range of GB and above. We have a few TB-scale files! +- Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) + +While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. + +Storing these files directly in a pure Git repository is impractical. 
Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. + +Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail in the pages that follow), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. + +Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see [Backwards Compatibility & Legacy](./legacy-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. + +### Open Source Xet Protocol + +If you want to understand the underlying Xet protocol, or to build a new client library for accessing Xet Storage, check out the [Xet Protocol Specification](https://huggingface.co/docs/xet/index). + +The pages below will get you started with Xet Storage. + +## Contents + +- [Xet History & Overview](./overview) +- [Using Xet Storage](./using-xet-storage) +- [Security](./security) +- [Backwards Compatibility & Legacy](./legacy-git-lfs) +- [Deduplication](./deduplication) + diff --git a/docs/hub/xet/legacy-git-lfs.md b/docs/hub/xet/legacy-git-lfs.md new file mode 100644 index 000000000..10c2b5e3f --- /dev/null +++ b/docs/hub/xet/legacy-git-lfs.md @@ -0,0 +1,15 @@ +## Backward Compatibility with LFS + +Uploads from legacy, non-Xet-aware clients still follow the standard Git LFS path, even if the repo is already Xet-backed.
Once the file is uploaded to LFS, a background process automatically migrates the content to Xet storage. The Xet architecture provides backwards compatibility for legacy clients downloading files from Xet-backed repos by offering a Git LFS bridge. While a Xet-aware client will receive file reconstruction information from CAS to download the Xet-backed file, a legacy client will get a single URL from the bridge, which does the work of reconstructing the requested file and returning the URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`. Because LFS uploads migrate automatically and older clients can continue downloading from Xet-backed repositories, maintainers and the rest of the Hub can update their pipelines at their own pace. + +Xet storage provides a seamless transition for existing Hub repositories. It isn't necessary to know if the Xet backend is involved at all. Xet-backed repositories continue to use the Git LFS pointer file format; the `Xet backed hash` field is only added in the web interface as a convenience. Practically, this means existing repos and newly created repos will not look any different if you make a bare clone of them. Each of the large files (or binary files) will continue to have a pointer file that matches the Git LFS pointer file specification. + +This symmetry allows non-Xet-aware clients (e.g., older versions of `huggingface_hub`) to interact with Xet-backed repositories without concern. In fact, a mixture of Git LFS and Xet-backed files is supported within a single repository. The Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services to request the proper URL(s) from S3, regardless of which storage system holds the content. + +## Legacy Storage: Git LFS + +The Hub's legacy storage system, Git LFS, utilizes many of the same conventions as Xet-backed repositories.
The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3 using the SHA256 hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store millions of Models, Datasets, and Spaces repositories' files. + +The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned, incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). + +This leads to a worse developer experience along with a proliferation of additional storage. diff --git a/docs/hub/xet/overview.md b/docs/hub/xet/overview.md new file mode 100644 index 000000000..e00b759f7 --- /dev/null +++ b/docs/hub/xet/overview.md @@ -0,0 +1,25 @@ +## Xet History & Overview + +[In August 2024, Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. + +Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. + +
+ + +
+ +A Git LFS pointer file provides metadata to locate the actual file contents in remote storage: + +- **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the file’s contents. +- **Pointer size**: The size of the pointer file stored in the Git repository. +- **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. + +A Xet pointer includes all of this information by design, with the addition of a `Xet backed hash` field for referencing the file in Xet storage; refer to the section on [backwards compatibility with Git LFS](legacy-git-lfs#backward-compatibility-with-lfs) for details. + +
+ + +
+ +Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators. To learn more about deduplication in Xet storage, refer to [Deduplication](deduplication). diff --git a/docs/hub/xet/security.md b/docs/hub/xet/security.md new file mode 100644 index 000000000..3c623ab41 --- /dev/null +++ b/docs/hub/xet/security.md @@ -0,0 +1,5 @@ +## Security Model + +Xet storage provides data deduplication over all chunks stored in Hugging Face. This is done via cryptographic hashing in a privacy sensitive way. The contents of chunks are protected and are associated with repository permissions, i.e. you can only read chunks which are required to reproduce files you have access to, and no more. + +More information and details on how deduplication is done in a privacy-preserving way are described in the [Xet Protocol Specification](https://huggingface.co/docs/xet/deduplication). diff --git a/docs/hub/xet/using-xet-storage.md b/docs/hub/xet/using-xet-storage.md new file mode 100644 index 000000000..e6828691c --- /dev/null +++ b/docs/hub/xet/using-xet-storage.md @@ -0,0 +1,42 @@ +## Using Xet Storage + +To access a Xet-aware version of the `huggingface_hub`, simply install the latest version: + +```bash +pip install -U huggingface_hub +``` + +As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend. + +If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub`. 
As long as `huggingface_hub` is at version 0.32.0 or later, no further action is needed. + +If a version of `huggingface_hub` >= 0.30.0 and < 0.32.0 is installed, install `hf_xet` explicitly: + +```bash +pip install -U hf-xet +``` + +And that's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using a version of `huggingface_hub` < 0.30.0 will still be able to upload and download repositories through the [backwards compatibility provided by the LFS bridge](legacy-git-lfs#backward-compatibility-with-lfs). + +To see more detailed usage docs, refer to the `huggingface_hub` docs for: + +- [Upload](https://huggingface.co/docs/huggingface_hub/guides/upload#faster-uploads) +- [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hfxet) +- [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet) + +### Recommendations + +Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: + +- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. +- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power, read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads. +- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
+- **Be specific in `.gitattributes`**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. +- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. + +### Current Limitations + +While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: + +- **64-bit systems only**: The `hf_xet` client currently requires a 64-bit architecture; 32-bit systems are not supported. +- **Git client integration (git-xet)**: Under active development - coming soon, stay tuned! diff --git a/docs/xet/index.md b/docs/xet/index.md index ab27c53ad..20bd70fc7 100644 --- a/docs/xet/index.md +++ b/docs/xet/index.md @@ -1,7 +1,7 @@ # Xet Protocol Specification > [!NOTE] -> Version 0.1.0 (1.0.0 on release) +> Version 1.0.0 > The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119](https://www.ietf.org/rfc/rfc2119.txt) [RFC8174](https://www.ietf.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here.
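As a closing illustration of the pointer files described in the Xet History & Overview page: a Git LFS pointer is a tiny text file with a fixed key-value layout (`version`, `oid sha256:...`, `size`). The helpers below are hypothetical, not part of `huggingface_hub`; they simply show why the Git repository itself stays small: the pointer's size is constant no matter how large the tracked file is.

```python
import hashlib

def make_lfs_pointer(content: bytes) -> str:
    """Build the Git LFS pointer text that would track `content` in remote storage."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{hashlib.sha256(content).hexdigest()}\n"
        f"size {len(content)}\n"
    )

def parse_lfs_pointer(text: str) -> dict:
    """Parse a pointer back into its metadata fields (hash algorithm, oid, size)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"hash_algo": algo, "oid": digest, "size": int(fields["size"])}
```

A Xet-backed repository keeps this same pointer layout, which is exactly what lets older, LFS-only clients keep working against Xet storage.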