From e137ed2792ba3b4b169b8c7c2f9b33e2e5d08072 Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Fri, 19 Sep 2025 23:11:46 +0200 Subject: [PATCH 1/5] Start promoting Xet more across the doc (not just in a single doc page) --- docs/hub/_toctree.yml | 8 ++++---- .../enterprise-hub-gating-group-collections.md | 2 +- docs/hub/index.md | 3 ++- docs/hub/repositories.md | 17 ++++++++++++----- docs/hub/storage-backends.md | 4 ++-- 5 files changed, 21 insertions(+), 13 deletions(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 437aa6184..1f1042f23 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -43,6 +43,10 @@ title: Getting Started with Repositories - local: repositories-settings title: Repository Settings + - local: storage-limits + title: Storage Limits + - local: storage-backends + title: Storage Backends - local: repositories-pull-requests-discussions title: Pull Requests & Discussions - local: notifications @@ -60,10 +64,6 @@ title: "How-to: Create automatic metadata quality reports" - local: notebooks title: Notebooks - - local: storage-limits - title: Storage Limits - - local: storage-backends - title: Storage Backends - local: repositories-next-steps title: Next Steps - local: repositories-licenses diff --git a/docs/hub/enterprise-hub-gating-group-collections.md b/docs/hub/enterprise-hub-gating-group-collections.md index 55fc036da..1e1a1769c 100644 --- a/docs/hub/enterprise-hub-gating-group-collections.md +++ b/docs/hub/enterprise-hub-gating-group-collections.md @@ -9,7 +9,7 @@ Gating Group Collections allow organizations to grant (or reject) access to all To enable Gating Group in a collection: - the collection owner must be an organization -- the organization must be subscribed to the Enterprise Hub +- the organization must be subscribed to a Team or Enterprise plan - all models and datasets in the collection must be owned by the same organization as the collection - each model or dataset in the collection may only belong
to one Gating Group Collection (but they can still be included in non-gating i.e. _regular_ collections). diff --git a/docs/hub/index.md b/docs/hub/index.md index f03a59a33..e2204491a 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -25,11 +25,12 @@ The Hugging Face Hub is a platform with over 1.7M models, 400k datasets, and 600 Introduction Getting Started Repository Settings +Storage Limits +Storage Backends Pull requests and Discussions Notifications Collections Webhooks -Storage Backends Next Steps Licenses diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 9e602d974..01e3f8fa3 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -2,18 +2,25 @@ Models, Spaces, and Datasets are hosted on the Hugging Face Hub as [Git repositories](https://git-scm.com/about), which means that version control and collaboration are core elements of the Hub. In a nutshell, a repository (also known as a **repo**) is a place where code and assets can be stored to back up your work, share it with the community, and work in a team. -In these pages, you will go over the basics of getting started with Git and interacting with repositories on the Hub. Once you get the hang of it, you can explore the best practices and next steps that we've compiled for effective repository usage. +Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files – large binary files, usually in specific file formats like Parquet and Safetensors, and up to Terabyte-scale sizes! +To achieve this, we built [Xet](./storage-backends), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads. + +
+ +
+ +In these pages, you will go over the basics of getting started with Git and Xet and interacting with repositories on the Hub. Once you get the hang of it, you can explore the best practices and next steps that we've compiled for effective repository usage. ## Contents - [Getting Started with Repositories](./repositories-getting-started) - [Settings](./repositories-settings) +- [Storage Limits](./storage-limits) +- [Storage Backends](./storage-backends) - [Pull Requests & Discussions](./repositories-pull-requests-discussions) - [Pull Requests advanced usage](./repositories-pull-requests-discussions#pull-requests-advanced-usage) -- [Webhooks](./webhooks) -- [Notifications](./notifications) - [Collections](./collections) -- [Storage Backends](./storage-backends) -- [Storage Limits](./storage-limits) +- [Notifications](./notifications) +- [Webhooks](./webhooks) - [Next Steps](./repositories-next-steps) - [Licenses](./repositories-licenses) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md index 12687b539..6a75cdcfb 100644 --- a/docs/hub/storage-backends.md +++ b/docs/hub/storage-backends.md @@ -7,11 +7,11 @@ Repositories on the Hugging Face Hub are different from those on software develo While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. -Storing these files directly in a Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. +Storing these files directly in a pure Git repository is impractical. 
Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need. Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. -Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported and widely used (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. +Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. ## Xet From 71153aa9484770004c590c04ba1f538c5f0305fd Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Fri, 19 Sep 2025 23:16:20 +0200 Subject: [PATCH 2/5] here too? --- docs/hub/index.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/hub/index.md b/docs/hub/index.md index e2204491a..812cfb1fe 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -120,7 +120,10 @@ On it, you'll be able to upload and discover... 
- Datasets: _featuring a wide variety of data for different domains and modalities_ - Spaces: _interactive apps for demonstrating ML models directly in your browser_ -The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**! You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories). +The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**! +All repositories build on [Xet](https://huggingface.co/join/xet), a new technology to efficiently store Large Files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads. + +You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories). ## Models From 8c29794ce4f879102d1b4ee20d9e6d817e817bb8 Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Sat, 20 Sep 2025 11:58:44 +0200 Subject: [PATCH 3/5] Update docs/hub/repositories.md --- docs/hub/repositories.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 01e3f8fa3..431f56498 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -2,7 +2,7 @@ Models, Spaces, and Datasets are hosted on the Hugging Face Hub as [Git repositories](https://git-scm.com/about), which means that version control and collaboration are core elements of the Hub. In a nutshell, a repository (also known as a **repo**) is a place where code and assets can be stored to back up your work, share it with the community, and work in a team. -Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files – large binary files, usually in specific file formats like Parquet and Safetensors, and up to Terabyte-scale sizes! 
+Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files – large binary files, usually in specific file formats like Parquet and Safetensors, and up to [Terabyte-scale sizes](https://huggingface.co/blog/from-files-to-chunks)! To achieve this, we built [Xet](./storage-backends), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads.
From 67cff2c5da68fd4a9909f6cf765fab6bf601fd6e Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Sat, 20 Sep 2025 12:06:04 +0200 Subject: [PATCH 4/5] Update docs/hub/storage-backends.md Co-authored-by: Pedro Cuenca --- docs/hub/storage-backends.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/storage-backends.md b/docs/hub/storage-backends.md index 6a75cdcfb..1075ecd8c 100644 --- a/docs/hub/storage-backends.md +++ b/docs/hub/storage-backends.md @@ -11,7 +11,7 @@ Storing these files directly in a pure Git repository is impractical. Not only a Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. -Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub is introducing a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. +Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see the [Legacy section below](#legacy-storage-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. 
## Xet From ef21fd3c4f3d1f1243762bf0b71302433e5e59dc Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Sat, 20 Sep 2025 12:13:05 +0200 Subject: [PATCH 5/5] dark --- docs/hub/repositories.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/hub/repositories.md b/docs/hub/repositories.md index 431f56498..8a9e7dbfd 100644 --- a/docs/hub/repositories.md +++ b/docs/hub/repositories.md @@ -7,6 +7,7 @@ To achieve this, we built [Xet](./storage-backends), a modern custom storage sys
+
In these pages, you will go over the basics of getting started with Git and Xet and interacting with repositories on the Hub. Once you get the hang of it, you can explore the best practices and next steps that we've compiled for effective repository usage.