From 5ef190a7c6cabd0f586e6946d1baffe3818c7e30 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Mon, 3 Mar 2025 16:46:38 -0800 Subject: [PATCH 01/16] Xet docs --- docs/source/en/guides/download.md | 20 +++++ docs/source/en/guides/manage-cache.md | 89 +++++++++++++++++++ docs/source/en/guides/upload.md | 17 ++++ .../environment_variables.md | 8 ++ 4 files changed, 134 insertions(+) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index 254c72d165..5720acd0f0 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -166,6 +166,26 @@ For more details about the CLI download command, please refer to the [CLI guide] ## Faster downloads +There are two options to speed up downloads. Both involve installing a Python package written in Rust. + +* `hf_xet` is newer and uses the Xet storage backend for upload/download. It is available in production, but is in the process of being rolled out to all users. +* `hf_transfer` is a power-tool with some limitations, but is thoroughly tested and has been in production for a long time. + +### hf_xet + +Take advantage of faster downloads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables +chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. + +To enable it, specify the `hf_xet` package when installing `huggingface_hub`: + +```bash +pip install huggingface_hub[hf_xet] +``` + +All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). 
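As a quick check after installation, the sketch below reports whether the `hf_xet` package can be found in the current environment and shows that the download call itself stays the same. It is a minimal sketch, not part of `huggingface_hub`: the helper names `hf_xet_available` and `download_config` are hypothetical, and it assumes the importable module is named `hf_xet`, like the package.

```python
import importlib.util


def hf_xet_available() -> bool:
    """Return True if a module named `hf_xet` can be found in this environment.

    Assumption: the import name matches the package name `hf_xet`.
    """
    return importlib.util.find_spec("hf_xet") is not None


def download_config(repo_id: str) -> str:
    """Download `config.json` from a repo and return its local path.

    The call is identical whether or not `hf_xet` is installed.
    """
    # Imported lazily so the availability check above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename="config.json")


if __name__ == "__main__":
    print("hf_xet installed:", hf_xet_available())
```

Calling `download_config` requires network access and an existing repo, so it is left out of the `__main__` block.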
+ +### hf_transfer + If you are running on a machine with high bandwidth, you can increase your download speed with [`hf_transfer`](https://github.com/huggingface/hf_transfer), a Rust-based library developed to speed up file transfers with the Hub. diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index 521a50b21f..fc6447527a 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -541,3 +541,92 @@ Will free 8.6G >>> delete_strategy.execute() Cache deletion done. Saved 8.6G. ``` + +## Xet Cache + +To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingaface_hub` cache, creating additional caching layer to enable chunk-based deduplication. + +The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure + +```bash + +├─ chunk_cache +├─ shard_cache +``` + +The `xet` cache, like the rest of `hf_xet` is fully integrated with the `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system. + +The `chunk-cache` directory contains cached data chunks that are used to speed up downloads while the `shard-cache` directory contains cached shards that are utilized on the upload path. + +### `chunk_cache` + +This cache is used on the download path. The cache directory structure is based on a base-64 encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to lookup the offsets of where the data is stored. + +At the topmost level, the first two letters of the base 64 encoded CAS hash are used to create a subdirectory in the `chunk_cache` (keys that share these first two letters are grouped here). 
The inner levels are comprised of subdirectories with the full key as the directory name. At the base are the cache items which are ranges of blocks that contain the cached chunks. + +```bash + +├─ xet +│ ├─ chunk_cache +│ │ ├─ A1 +│ │ │ ├─ A1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 +│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 +│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB + +``` + +When requesting a file, the first thing `hf_xet` does is communicate with Xet storage’s content addressed store (CAS) for reconstruction information. The reconstruction information contains information about the CAS keys required to download the file in its entirety. + +Before executing the requests for the CAS keys, the `chunk_cache` is consulted. If a key in the cache matches a CAS key, then there is no reason to issue a request for that content. `hf_xet` uses the chunks stored in the directory instead. + +As the `chunk_cache` is purely an optimization, not a guarantee, `hf_xet` utilizes a computationally efficient eviction policy. When the `chunk_cache` is full (see `Limits and Limitations` below), `hf_xet` implements a random eviction policy when selecting an eviction candidate. This significantly reduces the overhead of managing a robust caching system (e.g., LRU) while still providing most of the benefits of caching chunks. + +### `shard_cache` + +This cache is used when uploading content to the Hub. The directory is flat, comprising only of shard files, each using an ID for the shard name. 
+ +```sh + +├─ xet +│ ├─ shard_cache +│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb +│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb +│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb +│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb +``` + +The `shard_cache` contains shards that are: + +- Locally generated and successfully uploaded to the CAS +- Downloaded from CAS as part of the global deduplication algorithm + +Shards provide a mapping between files and chunks. During uploads, each file is chunked and the hash of the chunk is saved. Every shard in the cache is then consulted. If a shard contains a chunk hash that is present in the local file being uploaded, then that chunk can be discarded as it is already stored in CAS. + +All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration. + +### Limits and Limitations + +The `chunk_cache` is limited to 10GB in size while the `shard_cache` is technically without limits (in practice, the size and use of shards are such that limiting the cache is unnecessary). + +By design, both caches are without high-level APIs. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache). 
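To see how much space each of the two caches currently uses, a small script can walk the `xet` directory. This is a minimal sketch rather than a `huggingface_hub` API: the helper names are hypothetical, the `HF_XET_CACHE` and `HF_HOME` lookups follow the documented defaults in simplified form, and the 10GB constant assumes binary gigabytes.

```python
import os
from pathlib import Path

# Assumption: "10GB" is interpreted as 10 * 1024**3 bytes for illustration only.
CHUNK_CACHE_LIMIT_BYTES = 10 * 1024**3


def xet_cache_dir() -> Path:
    """Resolve the xet cache location: HF_XET_CACHE, else $HF_HOME/xet, else the default."""
    explicit = os.environ.get("HF_XET_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "xet"


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below `root` (0 if it does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # A file may disappear while we scan (for example, after eviction).
                pass
    return total


def xet_cache_usage(root: Path) -> dict:
    """Return the size in bytes of `chunk_cache` and `shard_cache` under `root`."""
    return {name: dir_size_bytes(root / name) for name in ("chunk_cache", "shard_cache")}


if __name__ == "__main__":
    usage = xet_cache_usage(xet_cache_dir())
    for name, size in usage.items():
        print(f"{name}: {size / 1024**2:.1f} MiB")
    print(f"chunk_cache limit: {CHUNK_CACHE_LIMIT_BYTES / 1024**3:.0f} GiB")
```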
+ +If you need to reclaim the space utilized by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf ~//xet` where `` is the location of your Hugging Face cache, typically `~/.cache/huggingface` + +Example full `xet`cache directory tree: + +```sh + +├─ xet +│ ├─ chunk_cache +│ │ ├─ LN +│ │ │ ├─ L1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 +│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 +│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB +│ ├─ shard_cache +│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb +│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb +│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb +│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb +``` + +To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/repositories-storage). diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index b7d254a4c1..223e379ecc 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -182,6 +182,23 @@ Check out our [Repository limitations and recommendations](https://huggingface.c In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub. However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them! +### Faster uploads with hf_xet + +Take advantage of faster uploads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables chunk-based deduplication for faster uploads and downloads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. 
+ + + +Xet storage is being rolled out to Hugging Face Hub users at this time, so xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. + + + +To enable it, specify the `hf_xet` extra when installing `huggingface_hub`: + +```bash +pip install huggingface_hub[hf_xet] +``` + +All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). ### Non-blocking uploads diff --git a/docs/source/en/package_reference/environment_variables.md b/docs/source/en/package_reference/environment_variables.md index 4ab3cc3cd6..8f88d25bae 100644 --- a/docs/source/en/package_reference/environment_variables.md +++ b/docs/source/en/package_reference/environment_variables.md @@ -158,6 +158,14 @@ To use `hf_transfer`: Please note that using `hf_transfer` comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, `hf_transfer` lacks several user-friendly features such as resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, `hf_transfer` is not enabled by default in `huggingface_hub`. + + +`hf_xet` is an alternative to `hf_transfer`. It provides efficient file transfers through a chunk-based deduplication strategy, custom Xet storage (replacing Git LFS), and a seamless integration with `huggingface_hub`. + +[Read more about the package](https://huggingface.co/docs/hub/repositories-storage) and enable with `pip install huggingface_hub[hf_xet]`. + + + ## Deprecated environment variables In order to standardize all environment variables within the Hugging Face ecosystem, some variables have been marked as deprecated. Although they remain functional, they no longer take precedence over their replacements. 
The following table outlines the deprecated variables and their corresponding alternatives: From 324c52285057a9cee6055a4c4f510ee0b25416ec Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Thu, 13 Mar 2025 16:38:37 +0100 Subject: [PATCH 02/16] Update docs/source/en/guides/download.md --- docs/source/en/guides/download.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index 5720acd0f0..a57cf3bc55 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -169,7 +169,7 @@ For more details about the CLI download command, please refer to the [CLI guide] There are two options to speed up downloads. Both involve installing a Python package written in Rust. * `hf_xet` is newer and uses the Xet storage backend for upload/download. It is available in production, but is in the process of being rolled out to all users. -* `hf_transfer` is a power-tool with some limitations, but is thoroughly tested and has been in production for a long time. +* `hf_transfer` is a power-tool to download and upload to our LFS storage backend (note: this is less future-proof than Xet). It is thoroughly tested and has been in production for a long time, but it has some limitations. ### hf_xet From feeeec827f61a8f9c87f0729d50193e91c08c3eb Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:27:15 -0700 Subject: [PATCH 03/16] Update docs/source/en/guides/manage-cache.md Co-authored-by: Pierric Cistac --- docs/source/en/guides/manage-cache.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index fc6447527a..bfe23c3753 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -544,7 +544,7 @@ Cache deletion done. Saved 8.6G. 
## Xet Cache -To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingaface_hub` cache, creating additional caching layer to enable chunk-based deduplication. +To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure From e0d3bd8e33bfddd91349610250c544fb7e0b72cb Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:27:23 -0700 Subject: [PATCH 04/16] Update docs/source/en/guides/manage-cache.md Co-authored-by: Pierric Cistac --- docs/source/en/guides/manage-cache.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index bfe23c3753..2f1d90f09b 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -554,7 +554,7 @@ The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains ├─ shard_cache ``` -The `xet` cache, like the rest of `hf_xet` is fully integrated with the `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system. +The `xet` cache, like the rest of `hf_xet` is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system. 
The `chunk-cache` directory contains cached data chunks that are used to speed up downloads while the `shard-cache` directory contains cached shards that are utilized on the upload path. From cf1996a1a05214b616d3d8d6361b37b9630083c5 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:27:33 -0700 Subject: [PATCH 05/16] Update docs/source/en/guides/upload.md Co-authored-by: Pierric Cistac --- docs/source/en/guides/upload.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 223e379ecc..6eb9805ffa 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -182,7 +182,7 @@ Check out our [Repository limitations and recommendations](https://huggingface.c In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub. However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them! -### Faster uploads with hf_xet +### Faster uploads with `hf_xet` Take advantage of faster uploads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables chunk-based deduplication for faster uploads and downloads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. 
From e00950918b830737d462988b41a49ff6d122ce4e Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:35:42 -0700 Subject: [PATCH 06/16] PR feedback, added waitlist links --- docs/source/en/guides/manage-cache.md | 2 +- docs/source/en/guides/upload.md | 7 ++++--- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index 2f1d90f09b..0a3f918d4c 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -618,7 +618,7 @@ Example full `xet`cache directory tree: ├─ xet │ ├─ chunk_cache -│ │ ├─ LN +│ │ ├─ L1 │ │ │ ├─ L1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 │ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 │ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 6eb9805ffa..2446f3c7c1 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -166,14 +166,15 @@ Check out our [Repository limitations and recommendations](https://huggingface.c - **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time. - **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to consider that something will fail at least once -no matter if it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never be re-uploaded twice but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you. 
-- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth. To use `hf_transfer`: +- **Use `hf_xet`**: this leverages the new storage backend for Hub, is written in Rust, and is being rolled out to users right now. In order to upload using `hf_xet` your repo must be enabled to use the Xet storage backend. It is being rolled out now, so join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! +- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth (uploads LFS files). To use `hf_transfer`: 1. Specify the `hf_transfer` extra when installing `huggingface_hub` (i.e., `pip install huggingface_hub[hf_transfer]`). 2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable. -`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer). +`hf_transfer` is a power user tool for uploading LFS files! It is tested and production-ready, but it is less future-proof and lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer). @@ -188,7 +189,7 @@ Take advantage of faster uploads through `hf_xet`, the Python binding to the [`x -Xet storage is being rolled out to Hugging Face Hub users at this time, so xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. +Xet storage is being rolled out to Hugging Face Hub users at this time, so xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. 
Join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! From 630ad32a8dcd32673a6d2958352356b81620fd2c Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:42:52 -0700 Subject: [PATCH 07/16] Added HF_XET_CACHE env variable docs --- docs/source/en/package_reference/environment_variables.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/source/en/package_reference/environment_variables.md b/docs/source/en/package_reference/environment_variables.md index 8f88d25bae..d1004ed5a9 100644 --- a/docs/source/en/package_reference/environment_variables.md +++ b/docs/source/en/package_reference/environment_variables.md @@ -36,6 +36,12 @@ spaces). Defaults to `"$HF_HOME/hub"` (e.g. `"~/.cache/huggingface/hub"` by default). +### HF_XET_CACHE + +To configure where Xet chunks (byte ranges from files managed by Xet backend) are cached locally. + +Defaults to `"$HF_HOME/xet"` (e.g. `"~/.cache/huggingface/xet"` by default). + ### HF_ASSETS_CACHE To configure where [assets](../guides/manage-cache#caching-assets) created by downstream libraries From 3653e95ee9ff1714baf8de8b25495b510113ab36 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Thu, 13 Mar 2025 10:45:18 -0700 Subject: [PATCH 08/16] PR feedback --- docs/source/en/guides/upload.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 2446f3c7c1..51b199a410 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -183,7 +183,7 @@ Check out our [Repository limitations and recommendations](https://huggingface.c In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub. However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them! 
-### Faster uploads with `hf_xet` +### Faster Uploads Take advantage of faster uploads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables chunk-based deduplication for faster uploads and downloads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. From 8a11cffaafb58b95d8100e6ea5158431eaffd9a8 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 09:41:22 -0700 Subject: [PATCH 09/16] Update docs/source/en/guides/upload.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Célina --- docs/source/en/guides/upload.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 51b199a410..8a0d2429d4 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -196,7 +196,7 @@ Xet storage is being rolled out to Hugging Face Hub users at this time, so xet u To enable it, specify the `hf_xet` extra when installing `huggingface_hub`: ```bash -pip install huggingface_hub[hf_xet] +pip install -U huggingface_hub[hf_xet] ``` All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). 
From 7a455b57c54a64325a86bd85739c24ebe5eae2cd Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 09:41:38 -0700 Subject: [PATCH 10/16] Update docs/source/en/guides/download.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Célina --- docs/source/en/guides/download.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index a57cf3bc55..6717296d4e 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -179,7 +179,7 @@ chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates To enable it, specify the `hf_xet` package when installing `huggingface_hub`: ```bash -pip install huggingface_hub[hf_xet] +pip install -U huggingface_hub[hf_xet] ``` All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). From 195b68adc2608d982fc4c80453c991ff658de0db Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 09:42:05 -0700 Subject: [PATCH 11/16] Update docs/source/en/guides/upload.md Co-authored-by: Lucain --- docs/source/en/guides/upload.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 8a0d2429d4..abd8251e63 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -176,6 +176,8 @@ Check out our [Repository limitations and recommendations](https://huggingface.c `hf_transfer` is a power user tool for uploading LFS files! It is tested and production-ready, but it is less future-proof and lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer). 
+Note that the `hf_xet` and `hf_transfer` tools are mutually exclusive. The former is used to upload files to Xet-enabled repos while the latter uploads LFS files to regular repos. + ## Advanced features From 300b6a2f3185f7ae76cc5445578dcd42ada48759 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 10:08:47 -0700 Subject: [PATCH 12/16] Doc feedback --- docs/source/en/guides/download.md | 4 +- docs/source/en/guides/manage-cache.md | 200 +++++++++++++------------- docs/source/en/guides/upload.md | 2 +- 3 files changed, 106 insertions(+), 100 deletions(-) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index 6717296d4e..a4a2ac1c8e 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -168,7 +168,7 @@ For more details about the CLI download command, please refer to the [CLI guide] There are two options to speed up downloads. Both involve installing a Python package written in Rust. -* `hf_xet` is newer and uses the Xet storage backend for upload/download. It is available in production, but is in the process of being rolled out to all users. +* `hf_xet` is newer and uses the Xet storage backend for upload/download. It is available in production, but is in the process of being rolled out to all users, so join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! * `hf_transfer` is a power-tool to download and upload to our LFS storage backend (note: this is less future-proof than Xet). It is thoroughly tested and has been in production for a long time, but it has some limitations. ### hf_xet @@ -182,6 +182,8 @@ To enable it, specify the `hf_xet` package when installing `huggingface_hub`: pip install -U huggingface_hub[hf_xet] ``` +Note: `hf_xet` is only used when the files being downloaded are stored with Xet Storage. + All other `huggingface_hub` APIs will continue to work without any modification. 
To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). ### hf_transfer diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index 0a3f918d4c..ae312991fc 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -2,9 +2,11 @@ rendered properly in your Markdown viewer. --> -# Manage `huggingface_hub` cache-system +# Understand caching -## Understand caching +`huggingface_hub` uses the local disk for two caches, both of which avoid re-downloading content. The first is a file-based cache, which stores individual files downloaded from the Hub and ensures that the same file is not downloaded again, even if multiple repos or revisions use that file. The second is a chunk cache, where each chunk represents a byte range from a file; it ensures that chunks shared across files are only downloaded once. + +## File-based caching The Hugging Face Hub cache-system is designed to be the central cache shared across libraries that depend on the Hub. It has been updated in v0.8.0 to prevent re-downloading same files @@ -170,6 +172,95 @@ When symlinks are not supported, a warning message is displayed to the user to a them they are using a degraded version of the cache-system. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true. +## Chunk-based caching (Xet) + +To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating an additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) that are created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/repositories-storage). 
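To make the idea of content-defined chunking concrete, here is a toy chunker that cuts a byte stream wherever a fingerprint of the last few bytes matches a pattern. It is a sketch under stated assumptions, not the algorithm used by `xet-core`: the windowed BLAKE2 fingerprint, the function name `split_chunks`, and every parameter value are illustrative choices, and the sizes are far smaller than the 64KB chunks mentioned above.

```python
import hashlib

# Toy parameters (assumptions), kept tiny so the output is easy to inspect.
WINDOW = 8      # number of trailing bytes hashed at each position
MIN_SIZE = 16   # never cut a chunk shorter than this (except the final one)
MAX_SIZE = 256  # always cut once a chunk reaches this size
MASK = 0x3F     # cut when the low 6 bits of the fingerprint are zero


def split_chunks(data: bytes) -> list[bytes]:
    """Split `data` into chunks whose boundaries depend on the content itself."""
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        # Fingerprint only the last WINDOW bytes, so a boundary is a local property.
        window = data[i - WINDOW + 1 : i + 1]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        fingerprint = int.from_bytes(digest, "big")
        if (fingerprint & MASK) == 0 or size >= MAX_SIZE:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        # Whatever remains after the last boundary becomes the final chunk.
        chunks.append(data[start:])
    return chunks


if __name__ == "__main__":
    sample = bytes((i * 31 + 7) % 251 for i in range(2000))
    pieces = split_chunks(sample)
    print(len(pieces), "chunks; sizes:", [len(p) for p in pieces][:10])
```

Because each boundary depends only on nearby bytes, an edit in one part of a file tends to change only the chunks around it, which is what lets unchanged chunks be reused instead of transferred again.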
+ +The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, used for uploads and downloads, with the following structure: + +```bash + +├─ chunk_cache +├─ shard_cache +``` + +The `xet` cache, like the rest of `hf_xet`, is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system. + +The `chunk_cache` directory contains cached data chunks that are used to speed up downloads, while the `shard_cache` directory contains cached shards that are used on the upload path. + +### `chunk_cache` + +This cache is used on the download path. The cache directory structure is based on a base64-encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to look up the offsets of where the data is stored. + +At the topmost level, the first two letters of the base64-encoded CAS hash are used to create a subdirectory in the `chunk_cache` (keys that share these first two letters are grouped here). The inner levels consist of subdirectories with the full key as the directory name. At the base are the cache items, which are ranges of blocks that contain the cached chunks. + +```bash + +├─ xet +│ ├─ chunk_cache +│ │ ├─ A1 +│ │ │ ├─ A1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 +│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 +│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB + +``` + +When requesting a file, the first thing `hf_xet` does is communicate with Xet storage’s content-addressed store (CAS) for reconstruction information. The reconstruction information lists the CAS keys required to download the file in its entirety. 
+ +Before executing the requests for the CAS keys, the `chunk_cache` is consulted. If a key in the cache matches a CAS key, then there is no reason to issue a request for that content. `hf_xet` uses the chunks stored in the directory instead. + +As the `chunk_cache` is purely an optimization, not a guarantee, `hf_xet` utilizes a computationally efficient eviction policy. When the `chunk_cache` is full (see `Limits and Limitations` below), `hf_xet` selects an eviction candidate at random. This significantly reduces the overhead of managing a robust caching system (e.g., LRU) while still providing most of the benefits of caching chunks. + +### `shard_cache` + +This cache is used when uploading content to the Hub. The directory is flat, consisting only of shard files, each using an ID for the shard name. + +```sh + +├─ xet +│ ├─ shard_cache +│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb +│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb +│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb +│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb +``` + +The `shard_cache` contains shards that are: + +- Locally generated and successfully uploaded to the CAS +- Downloaded from CAS as part of the global deduplication algorithm + +Shards provide a mapping between files and chunks. During uploads, each file is chunked and the hash of each chunk is saved. Every shard in the cache is then consulted. If a shard contains a chunk hash that is present in the local file being uploaded, then that chunk can be skipped, as it is already stored in CAS. + +All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration. 
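The shard lookup and expiration rules described above can be sketched in a few lines. This is an illustration under stated assumptions, not the `hf_xet` implementation: the `Shard` class, the `chunks_to_upload` function, and the use of SHA-256 as the chunk hash are all hypothetical stand-ins, and real shards are `.mdb` files rather than in-memory sets.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Shard:
    """Hypothetical in-memory view of a shard: known chunk hashes plus an expiry."""

    chunk_hashes: frozenset
    expires_at: datetime


def chunk_hash(chunk: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever hash the real system uses.
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_upload(chunks, shards, now):
    """Return the chunks whose hashes are not found in any unexpired shard."""
    known = set()
    for shard in shards:
        if shard.expires_at > now:  # expired shards are not loaded
            known |= shard.chunk_hashes
    return [chunk for chunk in chunks if chunk_hash(chunk) not in known]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fresh = Shard(frozenset({chunk_hash(b"alpha")}), now + timedelta(weeks=3))
    stale = Shard(frozenset({chunk_hash(b"beta")}), now - timedelta(days=1))
    pending = chunks_to_upload([b"alpha", b"beta", b"gamma"], [fresh, stale], now)
    print([c.decode() for c in pending])
```

In this sketch, a chunk listed only in an expired shard is uploaded again, which is the safe outcome: an expired shard is treated as if it carried no information.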
+ +### Limits and Limitations + +The `chunk_cache` is limited to 10 GB in size, while the `shard_cache` has no enforced limit (in practice, the size and use of shards are such that limiting the cache is unnecessary). + +By design, neither cache exposes a high-level API. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache). + +If you need to reclaim the space used by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf ~/.cache/huggingface/xet` (adjust the path if your Hugging Face cache is stored somewhere other than the default `~/.cache/huggingface`). + +Example full `xet` cache directory tree: + +```sh + +├─ xet +│ ├─ chunk_cache +│ │ ├─ L1 +│ │ │ ├─ L1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 +│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 +│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB +│ ├─ shard_cache +│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb +│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb +│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb +│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb +``` + +To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/repositories-storage). + ## Caching assets In addition to caching files from the Hub, downstream libraries often requires to cache @@ -232,7 +323,9 @@ In practice, your assets cache should look like the following tree: └── (...)
``` -## Scan your cache +## Manage your file-based cache + +### Scan your cache At the moment, cached files are never deleted from your local directory: when you download a new revision of a branch, previous files are kept in case you need them again. @@ -240,7 +333,7 @@ Therefore it can be useful to scan your cache directory in order to know which r and revisions are taking the most disk space. `huggingface_hub` provides an helper to do so that can be used via `huggingface-cli` or in a python script. -### Scan cache from the terminal +#### Scan cache from the terminal The easiest way to scan your HF cache-system is to use the `scan-cache` command from `huggingface-cli` tool. This command scans the cache and prints a report with information @@ -304,7 +397,7 @@ t5-small model d0a119eedb3718e34c648e594394474cf95e0617 t5-small model d78aea13fa7ecd06c29e3e46195d6341255065d5 970.7M 9 1 week ago main /home/wauplin/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5 ``` -### Scan cache from Python +#### Scan cache from Python For a more advanced usage, use [`scan_cache_dir`] which is the python utility called by the CLI tool. @@ -368,7 +461,7 @@ HFCacheInfo( ) ``` -## Clean your cache +### Clean your cache Scanning your cache is interesting but what you really want to do next is usually to delete some portions to free up some space on your drive. This is possible using the @@ -376,7 +469,7 @@ delete some portions to free up some space on your drive. This is possible using [`~HFCacheInfo.delete_revisions`] helper from [`HFCacheInfo`] object returned when scanning the cache. -### Delete strategy +#### Delete strategy To delete some cache, you need to pass a list of revisions to delete. The tool will define a strategy to free up the space based on this list. It returns a @@ -408,7 +501,7 @@ error is thrown. 
The deletion continues for other paths contained in the -### Clean cache from the terminal +#### Clean cache from the terminal The easiest way to delete some revisions from your HF cache-system is to use the `delete-cache` command from `huggingface-cli` tool. The command has two modes. By @@ -522,7 +615,7 @@ Example of command file: # 9cfa5647b32c0a30d0adfca06bf198d82192a0d1 # Refs: main # modified 5 days ago ``` -### Clean cache from Python +#### Clean cache from Python For more flexibility, you can also use the [`~HFCacheInfo.delete_revisions`] method programmatically. Here is a simple example. See reference for details. @@ -541,92 +634,3 @@ Will free 8.6G >>> delete_strategy.execute() Cache deletion done. Saved 8.6G. ``` - -## Xet Cache - -To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. - -The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure - -```bash - -├─ chunk_cache -├─ shard_cache -``` - -The `xet` cache, like the rest of `hf_xet` is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system. - -The `chunk-cache` directory contains cached data chunks that are used to speed up downloads while the `shard-cache` directory contains cached shards that are utilized on the upload path. - -### `chunk_cache` - -This cache is used on the download path. The cache directory structure is based on a base-64 encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to lookup the offsets of where the data is stored. 
- -At the topmost level, the first two letters of the base 64 encoded CAS hash are used to create a subdirectory in the `chunk_cache` (keys that share these first two letters are grouped here). The inner levels are comprised of subdirectories with the full key as the directory name. At the base are the cache items which are ranges of blocks that contain the cached chunks. - -```bash - -├─ xet -│ ├─ chunk_cache -│ │ ├─ A1 -│ │ │ ├─ A1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 -│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 -│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB - -``` - -When requesting a file, the first thing `hf_xet` does is communicate with Xet storage’s content addressed store (CAS) for reconstruction information. The reconstruction information contains information about the CAS keys required to download the file in its entirety. - -Before executing the requests for the CAS keys, the `chunk_cache` is consulted. If a key in the cache matches a CAS key, then there is no reason to issue a request for that content. `hf_xet` uses the chunks stored in the directory instead. - -As the `chunk_cache` is purely an optimization, not a guarantee, `hf_xet` utilizes a computationally efficient eviction policy. When the `chunk_cache` is full (see `Limits and Limitations` below), `hf_xet` implements a random eviction policy when selecting an eviction candidate. This significantly reduces the overhead of managing a robust caching system (e.g., LRU) while still providing most of the benefits of caching chunks. - -### `shard_cache` - -This cache is used when uploading content to the Hub. The directory is flat, comprising only of shard files, each using an ID for the shard name. 
- -```sh - -├─ xet -│ ├─ shard_cache -│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb -│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb -│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb -│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb -``` - -The `shard_cache` contains shards that are: - -- Locally generated and successfully uploaded to the CAS -- Downloaded from CAS as part of the global deduplication algorithm - -Shards provide a mapping between files and chunks. During uploads, each file is chunked and the hash of the chunk is saved. Every shard in the cache is then consulted. If a shard contains a chunk hash that is present in the local file being uploaded, then that chunk can be discarded as it is already stored in CAS. - -All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration. - -### Limits and Limitations - -The `chunk_cache` is limited to 10GB in size while the `shard_cache` is technically without limits (in practice, the size and use of shards are such that limiting the cache is unnecessary). - -By design, both caches are without high-level APIs. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache). 
- -If you need to reclaim the space utilized by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf ~//xet` where `` is the location of your Hugging Face cache, typically `~/.cache/huggingface` - -Example full `xet`cache directory tree: - -```sh - -├─ xet -│ ├─ chunk_cache -│ │ ├─ L1 -│ │ │ ├─ L1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0 -│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9 -│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB -│ ├─ shard_cache -│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb -│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb -│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb -│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb -``` - -To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/repositories-storage). diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index abd8251e63..cd8db5195a 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -191,7 +191,7 @@ Take advantage of faster uploads through `hf_xet`, the Python binding to the [`x -Xet storage is being rolled out to Hugging Face Hub users at this time, so xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. Join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! +Xet storage is being rolled out to Hugging Face Hub users at this time, so xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. Join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! Also, `hf_xet` today only works with files on the file system, so cannot be used with file-like objects (byte-arrays, buffers). 
From 6c796016b0aa6c1ddf830e0ab58e1765e570cd7a Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 10:29:16 -0700 Subject: [PATCH 13/16] Added two lines about flow of upload/download --- docs/source/en/guides/download.md | 4 +++- docs/source/en/guides/upload.md | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index a4a2ac1c8e..2eed70c622 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -174,7 +174,9 @@ There are two options to speed up downloads. Both involve installing a Python pa ### hf_xet Take advantage of faster downloads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables -chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. +chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS. + +`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, storing collections of these chunks (called blocks or xorbs) remotely and retrieving them to reassemble the file when requested. When downloading, after confirming the user is authorized to access the files, `hf_xet` queries the Xet content-addressable service (CAS) with the LFS SHA256 hash of each file to receive the reconstruction metadata (ranges within xorbs) needed to assemble it, along with presigned URLs to download the xorbs directly. `hf_xet` then downloads the necessary xorb ranges and writes the files to disk. `hf_xet` uses a local disk cache so that each chunk is downloaded only once; learn more in the [Chunk-based caching (Xet)](./manage-cache.md#chunk-based-caching-xet) section.
To enable it, specify the `hf_xet` package when installing `huggingface_hub`: diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index cd8db5195a..5cfdedd765 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -195,6 +195,8 @@ Xet storage is being rolled out to Hugging Face Hub users at this time, so xet u +`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, storing collections of these chunks (called blocks or xorbs) remotely and retrieving them to reassemble the file when requested. When uploading, after confirming the user is authorized to write to this repo, `hf_xet` scans the files, breaks them down into chunks, and collects those chunks into xorbs (deduplicating across known chunks). It then uploads these xorbs to the Xet content-addressable service (CAS), which verifies the integrity of the xorbs, registers the xorb metadata along with the LFS SHA256 hash (to support lookup/download), and writes the xorbs to remote storage. + To enable it, specify the `hf_xet` extra when installing `huggingface_hub`: ```bash From 8b808c3cb110748ec851f1d4dcfc7b3328033273 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Fri, 14 Mar 2025 11:05:41 -0700 Subject: [PATCH 14/16] Updating links to Hub doc location --- docs/source/en/guides/download.md | 2 +- docs/source/en/guides/manage-cache.md | 4 ++-- docs/source/en/guides/upload.md | 2 +- docs/source/en/package_reference/environment_variables.md | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/source/en/guides/download.md b/docs/source/en/guides/download.md index 2eed70c622..8143286956 100644 --- a/docs/source/en/guides/download.md +++ b/docs/source/en/guides/download.md @@ -186,7 +186,7 @@ pip install -U huggingface_hub[hf_xet] Note: `hf_xet` will only be utilized when the files being downloaded are being stored with Xet Storage.
-All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). +All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends). ### hf_transfer diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index ae312991fc..d833b315f8 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -174,7 +174,7 @@ by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true. ## Chunk-based caching (Xet) -To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) that are created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/repositories-storage). +To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) that are created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends). 
The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure @@ -259,7 +259,7 @@ Example full `xet`cache directory tree: │ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb ``` -To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/repositories-storage). +To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/storage-backends). ## Caching assets diff --git a/docs/source/en/guides/upload.md b/docs/source/en/guides/upload.md index 5cfdedd765..434b8e284d 100644 --- a/docs/source/en/guides/upload.md +++ b/docs/source/en/guides/upload.md @@ -203,7 +203,7 @@ To enable it, specify the `hf_xet` extra when installing `huggingface_hub`: pip install -U huggingface_hub[hf_xet] ``` -All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/repositories-storage). +All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends). ### Non-blocking uploads diff --git a/docs/source/en/package_reference/environment_variables.md b/docs/source/en/package_reference/environment_variables.md index d1004ed5a9..4776ffcb06 100644 --- a/docs/source/en/package_reference/environment_variables.md +++ b/docs/source/en/package_reference/environment_variables.md @@ -168,7 +168,7 @@ Please note that using `hf_transfer` comes with certain limitations. Since it is `hf_xet` is an alternative to `hf_transfer`. It provides efficient file transfers through a chunk-based deduplication strategy, custom Xet storage (replacing Git LFS), and a seamless integration with `huggingface_hub`. 
-[Read more about the package](https://huggingface.co/docs/hub/repositories-storage) and enable with `pip install huggingface_hub[hf_xet]`. +[Read more about the package](https://huggingface.co/docs/hub/storage-backends) and enable with `pip install huggingface_hub[hf_xet]`. From b599282e15ee4340a337990d3e50b6131c305af7 Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Mon, 17 Mar 2025 09:41:33 -0700 Subject: [PATCH 15/16] Update docs/source/en/guides/manage-cache.md Co-authored-by: Lucain --- docs/source/en/guides/manage-cache.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index d833b315f8..c0069b68bc 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -4,7 +4,7 @@ rendered properly in your Markdown viewer. # Understand caching -`huggingface_hub` utilizes the local disk as two caches, which avoid re-downloading items again. The first cache is a file-based cache, which caches individual files downloaded from the Hub and ensures that the same file is not downloaded again, even if multiple repos or revisions use that file. The second cache is a chunk cache, where each chunk represents a byte range from a file and ensures that chunks that are shared across files are only downloaded once. +`huggingface_hub` utilizes the local disk as two caches, which avoid re-downloading items again. The first cache is a file-based cache, which caches individual files downloaded from the Hub and ensures that the same file is not downloaded again when a repo gets updated. The second cache is a chunk cache, where each chunk represents a byte range from a file and ensures that chunks that are shared across files are only downloaded once. 
## File-based caching From 9f917598b2548ea487901cc5213851eedaaa527d Mon Sep 17 00:00:00 2001 From: Rajat Arya Date: Mon, 17 Mar 2025 09:44:31 -0700 Subject: [PATCH 16/16] Reformat headings, less levels in TOC --- docs/source/en/guides/manage-cache.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/source/en/guides/manage-cache.md b/docs/source/en/guides/manage-cache.md index c0069b68bc..b2ef0cf6c9 100644 --- a/docs/source/en/guides/manage-cache.md +++ b/docs/source/en/guides/manage-cache.md @@ -333,7 +333,7 @@ Therefore it can be useful to scan your cache directory in order to know which r and revisions are taking the most disk space. `huggingface_hub` provides an helper to do so that can be used via `huggingface-cli` or in a python script. -#### Scan cache from the terminal +**Scan cache from the terminal** The easiest way to scan your HF cache-system is to use the `scan-cache` command from `huggingface-cli` tool. This command scans the cache and prints a report with information @@ -384,7 +384,7 @@ Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G. Got 1 warning(s) while scanning. Use -vvv to print details. ``` -#### Grep example +**Grep example** Since the output is in tabular format, you can combine it with any `grep`-like tools to filter the entries. Here is an example to filter only revisions from the "t5-small" @@ -397,7 +397,7 @@ t5-small model d0a119eedb3718e34c648e594394474cf95e0617 t5-small model d78aea13fa7ecd06c29e3e46195d6341255065d5 970.7M 9 1 week ago main /home/wauplin/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5 ``` -#### Scan cache from Python +**Scan cache from Python** For a more advanced usage, use [`scan_cache_dir`] which is the python utility called by the CLI tool. @@ -469,7 +469,7 @@ delete some portions to free up some space on your drive. 
This is possible using [`~HFCacheInfo.delete_revisions`] helper from [`HFCacheInfo`] object returned when scanning the cache. -#### Delete strategy +**Delete strategy** To delete some cache, you need to pass a list of revisions to delete. The tool will define a strategy to free up the space based on this list. It returns a @@ -501,7 +501,7 @@ error is thrown. The deletion continues for other paths contained in the -#### Clean cache from the terminal +**Clean cache from the terminal** The easiest way to delete some revisions from your HF cache-system is to use the `delete-cache` command from `huggingface-cli` tool. The command has two modes. By @@ -510,7 +510,7 @@ revisions to delete. This TUI is currently in beta as it has not been tested on platforms. If the TUI doesn't work on your machine, you can disable it using the `--disable-tui` flag. -#### Using the TUI +**Using the TUI** This is the default mode. To use it, you first need to install extra dependencies by running the following command: @@ -554,7 +554,7 @@ Start deletion. Done. Deleted 1 repo(s) and 0 revision(s) for a total of 3.1G. ``` -#### Without TUI +**Without TUI** As mentioned above, the TUI mode is currently in beta and is optional. It may be the case that it doesn't work on your machine or that you don't find it convenient. @@ -615,7 +615,7 @@ Example of command file: # 9cfa5647b32c0a30d0adfca06bf198d82192a0d1 # Refs: main # modified 5 days ago ``` -#### Clean cache from Python +**Clean cache from Python** For more flexibility, you can also use the [`~HFCacheInfo.delete_revisions`] method programmatically. Here is a simple example. See reference for details.