
Commit e4e2a04

rajatarya, julien-c, Pierrci, hanouticelina, and Wauplin authored
Xet Docs for huggingface_hub (#2899)
* Xet docs
* PR feedback, added waitlist links
* Added HF_XET_CACHE env variable docs
* PR feedback
* Doc feedback
* Added two lines about flow of upload/download
* Updating links to Hub doc location
* Reformat headings, less levels in TOC

---------

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Pierric Cistac <[email protected]>
Co-authored-by: Célina <[email protected]>
Co-authored-by: Lucain <[email protected]>
1 parent 86de575 commit e4e2a04

File tree

4 files changed

+162
-15
lines changed


docs/source/en/guides/download.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -166,6 +166,30 @@ For more details about the CLI download command, please refer to the [CLI guide]
166166

167167
## Faster downloads
168168

169+
There are two options to speed up downloads. Both involve installing a Python package written in Rust.
170+
171+
* `hf_xet` is newer and uses the Xet storage backend for upload/download. It is available in production, but is in the process of being rolled out to all users, so join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon!
172+
* `hf_transfer` is a power-tool to download and upload to our LFS storage backend (note: this is less future-proof than Xet). It is thoroughly tested and has been in production for a long time, but it has some limitations.
173+
174+
### hf_xet
175+
176+
Take advantage of faster downloads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables
177+
chunk-based deduplication for faster downloads and uploads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS.
178+
179+
`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, stores collections of these chunks (called blocks or xorbs) remotely, and retrieves them to reassemble the file when requested. When downloading, after confirming the user is authorized to access the files, `hf_xet` queries the Xet content-addressable service (CAS) with the LFS SHA256 hash of each file to receive the reconstruction metadata (ranges within xorbs) needed to assemble the file, along with presigned URLs to download the xorbs directly. `hf_xet` then downloads the necessary xorb ranges and writes the files to disk. `hf_xet` uses a local disk cache so that chunks are only downloaded once; learn more in the [Chunk-based caching (Xet)](./manage-cache.md#chunk-based-caching-xet) section.
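The reassembly step can be pictured with a toy, in-memory sketch. This is not the `hf_xet` or `xet-core` API: `Term`, `reconstruct`, and the dictionary standing in for remote xorbs are hypothetical names chosen for illustration, and real reconstruction metadata refers to ranges fetched over presigned URLs rather than slices of local byte strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One entry of (hypothetical) reconstruction metadata: a range within a xorb."""
    xorb_id: str
    start: int
    end: int  # exclusive


def reconstruct(terms, xorbs):
    """Reassemble a file by concatenating the requested xorb ranges in order."""
    return b"".join(xorbs[t.xorb_id][t.start:t.end] for t in terms)


# Stand-in for remote storage: xorb ID -> bytes.
xorbs = {"xorb-a": b"hello, xet ", "xorb-b": b"storage world"}

# The same xorb can be referenced more than once, which is what makes deduplication possible.
terms = [Term("xorb-a", 0, 7), Term("xorb-b", 8, 13), Term("xorb-a", 6, 10)]
data = reconstruct(terms, xorbs)
print(data)
```

In this picture, a local chunk cache would simply short-circuit the lookup into `xorbs` for ranges that are already on disk.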
180+
181+
To enable it, specify the `hf_xet` package when installing `huggingface_hub`:
182+
183+
```bash
184+
pip install -U huggingface_hub[hf_xet]
185+
```
186+
187+
Note: `hf_xet` is only used when the files being downloaded are stored with Xet Storage.
188+
189+
All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends).
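To confirm that the optional dependency is present in the current environment, a small standard-library check may help. This is a sketch that assumes the importable module name matches the package name `hf_xet`; `hf_xet_installed` is a hypothetical helper, not part of `huggingface_hub`. It only reports whether the package is installed, not whether a given repo is stored with Xet.

```python
import importlib.util


def hf_xet_installed() -> bool:
    """Return True if a module named `hf_xet` can be found in this environment."""
    return importlib.util.find_spec("hf_xet") is not None


print("hf_xet installed:", hf_xet_installed())
```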
190+
191+
### hf_transfer
192+
169193
If you are running on a machine with high bandwidth,
170194
you can increase your download speed with [`hf_transfer`](https://github.com/huggingface/hf_transfer),
171195
a Rust-based library developed to speed up file transfers with the Hub.

docs/source/en/guides/manage-cache.md

Lines changed: 105 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,11 @@
22
rendered properly in your Markdown viewer.
33
-->
44

5-
# Manage `huggingface_hub` cache-system
5+
# Understand caching
66

7-
## Understand caching
7+
`huggingface_hub` uses the local disk for two caches, which avoid re-downloading items. The first is a file-based cache, which caches individual files downloaded from the Hub and ensures that the same file is not downloaded again when a repo gets updated. The second is a chunk cache, where each chunk represents a byte range from a file; it ensures that chunks shared across files are only downloaded once.
8+
9+
## File-based caching
810

911
The Hugging Face Hub cache-system is designed to be the central cache shared across libraries
1012
that depend on the Hub. It has been updated in v0.8.0 to prevent re-downloading same files
@@ -170,6 +172,95 @@ When symlinks are not supported, a warning message is displayed to the user to a
170172
them they are using a degraded version of the cache-system. This warning can be disabled
171173
by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true.
172174

175+
## Chunk-based caching (Xet)
176+
177+
To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating an additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).
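For readers unfamiliar with content-defined chunking, here is a minimal sketch of the general technique. It is not the algorithm used by `xet-core`: the gear-style rolling hash, `MIN_CHUNK`, and `MASK` are illustrative assumptions, and only the 64KB upper bound (taken here as 64 * 1024 bytes) comes from the text above.

```python
import random

MAX_CHUNK = 64 * 1024          # upper bound mentioned above
MIN_CHUNK = 2 * 1024           # illustrative assumption
MASK = ((1 << 13) - 1) << 19   # illustrative assumption: test the top 13 bits of a 32-bit hash

# Fixed pseudo-random table so that boundaries depend only on the content.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list:
    """Split data at positions chosen by a rolling hash of the recent bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the hash matches the pattern, or when the chunk hits the size cap.
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and (h & MASK) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


rng = random.Random(1)
data = bytes(rng.getrandbits(8) for _ in range(200_000))
chunks = chunk(data)
print(len(chunks), "chunks, largest:", max(len(c) for c in chunks))
```

Because boundaries are derived from the bytes themselves rather than from fixed offsets, an edit near the start of a file tends to shift only nearby boundaries, which is what lets unchanged regions keep producing the same chunks.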
178+
179+
The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, used for uploads and downloads, with the following structure:
180+
181+
```bash
182+
<CACHE_DIR>
183+
├─ chunk_cache
184+
├─ shard_cache
185+
```
186+
187+
The `xet` cache, like the rest of `hf_xet`, is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system.
188+
189+
The `chunk_cache` directory contains cached data chunks that are used to speed up downloads, while the `shard_cache` directory contains cached shards that are used on the upload path.
190+
191+
### `chunk_cache`
192+
193+
This cache is used on the download path. The cache directory structure is based on a base64-encoded hash from the content-addressed store (CAS) that backs each Xet-enabled repository. A CAS hash serves as the key to look up the offsets of where the data is stored.
194+
195+
At the topmost level, the first two letters of the base64-encoded CAS hash are used to create a subdirectory in the `chunk_cache` (keys that share these first two letters are grouped here). The inner levels consist of subdirectories with the full key as the directory name. At the base are the cache items, which are ranges of blocks that contain the cached chunks.
196+
197+
```bash
198+
<CACHE_DIR>
199+
├─ xet
200+
│ ├─ chunk_cache
201+
│ │ ├─ A1
202+
│ │ │ ├─ A1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0
203+
│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9
204+
│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB
205+
206+
```
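The layout above can be expressed as a small path computation. This is a sketch of the directory convention as described in this section only; `chunk_cache_dir_for_key` is a hypothetical helper, not an `hf_xet` function, and the cache location passed in is an arbitrary example.

```python
from pathlib import Path


def chunk_cache_dir_for_key(cache_dir: Path, key: str) -> Path:
    """Directory for one CAS key: <cache_dir>/xet/chunk_cache/<first two letters>/<full key>."""
    return cache_dir / "xet" / "chunk_cache" / key[:2] / key


key = "A1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0"
path = chunk_cache_dir_for_key(Path("hf-cache"), key)
print(path)
```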
207+
208+
When requesting a file, the first thing `hf_xet` does is communicate with Xet storage’s content-addressed store (CAS) for reconstruction information. The reconstruction information describes the CAS keys required to download the file in its entirety.
209+
210+
Before executing the requests for the CAS keys, the `chunk_cache` is consulted. If a key in the cache matches a CAS key, then there is no reason to issue a request for that content. `hf_xet` uses the chunks stored in the directory instead.
211+
212+
As the `chunk_cache` is purely an optimization, not a guarantee, `hf_xet` uses a computationally efficient eviction policy. When the `chunk_cache` is full (see the Limits and Limitations section below), `hf_xet` selects an eviction candidate at random. This significantly reduces the overhead of managing a more elaborate caching policy (e.g., LRU) while still providing most of the benefits of caching chunks.
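A random eviction policy needs no bookkeeping about access order, as the following in-memory sketch suggests. `RandomEvictionCache` is a hypothetical class written for illustration and does not mirror the on-disk implementation in `hf_xet`; the byte capacity and seed are arbitrary.

```python
import random


class RandomEvictionCache:
    """Toy byte-capacity cache that evicts a randomly chosen entry when full."""

    def __init__(self, capacity_bytes, seed=None):
        self.capacity_bytes = capacity_bytes
        self.items = {}
        self.size = 0
        self._rng = random.Random(seed)

    def get(self, key):
        # No recency tracking is needed on reads, unlike LRU.
        return self.items.get(key)

    def put(self, key, value: bytes):
        if len(value) > self.capacity_bytes:
            return  # larger than the whole cache: skip caching
        if key in self.items:
            self.size -= len(self.items.pop(key))
        while self.size + len(value) > self.capacity_bytes:
            # Listing the keys is O(n); acceptable for a sketch.
            victim = self._rng.choice(list(self.items))
            self.size -= len(self.items.pop(victim))
        self.items[key] = value
        self.size += len(value)


cache = RandomEvictionCache(capacity_bytes=10, seed=0)
for i in range(5):
    cache.put(f"key-{i}", b"abcd")
print(sorted(cache.items), cache.size)
```

A miss after eviction is harmless here for the same reason it is harmless in the text above: the content can always be requested again.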
213+
214+
### `shard_cache`
215+
216+
This cache is used when uploading content to the Hub. The directory is flat, consisting only of shard files, each using an ID for the shard name.
217+
218+
```sh
219+
<CACHE_DIR>
220+
├─ xet
221+
│ ├─ shard_cache
222+
│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb
223+
│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb
224+
│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb
225+
│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb
226+
```
227+
228+
The `shard_cache` contains shards that are:
229+
230+
- Locally generated and successfully uploaded to the CAS
231+
- Downloaded from CAS as part of the global deduplication algorithm
232+
233+
Shards provide a mapping between files and chunks. During uploads, each file is chunked and the hash of the chunk is saved. Every shard in the cache is then consulted. If a shard contains a chunk hash that is present in the local file being uploaded, then that chunk can be discarded as it is already stored in CAS.
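The deduplication check can be sketched as a set lookup. The model below is a simplification under stated assumptions: each shard is represented as a plain set of chunk hashes, SHA-256 stands in for whatever hash `xet-core` actually uses, and `chunk_hash` and `chunks_to_upload` are hypothetical helpers.

```python
import hashlib


def chunk_hash(chunk: bytes) -> str:
    """Stand-in chunk hash (assumption: the real hash function may differ)."""
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_upload(file_chunks, cached_shards):
    """Return only the chunks whose hash is not listed in any cached shard."""
    known = set().union(*cached_shards)
    new_chunks = []
    for chunk in file_chunks:
        h = chunk_hash(chunk)
        if h not in known:
            new_chunks.append(chunk)
            known.add(h)  # also deduplicate repeats within the same file
    return new_chunks


shard = {chunk_hash(b"aaa"), chunk_hash(b"bbb")}
result = chunks_to_upload([b"aaa", b"ccc", b"bbb", b"ccc"], [shard])
print(result)
```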
234+
235+
All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration.
236+
237+
### Limits and Limitations
238+
239+
The `chunk_cache` is limited to 10GB in size while the `shard_cache` is technically without limits (in practice, the size and use of shards are such that limiting the cache is unnecessary).
240+
241+
By design, both caches are without high-level APIs. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
242+
243+
If you need to reclaim the space used by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf <cache_dir>/xet`, where `<cache_dir>` is the location of your Hugging Face cache, typically `~/.cache/huggingface`.
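The same cleanup can be done from Python. This sketch resolves the directory from the documented `HF_XET_CACHE` variable, then `HF_HOME`, then the default `~/.cache/huggingface/xet`; it is a simplification that ignores any other variables `huggingface_hub` may consult, and `xet_cache_dir` and `clear_xet_cache` are hypothetical helpers rather than library functions.

```python
import os
import shutil
from pathlib import Path


def xet_cache_dir() -> Path:
    """Resolve the xet cache: HF_XET_CACHE, else $HF_HOME/xet, else the default."""
    explicit = os.environ.get("HF_XET_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "xet"


def clear_xet_cache() -> bool:
    """Delete the xet cache directory; return False if there was nothing to delete."""
    path = xet_cache_dir()
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Only print the location here; call clear_xet_cache() explicitly to delete it.
print("xet cache directory:", xet_cache_dir())
```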
244+
245+
Example of a full `xet` cache directory tree:
246+
247+
```sh
248+
<CACHE_DIR>
249+
├─ xet
250+
│ ├─ chunk_cache
251+
│ │ ├─ L1
252+
│ │ │ ├─ L1GerURLUcISVivdseeoY1PnYifYkOaCCJ7V5Q9fjgxkZWZhdWx0
253+
│ │ │ │ ├─ AAAAAAEAAAA5DQAAAAAAAIhRLjDI3SS5jYs4ysNKZiJy9XFI8CN7Ww0UyEA9KPD9
254+
│ │ │ │ ├─ AQAAAAIAAABzngAAAAAAAPNqPjd5Zby5aBvabF7Z1itCx0ryMwoCnuQcDwq79jlB
255+
│ ├─ shard_cache
256+
│ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb
257+
│ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb
258+
│ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb
259+
│ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb
260+
```
261+
262+
To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/storage-backends).
263+
173264
## Caching assets
174265

175266
In addition to caching files from the Hub, downstream libraries often need to cache
@@ -232,15 +323,17 @@ In practice, your assets cache should look like the following tree:
232323
└── (...)
233324
```
234325

235-
## Scan your cache
326+
## Manage your file-based cache
327+
328+
### Scan your cache
236329

237330
At the moment, cached files are never deleted from your local directory: when you download
238331
a new revision of a branch, previous files are kept in case you need them again.
239332
Therefore it can be useful to scan your cache directory in order to know which repos
240333
and revisions are taking the most disk space. `huggingface_hub` provides a helper to
241334
do so that can be used via `huggingface-cli` or in a python script.
242335

243-
### Scan cache from the terminal
336+
**Scan cache from the terminal**
244337

245338
The easiest way to scan your HF cache-system is to use the `scan-cache` command from
246339
`huggingface-cli` tool. This command scans the cache and prints a report with information
@@ -291,7 +384,7 @@ Done in 0.0s. Scanned 6 repo(s) for a total of 3.4G.
291384
Got 1 warning(s) while scanning. Use -vvv to print details.
292385
```
293386

294-
#### Grep example
387+
**Grep example**
295388

296389
Since the output is in tabular format, you can combine it with any `grep`-like tools to
297390
filter the entries. Here is an example to filter only revisions from the "t5-small"
@@ -304,7 +397,7 @@ t5-small model d0a119eedb3718e34c648e594394474cf95e0617
304397
t5-small model d78aea13fa7ecd06c29e3e46195d6341255065d5 970.7M 9 1 week ago main /home/wauplin/.cache/huggingface/hub/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5
305398
```
306399

307-
### Scan cache from Python
400+
**Scan cache from Python**
308401

309402
For a more advanced usage, use [`scan_cache_dir`] which is the python utility called by
310403
the CLI tool.
@@ -368,15 +461,15 @@ HFCacheInfo(
368461
)
369462
```
370463

371-
## Clean your cache
464+
### Clean your cache
372465

373466
Scanning your cache is interesting but what you really want to do next is usually to
374467
delete some portions to free up some space on your drive. This is possible using the
375468
`delete-cache` CLI command. One can also programmatically use the
376469
[`~HFCacheInfo.delete_revisions`] helper from [`HFCacheInfo`] object returned when
377470
scanning the cache.
378471

379-
### Delete strategy
472+
**Delete strategy**
380473

381474
To delete some cache, you need to pass a list of revisions to delete. The tool will
382475
define a strategy to free up the space based on this list. It returns a
@@ -408,7 +501,7 @@ error is thrown. The deletion continues for other paths contained in the
408501

409502
</Tip>
410503

411-
### Clean cache from the terminal
504+
**Clean cache from the terminal**
412505

413506
The easiest way to delete some revisions from your HF cache-system is to use the
414507
`delete-cache` command from `huggingface-cli` tool. The command has two modes. By
@@ -417,7 +510,7 @@ revisions to delete. This TUI is currently in beta as it has not been tested on
417510
platforms. If the TUI doesn't work on your machine, you can disable it using the
418511
`--disable-tui` flag.
419512

420-
#### Using the TUI
513+
**Using the TUI**
421514

422515
This is the default mode. To use it, you first need to install extra dependencies by
423516
running the following command:
@@ -461,7 +554,7 @@ Start deletion.
461554
Done. Deleted 1 repo(s) and 0 revision(s) for a total of 3.1G.
462555
```
463556

464-
#### Without TUI
557+
**Without TUI**
465558

466559
As mentioned above, the TUI mode is currently in beta and is optional. It may be the
467560
case that it doesn't work on your machine or that you don't find it convenient.
@@ -522,7 +615,7 @@ Example of command file:
522615
# 9cfa5647b32c0a30d0adfca06bf198d82192a0d1 # Refs: main # modified 5 days ago
523616
```
524617

525-
### Clean cache from Python
618+
**Clean cache from Python**
526619

527620
For more flexibility, you can also use the [`~HFCacheInfo.delete_revisions`] method
528621
programmatically. Here is a simple example. See reference for details.

docs/source/en/guides/upload.md

Lines changed: 24 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -166,14 +166,17 @@ Check out our [Repository limitations and recommendations](https://huggingface.c
166166

167167
- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time.
168168
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to assume that something will fail at least once, whether it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. An LFS file that is already committed is guaranteed never to be re-uploaded, but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you.
169-
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth. To use `hf_transfer`:
169+
- **Use `hf_xet`**: this leverages the new storage backend for the Hub, is written in Rust, and is being rolled out to users right now. To upload using `hf_xet`, your repo must be enabled to use the Xet storage backend, so join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon!
170+
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth (uploads LFS files). To use `hf_transfer`:
170171
1. Specify the `hf_transfer` extra when installing `huggingface_hub`
171172
(i.e., `pip install huggingface_hub[hf_transfer]`).
172173
2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.
173174

174175
<Tip warning={true}>
175176

176-
`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).
177+
`hf_transfer` is a power user tool for uploading LFS files! It is tested and production-ready, but it is less future-proof and lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).
178+
179+
Note that the `hf_xet` and `hf_transfer` tools are mutually exclusive. The former is used to upload files to Xet-enabled repos, while the latter uploads LFS files to regular repos.
177180

178181
</Tip>
179182

@@ -182,6 +185,25 @@ Check out our [Repository limitations and recommendations](https://huggingface.c
182185
In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.
183186
However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them!
184187

188+
### Faster Uploads
189+
190+
Take advantage of faster uploads through `hf_xet`, the Python binding to the [`xet-core`](https://github.com/huggingface/xet-core) library that enables chunk-based deduplication for faster uploads and downloads. `hf_xet` integrates seamlessly with `huggingface_hub`, but uses the Rust `xet-core` library and Xet storage instead of LFS.
191+
192+
<Tip warning={true}>
193+
194+
Xet storage is being rolled out to Hugging Face Hub users at this time, so Xet uploads may need to be enabled for your repo for `hf_xet` to actually upload to the Xet backend. Join the [waitlist](https://huggingface.co/join/xet) to get onboarded soon! Also, `hf_xet` currently only works with files on the file system, so it cannot be used with file-like objects (byte-arrays, buffers).
195+
196+
</Tip>
197+
198+
`hf_xet` uses the Xet storage system, which breaks files down into immutable chunks, stores collections of these chunks (called blocks or xorbs) remotely, and retrieves them to reassemble the file when requested. When uploading, after confirming the user is authorized to write to this repo, `hf_xet` scans the files, breaking them down into chunks and collecting those chunks into xorbs (deduplicating across known chunks). It then uploads these xorbs to the Xet content-addressable service (CAS), which verifies the integrity of the xorbs, registers the xorb metadata along with the LFS SHA256 hash (to support lookup/download), and writes the xorbs to remote storage.
199+
200+
To enable it, specify the `hf_xet` extra when installing `huggingface_hub`:
201+
202+
```bash
203+
pip install -U huggingface_hub[hf_xet]
204+
```
205+
206+
All other `huggingface_hub` APIs will continue to work without any modification. To learn more about the benefits of Xet storage and `hf_xet`, refer to this [section](https://huggingface.co/docs/hub/storage-backends).
185207

186208
### Non-blocking uploads
187209

docs/source/en/package_reference/environment_variables.md

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ Defaults to `"$HF_HOME/hub"` (e.g. `"~/.cache/huggingface/hub"` by default).
3838

3939
### HF_XET_CACHE
4040

41-
To configure where Xet Storage chunks will be cached locally.
41+
To configure where Xet chunks (byte ranges from files managed by Xet backend) are cached locally.
4242

4343
Defaults to `"$HF_HOME/xet"` (e.g. `"~/.cache/huggingface/xet"` by default).
4444

@@ -164,6 +164,14 @@ To use `hf_transfer`:
164164

165165
Please note that using `hf_transfer` comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, `hf_transfer` lacks several user-friendly features such as resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, `hf_transfer` is not enabled by default in `huggingface_hub`.
166166

167+
<Tip>
168+
169+
`hf_xet` is an alternative to `hf_transfer`. It provides efficient file transfers through a chunk-based deduplication strategy, custom Xet storage (replacing Git LFS), and a seamless integration with `huggingface_hub`.
170+
171+
[Read more about the package](https://huggingface.co/docs/hub/storage-backends) and enable with `pip install huggingface_hub[hf_xet]`.
172+
173+
</Tip>
174+
167175
## Deprecated environment variables
168176

169177
In order to standardize all environment variables within the Hugging Face ecosystem, some variables have been marked as deprecated. Although they remain functional, they no longer take precedence over their replacements. The following table outlines the deprecated variables and their corresponding alternatives:
