**docs/hub/storage-backends.md** (9 additions, 3 deletions)
### Using Xet Storage
To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users).
> [!TIP]
> For user and organization profiles created before May 23rd, 2025, you can make Xet the default for all your repositories by [signing up here](https://huggingface.co/join/xet). You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all existing repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default.

To access a Xet-aware version of `huggingface_hub`, simply install the latest version:

```
pip install -U huggingface_hub
```
As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
If you use the `transformers` or `datasets` libraries, they already use `huggingface_hub`. As long as your version of `huggingface_hub` is >= 0.32.0, no further action needs to be taken.
Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage:
- **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power, read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
- **Be specific in `.gitattributes`**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.
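
The incremental-commit advice above rests on chunk-level dedup: only chunks whose content changed need to travel. A toy sketch of the idea (fixed-size chunks for simplicity; the real system uses content-defined chunking):

```python
import hashlib
import random

# Toy illustration of chunk-level dedup (NOT the real Xet chunker:
# this uses fixed-size chunks, while Xet uses content-defined chunking).
CHUNK = 64 * 1024

def chunk_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

random.seed(0)
v1 = bytes(random.getrandbits(8) for _ in range(200_000))                # original file
v2 = v1[:150_000] + bytes(random.getrandbits(8) for _ in range(50_000))  # edit near the end

old, new = chunk_hashes(v1), chunk_hashes(v2)
to_upload = [h for h in new if h not in set(old)]
# Only the chunks covering the edited tail differ, so an incremental
# commit uploads 2 of the 4 chunks instead of the whole file.
```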
The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned, incurring significant overhead in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine).
This leads to a worse developer experience and a proliferation of redundant storage.
## Open Source Xet Protocol
The Xet storage backend is built on an open source protocol that enables efficient, chunk-based storage and retrieval of large files. This protocol provides the foundation for the deduplication and performance benefits described throughout this documentation.
For detailed technical specifications about the Xet protocol, including API endpoints, authentication mechanisms, chunking algorithms, and file reconstruction processes, see the [Xet Protocol Specification](./xet/index).

**docs/hub/xet/api.md** (3 additions)

For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31), reverse the order of the bytes.

Stated otherwise, consider each 8-byte part of the hash as a little-endian 64-bit unsigned integer, then concatenate the hexadecimal representations of the 4 numbers in order (each zero-padded to 16 characters).
> [!NOTE]
> In all cases where a hash is represented as a string, it is converted from the byte array using this procedure.
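
A minimal sketch of this byte-array-to-string procedure (the function name is illustrative):

```python
def hash_to_hex(digest: bytes) -> str:
    """Render a 32-byte hash as a string: each 8-byte group is read as a
    little-endian u64 and written as 16 zero-padded hex characters."""
    assert len(digest) == 32
    return "".join(
        format(int.from_bytes(digest[i:i + 8], "little"), "016x")
        for i in range(0, 32, 8)
    )
```

For example, an all-zero hash renders as 64 `0` characters.
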
- start_offset: start offset of the current chunk, initialized to 0
### Per-byte Update Rule (Gearhash)
For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
```text
h = (h << 1) + TABLE[b]
```
### Boundary Test and Size Constraints
At each position after updating `h`, let `size = current_offset - start_offset + 1`.
Given that `MASK` has 16 one-bits, for a random 64-bit hash `h`, the chance that all masked bits are zero is 2^-16, so a boundary occurs on average every 64 KiB.

- Locality: small edits only affect nearby boundaries
- Linear time and constant memory: single 64-bit state and counters
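
Putting the update rule and boundary test together, a sketch of the chunking loop (the `TABLE`, `MASK`, and size limits below are placeholders for illustration, not the spec's fixed values):

```python
import random

# Illustrative Gearhash chunker. TABLE, MASK, and the size limits are
# placeholders; a real implementation uses the fixed 256-entry table
# and parameters defined by the spec.
random.seed(0)
TABLE = [random.getrandbits(64) for _ in range(256)]
MASK = 0xFFFF000000000000        # example mask with 16 one-bits
MIN_CHUNK_SIZE = 8 * 1024
MAX_CHUNK_SIZE = 128 * 1024
U64 = (1 << 64) - 1

def chunk_boundaries(data: bytes):
    """Yield (start, end) chunk spans over data."""
    h, start = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & U64          # 64-bit wrapping update
        size = i - start + 1
        if (size >= MIN_CHUNK_SIZE and (h & MASK) == 0) or size >= MAX_CHUNK_SIZE:
            yield (start, i + 1)
            start, h = i + 1, 0                  # reset state at a boundary
    if start < len(data):
        yield (start, len(data))                 # final partial chunk
```

Note how the sketch encodes the rules above: the mask test is applied only once `size >= MIN_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE`, and `h` resets at each emitted boundary.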
### Intuition and Rationale
- The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
- The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
- Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
### Implementation Notes
- Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
- Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
- MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
- Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].
### Edge Cases
- Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
- Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
### Portability and Determinism
- With the fixed `TABLE[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
- Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
- SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].

Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
Cut-point skipping lets us intentionally skip testing some of the data, accelerating scanning without affecting correctness.
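
One way to see why skipping is safe: since `h = (h << 1) + TABLE[b]` is computed mod 2^64, a byte's contribution is shifted out after 64 subsequent updates, so the hash at any position depends only on the preceding 64 bytes. A scanner can therefore jump ahead after each boundary and warm the hash up just before the earliest legal cut point. A sketch with placeholder parameters (not the spec's values):

```python
# Sketch of cut-point skipping. Because a byte's contribution to h
# vanishes after 64 left-shifts mod 2**64, the hash at any position
# depends only on the 64 bytes before it. After a boundary we can skip
# to min_size - 64 bytes in and start hashing there; the boundaries
# found are identical to a full scan.
WINDOW = 64
U64 = (1 << 64) - 1

def next_boundary(data, start, table, mask, min_size, max_size):
    """Find the next chunk boundary at or after `start`."""
    end = min(start + max_size, len(data))
    h = 0
    # Warm the hash up WINDOW bytes before the earliest legal cut point.
    for j in range(start + max(min_size - WINDOW, 0), end):
        h = ((h << 1) + table[data[j]]) & U64
        if j - start + 1 >= min_size and (h & mask) == 0:
            return j + 1
    return end  # boundary forced at max_size, or end of data
```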

**docs/hub/xet/file-id.md** (2 additions, 1 deletion)

# Getting a Xet File ID from the Hugging Face Hub
This section explains the Xet file ID used in the reconstruction API to download a file from the Hugging Face Hub using the Xet protocol.
If the file is stored on the Xet system, then a successful response will include a dedicated response header.

The string value of this header is the Xet file ID and SHOULD be used in the path of the reconstruction API URL.
This is the string representation of the hash and can be used directly in the file reconstruction API on download.
> [!NOTE]
> The resolve URL will return a 302 redirect HTTP status code; following the redirect will download the content via the old LFS-compatible route rather than through the Xet protocol.
> In order to use the Xet protocol, you MUST NOT follow this redirect.

**docs/hub/xet/index.md** (6 additions, 5 deletions)

# Xet Protocol Specification
> [!NOTE]
> Version 0.1.0 (1.0.0 on release)
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119](https://www.ietf.org/rfc/rfc2119.txt) [RFC8174](https://www.ietf.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here.

This specification defines the end-to-end Xet protocol for content-addressed data storage.
Its goal is interoperability and determinism: independent implementations MUST produce the same hashes, objects, and API behavior so data written by one client can be read by another with integrity and performance.
Implementors can create their own clients, SDKs, and tools that speak the Xet protocol and interface with the CAS service, provided they adhere to the requirements defined here.
## Building a Client Library for Xet Storage
- [Upload Protocol](./upload-protocol): End-to-end, top-level description of the upload flow.
- [Download Protocol](./download-protocol): Instructions for the download procedure.
- [CAS API](./api): HTTP endpoints for reconstruction, global chunk dedupe, xorb upload, and shard upload, including error semantics.
- [Authentication and Authorization](./auth): How to obtain Xet tokens from the Hugging Face Hub, token scopes, and security considerations.
- [Converting Hugging Face Hub Files to Xet File IDs](./file-id): How to obtain a Xet file ID from the Hugging Face Hub for a particular file in a model or dataset repository.
## Overall Xet Architecture
- [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
- [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs, and term verification entries.
- [Shard Format](./shard): Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
- [Deduplication](./deduplication): Explanation of chunk-level dedupe, including global, system-wide chunk-level dedupe.
## Reference Implementation
### xet-core: hf-xet + git-xet
The primary reference implementation of the protocol, written in Rust 🦀, lives in [xet-core](https://github.com/huggingface/xet-core).

- [hf_xet](https://github.com/huggingface/xet-core/tree/main/hf_xet) - Python bindings to use the Xet protocol for uploads and downloads with the Hugging Face Hub.
- [git-xet](https://github.com/huggingface/xet-core/tree/main/git-xet) - Git LFS custom transfer agent that uploads files using the Xet protocol to the Hugging Face Hub.
### huggingface.js
There is also a second reference implementation in huggingface.js that can be used when downloading or uploading files with the `@huggingface/hub` library.

> Note that when describing a chunk range in a `FileDataSequenceEntry`, use ranges that are start-inclusive but end-exclusive, i.e. `[chunk_index_start, chunk_index_end)`.
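
A small illustration of the start-inclusive, end-exclusive convention (the variable names are hypothetical):

```python
# [chunk_index_start, chunk_index_end) selects chunks 1, 2, and 3:
chunks = ["c0", "c1", "c2", "c3", "c4"]
chunk_index_start, chunk_index_end = 1, 4
selected = chunks[chunk_index_start:chunk_index_end]
assert selected == ["c1", "c2", "c3"]
# The number of chunks covered is simply end - start:
assert len(selected) == chunk_index_end - chunk_index_start
```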
**Memory Layout**:
## 4. Footer (MDBShardFileFooter)
> [!NOTE]
> MUST NOT include the footer when serializing the shard as the body for the shard upload API.