Commit ef50638

link from storage-backend and iterate on title formatting
1 parent b5b3bec commit ef50638

11 files changed: +47 additions, −28 deletions


docs/hub/_redirects.yml (0 additions, 1 deletion)

```diff
@@ -19,4 +19,3 @@ adapter-transformers: adapters
 security-two-fa: security-2fa
 repositories-recommendations: storage-limits
 xet: storage-backends#xet
-xet-spec: xet/index
```

docs/hub/storage-backends.md (9 additions, 3 deletions)

````diff
@@ -41,7 +41,7 @@ Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories d
 
 ### Using Xet Storage
 
-To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users).
+To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users).
 
 > [!TIP]
 > For user and organization profiles created before May 23rd, 2025, you can make Xet the default for all your repositories by [signing up here](https://huggingface.co/join/xet). You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all existing repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default.
@@ -54,7 +54,7 @@ To access a Xet-aware version of the `huggingface_hub`, simply install the lates
 pip install -U huggingface_hub
 ```
 
-As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
+As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
 
 If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub`. So long as the version of `huggingface_hub` >= 0.32.0, no further action needs to be taken.
 
@@ -77,7 +77,7 @@ To see more detailed usage docs, refer to the `huggingface_hub` docs for:
 Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage:
 
 - **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
-- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
+- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
 - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
 - **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
 - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.
@@ -120,3 +120,9 @@ The legacy storage system on the Hub, Git LFS utilizes many of the same conventi
 The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine).
 
 This leads to a worse developer experience along with a proliferation of additional storage.
+
+## Open Source Xet Protocol
+
+The Xet storage backend is built on an open source protocol that enables efficient, chunk-based storage and retrieval of large files. This protocol provides the foundation for the deduplication and performance benefits described throughout this documentation.
+
+For detailed technical specifications about the Xet protocol, including API endpoints, authentication mechanisms, chunking algorithms, and file reconstruction processes, see the [Xet Protocol Specification](./xet/index).
````

docs/hub/xet/api.md (3 additions, 0 deletions)

```diff
@@ -18,6 +18,9 @@ For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the orde
 
 Otherwise stated, consider each 8 byte part of a hash as a little endian 64 bit unsigned integer, then concatenate the hexadecimal representation of the 4 numbers in order (each padded with 0's to 16 characters).
 
+> [!NOTE]
+> In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure.
+
 ### Example
 
 Suppose a hash value is:
```
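The hash-to-string procedure described in this hunk (each 8-byte slice read as a little-endian u64, hex-encoded, zero-padded to 16 characters, then concatenated) can be sketched in Python. This is an editor's illustration of the stated procedure, not code from the commit:

```python
def hash_to_hex(h: bytes) -> str:
    """Convert a 32-byte hash to its string form: treat each 8-byte
    slice as a little-endian u64 and concatenate the four hex values,
    each zero-padded to 16 characters."""
    if len(h) != 32:
        raise ValueError("expected a 32-byte hash")
    parts = [int.from_bytes(h[i:i + 8], "little") for i in range(0, 32, 8)]
    return "".join(f"{p:016x}" for p in parts)
```

For example, a hash whose bytes are `00 01 02 ... 1f` begins with `0706050403020100`, since reading the first 8 bytes little-endian reverses their order.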

docs/hub/xet/auth.md (1 addition, 0 deletions)

```diff
@@ -101,6 +101,7 @@ Here's a basic implementation flow:
 4. **Token refresh (when needed):**
    Use the same API to generate a new token.
 
+> [!NOTE]
 > In `xet-core` we SHOULD add 30 seconds of buffer time before the provided `expiration` time to refresh the token.
 
 ## Token Scope
```

docs/hub/xet/chunking.md (8 additions, 8 deletions)

````diff
@@ -9,7 +9,7 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 +---------+---------+---------+---------+---------+---------+---------+--------------
 ```
 
-## Step-by-step algorithm (Gearhash-based CDC)
+## Step-by-step Algorithm (Gearhash-based CDC)
 
 ### Constant Parameters
 
@@ -24,15 +24,15 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 - h: 64-bit hash, initialized to 0
 - start_offset: start offset of the current chunk, initialized to 0
 
-### Per-byte update rule (Gearhash)
+### Per-byte Update Rule (Gearhash)
 
 For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
 
 ```text
 h = (h << 1) + TABLE[b]
 ```
 
-### Boundary test and size constraints
+### Boundary Test and Size Constraints
 
 At each position after updating `h`, let `size = current_offset - start_offset + 1`.
 
@@ -89,31 +89,31 @@ Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all
 - Locality: small edits only affect nearby boundaries
 - Linear time and constant memory: single 64-bit state and counters
 
-### Intuition and rationale
+### Intuition and Rationale
 
 - The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
 - The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
 - Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
 
-### Implementation notes
+### Implementation Notes
 
 - Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
 - Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
 - MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
 - Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].
 
-### Edge cases
+### Edge Cases
 
 - Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
 - Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
 
-### Portability and determinism
+### Portability and Determinism
 
 - With a fixed `T[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
 - Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
 - SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].
 
-## Minimum-size skip-ahead (cut-point skipping optimization)
+## Minimum-size Skip-ahead (Cut-point Skipping Optimization)
 
 Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
 We are able to intentionally skip testing some data with cut-point skipping to accelerate scanning without affecting correctness.
````

docs/hub/xet/deduplication.md (2 additions, 2 deletions)

```diff
@@ -25,9 +25,9 @@ A **chunk** is a variable-sized content block derived from files using Content-D
 
 [Detailed chunking description](./chunking)
 
-### Xorbs (Extended Object Blocks)
+### Xorbs
 
-**Xorbs** are containers that aggregate multiple chunks for efficient storage and transfer:
+**Xorbs** are objects that aggregate multiple chunks for efficient storage and transfer:
 
 - **Maximum size**: 64MB
 - **Maximum chunks**: 8,192 chunks per xorb
```
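Given the two limits stated in this diff (64MB maximum size, 8,192 chunks per xorb), a client packing chunks into xorbs might gate flushes like this. A sketch only; the function name and signature are illustrative, not part of any Xet client API:

```python
MAX_XORB_BYTES = 64 * 1024 * 1024   # maximum xorb size per the spec
MAX_XORB_CHUNKS = 8192              # maximum chunks per xorb per the spec

def fits_in_xorb(current_bytes: int, current_chunks: int,
                 next_chunk_len: int) -> bool:
    """True if one more chunk of next_chunk_len bytes still fits in the
    open xorb under both the byte and chunk-count limits."""
    return (current_bytes + next_chunk_len <= MAX_XORB_BYTES
            and current_chunks + 1 <= MAX_XORB_CHUNKS)
```

When this returns False, the client would seal and upload the current xorb, then start a new one with the pending chunk.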

docs/hub/xet/file-id.md (2 additions, 1 deletion)

```diff
@@ -1,4 +1,4 @@
-# Getting a File ID from the Hugging Face Hub
+# Getting a Xet File ID from the Hugging Face Hub
 
 This section explains the Xet file ID used in the reconstruction API to download a file from the HuggingFace hub using the xet protocol.
 
@@ -28,5 +28,6 @@ If the file is stored on the xet system then a successful response will have a `
 The string value of this header is the Xet file ID and SHOULD be used in the path of the reconstruction API URL.
 This is the string representation of the hash and can be used directly in the file reconstruction API on download.
 
+> [!NOTE]
 > The resolve URL will return a 302 redirect http status code, following the redirect will download the content via the old LFS compatible route rather than through the Xet protocol.
 In order to use the Xet protocol you MUST NOT follow this redirect.
```

docs/hub/xet/file-reconstruction.md (1 addition, 1 deletion)

````diff
@@ -22,7 +22,7 @@ The file is reconstructed by retrieving those chunk ranges, decoding them to raw
 
 ### Diagram
 
-> A file with 4 terms. Each term is a pointer to chunk range within a xorb.
+A file with 4 terms. Each term is a pointer to chunk range within a xorb.
 
 ```txt
 File Reconstruction
````

docs/hub/xet/index.md (6 additions, 5 deletions)

```diff
@@ -1,5 +1,6 @@
 # Xet Protocol Specification
 
+> [!NOTE]
 > Version 0.1.0 (1.0.0 on release)
 > The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119](https://www.ietf.org/rfc/rfc2119.txt) [RFC8174](https://www.ietf.org/rfc/rfc8174.txt)
 when, and only when, they appear in all capitals, as shown here.
@@ -8,15 +9,15 @@ This specification defines the end-to-end Xet protocol for content-addressed dat
 Its goal is interoperability and determinism: independent implementations MUST produce the same hashes, objects, and API behavior so data written by one client can be read by another with integrity and performance.
 Implementors can create their own clients, SDKs, and tools that speak the Xet protocol and interface with the CAS service, as long as they MUST adhere to the requirements defined here.
 
-## Building a client library for xet storage
+## Building a Client Library for Xet Storage
 
 - [Upload Protocol](./upload-protocol): End-to-end top level description of the upload flow.
 - [Download Protocol](./download-protocol): Instructions for the download procedure.
 - [CAS API](./api): HTTP endpoints for reconstruction, global chunk dedupe, xorb upload, and shard upload, including error semantics.
 - [Authentication and Authorization](./auth): How to obtain Xet tokens from the Hugging Face Hub, token scopes, and security considerations.
-- [Hugging Face Hub Files Conversion to Xet File ID's](./file-id): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
+- [Converting Hugging Face Hub Files to Xet File ID's](./file-id): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
 
-## Overall Xet architecture
+## Overall Xet Architecture
 
 - [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
 - [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
@@ -25,7 +26,7 @@ Implementors can create their own clients, SDKs, and tools that speak the Xet pr
 - [Shard Format](./shard): Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
 - [Deduplication](./deduplication): Explanation of chunk level dedupe including global system-wide chunk level dedupe.
 
-## Reference implementation
+## Reference Implementation
 
 ### xet-core: hf-xet + git-xet
 
@@ -41,7 +42,7 @@ The primary reference implementation of the protocol written in rust 🦀 lives
 - [hf_xet](https://github.com/huggingface/xet-core/tree/main/hf_xet) - Python bindings to use the Xet protocol for uploads and downloads with the Hugging Face Hub.
 - [git-xet](ttps://github.com/huggingface/xet-core/tree/main/git-xet) - git lfs custom transfer agent that uploads files using the xet protocol to the Hugging Face Hub.
 
-### Huggingface.js
+### huggingface.js
 
 There is also a second reference implementation in Huggingface.js that can be used when downloading or uploading files with the `@huggingface/hub` library.
```

docs/hub/xet/shard.md (4 additions, 0 deletions)

````diff
@@ -132,6 +132,7 @@ struct MDBShardFileHeader {
 4. Verify version equals 2
 5. Read 8 bytes for footer_size (u64)
 
+> [!NOTE]
 > When serializing, footer_size MUST be the number of bytes that make up the footer, or 0 if the footer is omitted.
 
 ## 2. File Info Section
@@ -241,6 +242,7 @@
 }
 ```
 
+> [!NOTE]
 > Note that when describing a chunk range in a `FileDataSequenceEntry` use ranges that are start-inclusive but end-exclusive i.e. `[chunk_index_start, chunk_index_end)`
 
 **Memory Layout**:
@@ -427,6 +429,7 @@ Since the cas info section immediately follows the file info section bookend, a 
 
 ## 4. Footer (MDBShardFileFooter)
 
+> [!NOTE]
 > MUST NOT include the footer when serializing the shard as the body for the shard upload API.
 
 **Location**: End of file minus footer_size
@@ -448,6 +451,7 @@
 
 **Memory Layout**:
 
+> [!NOTE]
 > Fields are not exactly to scale
 
 ```txt
````
