Commit ef50638

link from storage-backend and iterate on title formatting
1 parent b5b3bec commit ef50638

11 files changed: +47 additions, −28 deletions


docs/hub/_redirects.yml (0 additions, 1 deletion)

```diff
@@ -19,4 +19,3 @@ adapter-transformers: adapters
 security-two-fa: security-2fa
 repositories-recommendations: storage-limits
 xet: storage-backends#xet
-xet-spec: xet/index
```

docs/hub/storage-backends.md (9 additions, 3 deletions)

````diff
@@ -41,7 +41,7 @@ Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories d
 
 ### Using Xet Storage
 
-To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users).
+To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users).
 
 > [!TIP]
 > For user and organization profiles created before May 23rd, 2025, you can make Xet the default for all your repositories by [signing up here](https://huggingface.co/join/xet). You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all existing repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default.
@@ -54,7 +54,7 @@ To access a Xet-aware version of the `huggingface_hub`, simply install the lates
 pip install -U huggingface_hub
 ```
 
-As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
+As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
 
 If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub`. So long as the version of `huggingface_hub` >= 0.32.0, no further action needs to be taken.
 
@@ -77,7 +77,7 @@ To see more detailed usage docs, refer to the `huggingface_hub` docs for:
 Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage:
 
 - **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
-- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
+- **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads.
 - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
 - **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
 - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.
@@ -120,3 +120,9 @@ The legacy storage system on the Hub, Git LFS utilizes many of the same conventi
 The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine).
 
 This leads to a worse developer experience along with a proliferation of additional storage.
+
+## Open Source Xet Protocol
+
+The Xet storage backend is built on an open source protocol that enables efficient, chunk-based storage and retrieval of large files. This protocol provides the foundation for the deduplication and performance benefits described throughout this documentation.
+
+For detailed technical specifications about the Xet protocol, including API endpoints, authentication mechanisms, chunking algorithms, and file reconstruction processes, see the [Xet Protocol Specification](./xet/index).
````

docs/hub/xet/api.md (3 additions, 0 deletions)

```diff
@@ -18,6 +18,9 @@ For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the orde
 
 Otherwise stated, consider each 8 byte part of a hash as a little endian 64 bit unsigned integer, then concatenate the hexadecimal representation of the 4 numbers in order (each padded with 0's to 16 characters).
 
+> [!NOTE]
+> In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure.
+
 ### Example
 
 Suppose a hash value is:
```
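The hash-to-string procedure described in this hunk (each 8-byte slice read as a little-endian u64, hex-encoded, zero-padded to 16 characters, then concatenated) can be sketched in Python. This is an editor's illustration of the stated procedure, not code from the commit:

```python
def hash_to_hex(h: bytes) -> str:
    """Convert a 32-byte hash to its string form: treat each 8-byte
    slice as a little-endian u64 and concatenate the four hex values,
    each zero-padded to 16 characters."""
    if len(h) != 32:
        raise ValueError("expected a 32-byte hash")
    parts = [int.from_bytes(h[i:i + 8], "little") for i in range(0, 32, 8)]
    return "".join(f"{p:016x}" for p in parts)
```

For example, a hash whose bytes are `00 01 02 ... 1f` begins with `0706050403020100`, since reading the first 8 bytes little-endian reverses their order.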

docs/hub/xet/auth.md (1 addition, 0 deletions)

```diff
@@ -101,6 +101,7 @@ Here's a basic implementation flow:
 4. **Token refresh (when needed):**
    Use the same API to generate a new token.
 
+> [!NOTE]
 > In `xet-core` we SHOULD add 30 seconds of buffer time before the provided `expiration` time to refresh the token.
 
 ## Token Scope
```

docs/hub/xet/chunking.md (8 additions, 8 deletions)

````diff
@@ -9,7 +9,7 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 +---------+---------+---------+---------+---------+---------+---------+--------------
 ```
 
-## Step-by-step algorithm (Gearhash-based CDC)
+## Step-by-step Algorithm (Gearhash-based CDC)
 
 ### Constant Parameters
 
@@ -24,15 +24,15 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 - h: 64-bit hash, initialized to 0
 - start_offset: start offset of the current chunk, initialized to 0
 
-### Per-byte update rule (Gearhash)
+### Per-byte Update Rule (Gearhash)
 
 For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
 
 ```text
 h = (h << 1) + TABLE[b]
 ```
 
-### Boundary test and size constraints
+### Boundary Test and Size Constraints
 
 At each position after updating `h`, let `size = current_offset - start_offset + 1`.
 
@@ -89,31 +89,31 @@ Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all
 - Locality: small edits only affect nearby boundaries
 - Linear time and constant memory: single 64-bit state and counters
 
-### Intuition and rationale
+### Intuition and Rationale
 
 - The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
 - The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
 - Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
 
-### Implementation notes
+### Implementation Notes
 
 - Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
 - Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
 - MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
 - Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].
 
-### Edge cases
+### Edge Cases
 
 - Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
 - Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
 
-### Portability and determinism
+### Portability and Determinism
 
 - With a fixed `T[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
 - Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
 - SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].
 
-## Minimum-size skip-ahead (cut-point skipping optimization)
+## Minimum-size Skip-ahead (Cut-point Skipping Optimization)
 
 Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
 We are able to intentionally skip testing some data with cut-point skipping to accelerate scanning without affecting correctness.
````

docs/hub/xet/deduplication.md (2 additions, 2 deletions)

```diff
@@ -25,9 +25,9 @@ A **chunk** is a variable-sized content block derived from files using Content-D
 
 [Detailed chunking description](./chunking)
 
-### Xorbs (Extended Object Blocks)
+### Xorbs
 
-**Xorbs** are containers that aggregate multiple chunks for efficient storage and transfer:
+**Xorbs** are objects that aggregate multiple chunks for efficient storage and transfer:
 
 - **Maximum size**: 64MB
 - **Maximum chunks**: 8,192 chunks per xorb
```
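Given the two limits stated in this diff (64MB maximum size, 8,192 chunks per xorb), a client packing chunks into xorbs might gate flushes like this. A sketch only; the function name and signature are illustrative, not part of any Xet client API:

```python
MAX_XORB_BYTES = 64 * 1024 * 1024   # maximum xorb size per the spec
MAX_XORB_CHUNKS = 8192              # maximum chunks per xorb per the spec

def fits_in_xorb(current_bytes: int, current_chunks: int,
                 next_chunk_len: int) -> bool:
    """True if one more chunk of next_chunk_len bytes still fits in the
    open xorb under both the byte and chunk-count limits."""
    return (current_bytes + next_chunk_len <= MAX_XORB_BYTES
            and current_chunks + 1 <= MAX_XORB_CHUNKS)
```

When this returns False, the client would seal and upload the current xorb, then start a new one with the pending chunk.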

docs/hub/xet/file-id.md (2 additions, 1 deletion)

```diff
@@ -1,4 +1,4 @@
-# Getting a File ID from the Hugging Face Hub
+# Getting a Xet File ID from the Hugging Face Hub
 
 This section explains the Xet file ID used in the reconstruction API to download a file from the HuggingFace hub using the xet protocol.
 
@@ -28,5 +28,6 @@ If the file is stored on the xet system then a successful response will have a `
 The string value of this header is the Xet file ID and SHOULD be used in the path of the reconstruction API URL.
 This is the string representation of the hash and can be used directly in the file reconstruction API on download.
 
+> [!NOTE]
 > The resolve URL will return a 302 redirect http status code, following the redirect will download the content via the old LFS compatible route rather than through the Xet protocol.
 In order to use the Xet protocol you MUST NOT follow this redirect.
```

docs/hub/xet/file-reconstruction.md (1 addition, 1 deletion)

````diff
@@ -22,7 +22,7 @@ The file is reconstructed by retrieving those chunk ranges, decoding them to raw
 
 ### Diagram
 
-> A file with 4 terms. Each term is a pointer to chunk range within a xorb.
+A file with 4 terms. Each term is a pointer to chunk range within a xorb.
 
 ```txt
 File Reconstruction
````

docs/hub/xet/index.md (6 additions, 5 deletions)

```diff
@@ -1,5 +1,6 @@
 # Xet Protocol Specification
 
+> [!NOTE]
 > Version 0.1.0 (1.0.0 on release)
 > The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119](https://www.ietf.org/rfc/rfc2119.txt) [RFC8174](https://www.ietf.org/rfc/rfc8174.txt)
 when, and only when, they appear in all capitals, as shown here.
@@ -8,15 +9,15 @@ This specification defines the end-to-end Xet protocol for content-addressed dat
 Its goal is interoperability and determinism: independent implementations MUST produce the same hashes, objects, and API behavior so data written by one client can be read by another with integrity and performance.
 Implementors can create their own clients, SDKs, and tools that speak the Xet protocol and interface with the CAS service, as long as they MUST adhere to the requirements defined here.
 
-## Building a client library for xet storage
+## Building a Client Library for Xet Storage
 
 - [Upload Protocol](./upload-protocol): End-to-end top level description of the upload flow.
 - [Download Protocol](./download-protocol): Instructions for the download procedure.
 - [CAS API](./api): HTTP endpoints for reconstruction, global chunk dedupe, xorb upload, and shard upload, including error semantics.
 - [Authentication and Authorization](./auth): How to obtain Xet tokens from the Hugging Face Hub, token scopes, and security considerations.
-- [Hugging Face Hub Files Conversion to Xet File ID's](./file-id): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
+- [Converting Hugging Face Hub Files to Xet File ID's](./file-id): How to obtain a Xet file id from the Hugging Face Hub for a particular file in a model or dataset repository.
 
-## Overall Xet architecture
+## Overall Xet Architecture
 
 - [Content-Defined Chunking](./chunking): Gearhash-based CDC with parameters, boundary rules, and performance optimizations.
 - [Hashing Methods](./hashing): Descriptions and definitions of the different hashing functions used for chunks, xorbs and term verification entries.
@@ -25,7 +26,7 @@ Implementors can create their own clients, SDKs, and tools that speak the Xet pr
 - [Shard Format](./shard): Binary shard structure (header, file info, CAS info, footer), offsets, HMAC key usage, and bookends.
 - [Deduplication](./deduplication): Explanation of chunk level dedupe including global system-wide chunk level dedupe.
 
-## Reference implementation
+## Reference Implementation
 
 ### xet-core: hf-xet + git-xet
 
@@ -41,7 +42,7 @@ The primary reference implementation of the protocol written in rust 🦀 lives
 - [hf_xet](https://github.com/huggingface/xet-core/tree/main/hf_xet) - Python bindings to use the Xet protocol for uploads and downloads with the Hugging Face Hub.
 - [git-xet](ttps://github.com/huggingface/xet-core/tree/main/git-xet) - git lfs custom transfer agent that uploads files using the xet protocol to the Hugging Face Hub.
 
-### Huggingface.js
+### huggingface.js
 
 There is also a second reference implementation in Huggingface.js that can be used when downloading or uploading files with the `@huggingface/hub` library.
```

docs/hub/xet/shard.md (4 additions, 0 deletions)

````diff
@@ -132,6 +132,7 @@ struct MDBShardFileHeader {
 4. Verify version equals 2
 5. Read 8 bytes for footer_size (u64)
 
+> [!NOTE]
 > When serializing, footer_size MUST be the number of bytes that make up the footer, or 0 if the footer is omitted.
 
 ## 2. File Info Section
@@ -241,6 +242,7 @@
 }
 ```
 
+> [!NOTE]
 > Note that when describing a chunk range in a `FileDataSequenceEntry` use ranges that are start-inclusive but end-exclusive i.e. `[chunk_index_start, chunk_index_end)`
 
 **Memory Layout**:
@@ -427,6 +429,7 @@ Since the cas info section immediately follows the file info section bookend, a 
 
 ## 4. Footer (MDBShardFileFooter)
 
+> [!NOTE]
 > MUST NOT include the footer when serializing the shard as the body for the shard upload API.
 
 **Location**: End of file minus footer_size
@@ -448,6 +451,7 @@
 
 **Memory Layout**:
 
+> [!NOTE]
 > Fields are not exactly to scale
 
 ```txt
````
