
Commit f089514

move spec to docs (#515)
Publish the xet-spec docs to the Hub docs from xet-core. This needs to be merged first, before iterating, to get the GitHub workflows working correctly.
1 parent 4176674 commit f089514

17 files changed, 243 additions and 142 deletions
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+name: Build Documentation
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - main
+      - doc-builder*
+      - v*-release
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: xet-core
+      package_name: xet-spec
+      additional_args: --not_python_module
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: Build PR Documentation
+
+on:
+  pull_request:
+    paths:
+      - "docs/**"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  debug-workflow-name:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Echo workflow metadata
+        run: |
+          echo "github.workflow = ${{ github.workflow }}"
+          echo "github.workflow_ref = ${{ github.workflow_ref }}"
+          echo "github.workflow_sha = ${{ github.workflow_sha }}"
+          echo "github.run_id = ${{ github.run_id }}"
+          echo "github.event_name = ${{ github.event_name }}"
+          echo "github.event.number = ${{ github.event.number }}"
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    with:
+      commit_sha: ${{ github.event.pull_request.head.sha }}
+      pr_number: ${{ github.event.number }}
+      package: xet-core
+      package_name: xet-spec
+      additional_args: --not_python_module

.github/workflows/ci.yml

Lines changed: 0 additions & 11 deletions
@@ -105,14 +105,3 @@ jobs:
         working-directory: hf_xet_wasm
         run: |
           ./build_wasm.sh
-
-  lint_markdown:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Install markdownlint-cli
-        run: |
-          npm install -g markdownlint-cli
-      - name: Lint markdown
-        run: markdownlint spec/**/*.md --config markdownlint.toml
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+name: Upload PR Documentation
+
+on:
+  workflow_run:
+    workflows: ["Build PR Documentation"]
+    types: [completed]
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
+    with:
+      package_name: xet-spec
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

docs/source/_toctree.yml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+- local: index
+  title: Xet Protocol Specification
+
+- title: Building a client library for xet storage
+  sections:
+    - local: upload-protocol
+      title: Upload Protocol
+    - local: download-protocol
+      title: Download Protocol
+    - local: api
+      title: CAS API
+    - local: auth
+      title: Authentication and Authorization
+    - local: file-id
+      title: Hugging Face Hub Files Conversion to Xet File ID's
+
+- title: Overall Xet architecture
+  sections:
+    - local: chunking
+      title: Content-Defined Chunking
+    - local: hashing
+      title: Hashing Methods
+    - local: file-reconstruction
+      title: File Reconstruction
+    - local: xorb
+      title: Xorb Format
+    - local: shard
+      title: Shard Format
+    - local: deduplication
+      title: Deduplication

spec/api.md renamed to docs/source/api.md

Lines changed: 16 additions & 13 deletions
@@ -1,10 +1,10 @@
 # CAS API Documentation
 
-This document describes the HTTP API endpoints used by the CAS (Content Addressable Storage) client to interact with the remote CAS server.
+This document describes the HTTP API endpoints used by the Content Addressable Storage (CAS) client to interact with the remote CAS server.
 
 ## Authentication
 
-To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](../spec/auth.md).
+To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](./auth).
 
 ## Converting Hashes to Strings
 
@@ -18,6 +18,9 @@ For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the orde
 
 Otherwise stated, consider each 8 byte part of a hash as a little endian 64 bit unsigned integer, then concatenate the hexadecimal representation of the 4 numbers in order (each padded with 0's to 16 characters).
 
+> [!NOTE]
+> In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure.
+
 ### Example
 
 Suppose a hash value is:
@@ -38,7 +41,7 @@ It is: `07060504030201000f0e0d0c0b0a0908171615141312111f1e1d1c1b1a1918`.
 - **Method**: `GET`
 - **Parameters**:
 - `file_id`: File hash in hex format (64 lowercase hexadecimal characters).
-See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [file hashes](./hashing#file-hashes) for computing the file hash and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Headers**:
 - `Range`: OPTIONAL. Format: `bytes={start}-{end}` (end is inclusive).
 - **Minimum Token Scope**: `read`
@@ -53,7 +56,7 @@ See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash an
 }
 ```
 
-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Malformed `file_id` in the path. Fix the path before retrying.
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
 - `404 Not Found`: The file does not exist. Not retryable.
@@ -67,7 +70,7 @@ OPTIONAL: -H Range: "bytes=0-100000"
 
 ### Example File Reconstruction Response Body
 
-See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructionresponse-structure) for more details in the download protocol specification.
+See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification.
 
 ### 2. Query Chunk Deduplication (Global Deduplication)
 
@@ -77,11 +80,11 @@ See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructi
 - **Parameters**:
 - `prefix`: The only acceptable prefix for the Global Deduplication API is `default-merkledb`.
 - `hash`: Chunk hash in hex format (64 lowercase hexadecimal characters).
-See [Chunk Hashes](../spec/hashing.md#chunk-hashes) to compute the chunk hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [Chunk Hashes](./hashing#chunk-hashes) to compute the chunk hash and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Minimum Token Scope**: `read`
 - **Body**: None.
-- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](../spec/shard.md#global-deduplication).
-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](./shard#global-deduplication).
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Malformed hash in the path. Fix the path before retrying.
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
 - `404 Not Found`: Chunk not already tracked by global deduplication. Not retryable.
@@ -103,10 +106,10 @@ An example shard response body can be found in [Xet reference files](https://hug
 - **Parameters**:
 - `prefix`: The only acceptable prefix for the Xorb upload API is `default`.
 - `hash`: Xorb hash in hex format (64 lowercase hexadecimal characters).
-See [Xorb Hashes](../spec/hashing.md#xorb-hashes) to compute the hash, and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
+See [Xorb Hashes](./hashing#xorb-hashes) to compute the hash, and [converting hashes to strings](./api#converting-hashes-to-strings).
 - **Minimum Token Scope**: `write`
 - **Body**: Serialized Xorb bytes (`application/octet-stream`).
-See [xorb format serialization](../spec/xorb.md).
+See [xorb format serialization](./xorb).
 - **Response**: JSON (`UploadXorbResponse`)
 
 ```json
@@ -117,7 +120,7 @@ See [xorb format serialization](../spec/xorb.md).
 
 - Note: `was_inserted` is `false` if the Xorb already exists; this is not an error.
 
-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized.
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
 - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token.
@@ -139,7 +142,7 @@ Uploads file reconstructions and new xorb listing, serialized into the shard for
 - **Method**: `POST`
 - **Minimum Token Scope**: `write`
 - **Body**: Serialized Shard data as bytes (`application/octet-stream`).
-See [Shard format guide](../spec/shard.md#shard-upload).
+See [Shard format guide](./shard#shard-upload).
 - **Response**: JSON (`UploadShardResponse`)
 
 ```json
@@ -154,7 +157,7 @@ See [Shard format guide](../spec/shard.md#shard-upload).
 
 The value of `result` does not carry any meaning, if the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded.
 
-- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification.
 - Can mean that a referenced Xorb doesn't exist or the shard is too large
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
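
Illustration (not part of this commit): the hash-to-string procedure quoted in the `Converting Hashes to Strings` hunk of `docs/source/api.md` can be sketched in Python as follows. This is a minimal sketch that assumes a 32-byte hash; the function names `hash_to_hex` and `hex_to_hash` are hypothetical and do not appear in the spec or in `xet-core`.

```python
import struct


def hash_to_hex(hash_bytes: bytes) -> str:
    """Render a 32-byte hash as a 64-character lowercase hex string."""
    if len(hash_bytes) != 32:
        raise ValueError(f"expected 32 bytes, got {len(hash_bytes)}")
    # Read the four 8-byte groups as little-endian unsigned 64-bit integers.
    words = struct.unpack("<4Q", hash_bytes)
    # Print each integer as 16 zero-padded hex digits and concatenate in order.
    return "".join(f"{word:016x}" for word in words)


def hex_to_hash(hex_string: str) -> bytes:
    """Inverse of hash_to_hex (hypothetical helper for round-trip checks)."""
    if len(hex_string) != 64:
        raise ValueError(f"expected 64 hex characters, got {len(hex_string)}")
    words = [int(hex_string[i:i + 16], 16) for i in range(0, 64, 16)]
    return struct.pack("<4Q", *words)
```

Reading each group with `<Q` is equivalent to reversing the byte order within every 8-byte group before hex-encoding, which is the other phrasing the spec uses.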

spec/auth.md renamed to docs/source/auth.md

Lines changed: 5 additions & 4 deletions
@@ -1,6 +1,6 @@
 # Authentication and Authorization
 
-To invoke any API's mentioned in this specification a client MUST first acquire a token (and the url) to authenticate against the server which serves these API's.
+To invoke any API's mentioned in this specification a client MUST first acquire a token (and the URL) to authenticate against the server which serves these API's.
 
 The Xet protocol server uses bearer authentication via a token generated by the Hugging Face Hub (<https://huggingface.co>).
 
@@ -16,14 +16,14 @@ https://huggingface.co/api/{repo_type}s/{repo_id}/xet-{token_type}-token/{revisi
 
 **Parameters:**
 
-All parameters are required to form the url.
+All parameters are required to form the URL.
 
 - `repo_type`: Type of repository - `model`, `dataset`, or `space`
 - `repo_id`: Repository identifier in format `namespace/repo-name`
 - `token_type`: Either `read` or `write`.
 - `revision`: Git revision (branch, tag, or commit hash; default to using `main` if no specific ref is required)
 
-To understand the distinction for between `token_type` values read onwards in this document to [Token Scope](../spec/auth.md#token-scope).
+To understand the distinction for between `token_type` values read onwards in this document to [Token Scope](./auth#token-scope).
 
 **Example URLs:**
 
@@ -101,6 +101,7 @@ Here's a basic implementation flow:
 4. **Token refresh (when needed):**
 Use the same API to generate a new token.
 
+> [!NOTE]
 > In `xet-core` we SHOULD add 30 seconds of buffer time before the provided `expiration` time to refresh the token.
 
 ## Token Scope
@@ -109,7 +110,7 @@ Xet tokens can have either a `read` or a `write` scope.
 `write` scope supersedes `read` scope and all `read` scope API's can be invoked when using a `write` scope token.
 The type of token issued is determined on the `token_type` URI path component when requesting the token from the Hugging Face Hub (see above).
 
-Revise API specification for what scope level is necessary to invoke each API (briefly, only `POST /shard` and `POST /xorb/*` API's require `write` scope).
+Check API specification for what scope level is necessary to invoke each API (briefly, only `POST /shard` and `POST /xorb/*` API's require `write` scope).
 
 The scope of the Xet tokens is limited to the repository and ref for which they were issued. To upload or download from different repositories or refs (different branches) clients MUST be issued different tokens.
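
Illustration (not part of this commit): the refresh note in `docs/source/auth.md` asks for a 30-second buffer before the provided `expiration` time. A minimal sketch of that check follows. It assumes `expiration` is a Unix timestamp in seconds, which this diff does not confirm, and the name `needs_refresh` is hypothetical.

```python
import time
from typing import Optional

# Buffer taken from the note in auth.md: refresh 30 seconds before expiry.
REFRESH_BUFFER_SECONDS = 30


def needs_refresh(expiration: float,
                  now: Optional[float] = None,
                  buffer_seconds: float = REFRESH_BUFFER_SECONDS) -> bool:
    """Return True once `now` is within `buffer_seconds` of `expiration`.

    Assumption: `expiration` is a Unix timestamp in seconds.
    """
    if now is None:
        now = time.time()
    return now >= expiration - buffer_seconds
```

A client would call `needs_refresh(token_expiration)` before each request and, when it returns `True`, request a new token from the same Hub endpoint used to obtain the first one.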

spec/chunking.md renamed to docs/source/chunking.md

Lines changed: 10 additions & 10 deletions
@@ -9,7 +9,7 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 +---------+---------+---------+---------+---------+---------+---------+--------------
 ```
 
-## Step-by-step algorithm (Gearhash-based CDC)
+## Step-by-step Algorithm (Gearhash-based CDC)
 
 ### Constant Parameters
 
@@ -24,15 +24,15 @@ File -> | chunk 0 | chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 | chunk 6 |
 - h: 64-bit hash, initialized to 0
 - start_offset: start offset of the current chunk, initialized to 0
 
-### Per-byte update rule (Gearhash)
+### Per-byte Update Rule (Gearhash)
 
 For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
 
 ```text
 h = (h << 1) + TABLE[b]
 ```
 
-### Boundary test and size constraints
+### Boundary Test and Size Constraints
 
 At each position after updating `h`, let `size = current_offset - start_offset + 1`.
 
@@ -81,39 +81,39 @@ if start_offset < len(data):
 
 ### Boundary probability and mask selection
 
-Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.
+Given that MASK has 16 one-bits, for a random 64-bit hash `h`, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.
 
 ### Properties
 
 - Deterministic boundaries: same content → same chunks
 - Locality: small edits only affect nearby boundaries
 - Linear time and constant memory: single 64-bit state and counters
 
-### Intuition and rationale
+### Intuition and Rationale
 
 - The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
 - The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
 - Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
 
-### Implementation notes
+### Implementation Notes
 
 - Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
 - Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
 - MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
 - Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].
 
-### Edge cases
+### Edge Cases
 
 - Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
 - Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
 
-### Portability and determinism
+### Portability and Determinism
 
 - With a fixed `T[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
 - Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
 - SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].
 
-## Minimum-size skip-ahead (cut-point skipping optimization)
+## Minimum-size Skip-ahead (Cut-point Skipping Optimization)
 
 Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
 We are able to intentionally skip testing some data with cut-point skipping to accelerate scanning without affecting correctness.
@@ -141,7 +141,7 @@ The [xet-team/xet-spec-reference-files](https://huggingface.co/datasets/xet-team
 
 In the same repository in file [Electric_Vehicle_Population_Data_20250917.csv.chunks](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks)
 the chunks produced out of [Electric_Vehicle_Population_Data_20250917.csv](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv) are listed.
-Each line in the file is a 64 hexadecimal hash of the chunk, followed by a space and then the number of bytes in that chunk.
+Each line in the file is a 64 hexadecimal character string version of the hash of the chunk, followed by a space and then the number of bytes in that chunk.
 
 Implementors should use the chunk lengths to determine that they are producing the right chunk boundaries for this file with their chunking implementation.
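
Illustration (not part of this commit): the per-byte update rule, the boundary test, and the implementation notes in `docs/source/chunking.md` can be combined into a short scalar chunker. This is a sketch under stated assumptions: `demo_table` builds a stand-in pseudo-random table and is NOT the real `TABLE[256]`, and the mask and size limits are passed in as parameters because their actual values are not visible in this diff. It also omits the cut-point skipping optimization.

```python
import random

U64_MASK = (1 << 64) - 1  # emulates 64-bit wrapping arithmetic


def demo_table(seed: int = 0) -> list:
    """Hypothetical stand-in for TABLE[256]; not the table used by xet-core."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


def chunk_boundaries(data: bytes, table: list, mask: int,
                     min_size: int, max_size: int) -> list:
    """Return the exclusive end offset of every chunk in `data`."""
    boundaries = []
    h = 0
    start_offset = 0
    for offset, b in enumerate(data):
        # Per-byte update rule: h = (h << 1) + TABLE[b], wrapped to 64 bits.
        h = ((h << 1) + table[b]) & U64_MASK
        size = offset - start_offset + 1
        if size < min_size:
            continue  # the mask test is not applied below the minimum size
        if size >= max_size or (h & mask) == 0:
            boundaries.append(offset + 1)
            start_offset = offset + 1
            h = 0  # reset only when a boundary is emitted
    if start_offset < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries
```

With the real table and constants, the chunk lengths derived from these boundaries are what an implementor would compare against the `.chunks` reference file mentioned above.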

spec/deduplication.md renamed to docs/source/deduplication.md

Lines changed: 7 additions & 7 deletions
@@ -23,11 +23,11 @@ A **chunk** is a variable-sized content block derived from files using Content-D
 - **Size range**: 8KB to 128KB (minimum and maximum constraints)
 - **Identification**: Each chunk is uniquely identified by its cryptographic hash (MerkleHash)
 
-[Detailed chunking description](../spec/chunking.md)
+[Detailed chunking description](./chunking)
 
-### Xorbs (Extended Object Blocks)
+### Xorbs
 
-**Xorbs** are containers that aggregate multiple chunks for efficient storage and transfer:
+**Xorbs** are objects that aggregate multiple chunks for efficient storage and transfer:
 
 - **Maximum size**: 64MB
 - **Maximum chunks**: 8,192 chunks per xorb
@@ -96,7 +96,7 @@ Xet employs a three-tiered deduplication strategy to maximize efficiency while m
 
 #### Level 3: Global Deduplication API
 
-**Scope**: Entire Xet ecosystem
+**Scope**: Entire Xet system
 **Mechanism**: Global deduplication service with HMAC protection
 **Purpose**: Discover deduplication opportunities across all users and repositories
 
@@ -143,11 +143,11 @@ They MAY know this chunk hash because they own this data, the match has made the
 ### Chunk Hash Computation
 
 Each chunk has its content hashed using a cryptographic hash function (Blake3-based MerkleHash) to create a unique identifier for content addressing.
-[See section about hashing](../spec/hashing.md#chunk-hashes).
+[See section about hashing](./hashing#chunk-hashes).
 
 ### Xorb Formation
 
-When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](../xorb.md)
+When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](./xorb)
 
 ### File Reconstruction Information
 
@@ -164,7 +164,7 @@ This information allows the system to reconstruct files by:
 2. Extracting the specific chunk ranges from each xorb
 3. Concatenating chunks in the correct order
 
-[See section about file reconstruction](../file_reconstruction.md).
+[See section about file reconstruction](./file-reconstruction).
 
 ## Fragmentation Prevention
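
Illustration (not part of this commit): the `Xorb Formation` paragraph in `docs/source/deduplication.md` describes a greedy grouping rule, where the current xorb is finalized when adding the next chunk would exceed the size or count limit. A minimal sketch of that rule over chunk sizes follows. The name `group_into_xorbs` is hypothetical, and treating "64MB" as 64 MiB is an assumption; serialization and upload are out of scope here.

```python
# Limits quoted in deduplication.md; reading "64MB" as 64 MiB is an assumption.
MAX_XORB_BYTES = 64 * 1024 * 1024
MAX_XORB_CHUNKS = 8192


def group_into_xorbs(chunk_sizes, max_bytes=MAX_XORB_BYTES,
                     max_chunks=MAX_XORB_CHUNKS):
    """Greedily group chunk sizes into xorbs without exceeding either limit."""
    xorbs = []
    current = []
    current_bytes = 0
    for size in chunk_sizes:
        would_overflow = (current_bytes + size > max_bytes
                          or len(current) + 1 > max_chunks)
        if current and would_overflow:
            # Finalize the current xorb before starting a new one.
            xorbs.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        xorbs.append(current)  # finalize the last, possibly partial, xorb
    return xorbs
```

For example, `group_into_xorbs([4, 4, 4], max_bytes=8, max_chunks=10)` splits after the second chunk because a third would push the xorb past 8 bytes.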
