docs/source/api.md
+16 -13 (16 additions & 13 deletions)
@@ -1,10 +1,10 @@
1
1
# CAS API Documentation
2
2
3
-
This document describes the HTTP API endpoints used by the CAS (Content Addressable Storage) client to interact with the remote CAS server.
3
+
This document describes the HTTP API endpoints used by the Content Addressable Storage (CAS) client to interact with the remote CAS server.
4
4
5
5
## Authentication
6
6
7
-
To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](../spec/auth.md).
7
+
To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](./auth).
8
8
9
9
## Converting Hashes to Strings
10
10
@@ -18,6 +18,9 @@ For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the orde
18
18
19
19
In other words, treat each 8-byte part of the hash as a little-endian 64-bit unsigned integer, then concatenate the hexadecimal representations of the four numbers in order (each zero-padded to 16 characters).
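
The conversion described above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation; the function name `hash_to_string` is hypothetical, and the sample input (bytes `0..31`) is chosen only for demonstration.

```python
def hash_to_string(hash_bytes: bytes) -> str:
    """Convert a 32-byte hash to its 64-character lowercase hex string.

    Each 8-byte group is read as a little-endian unsigned 64-bit integer
    and printed as 16 zero-padded hexadecimal digits.
    """
    if len(hash_bytes) != 32:
        raise ValueError("expected a 32-byte hash")
    parts = []
    for offset in range(0, 32, 8):
        value = int.from_bytes(hash_bytes[offset:offset + 8], "little")
        parts.append(f"{value:016x}")
    return "".join(parts)


# Illustrative input: the bytes 0, 1, 2, ..., 31.
example = hash_to_string(bytes(range(32)))
```

Reading each group as a little-endian integer is equivalent to reversing the byte order within the group before hex-encoding it.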
20
20
21
+
> [!NOTE]
22
+
> Whenever a hash is represented as a string, it is converted from a byte array to a string using this procedure.
23
+
21
24
### Example
22
25
23
26
Suppose a hash value is:
@@ -38,7 +41,7 @@ It is: `07060504030201000f0e0d0c0b0a090817161514131211101f1e1d1c1b1a1918`.
38
41
- **Method**: `GET`
39
42
- **Parameters**:
40
43
  - `file_id`: File hash in hex format (64 lowercase hexadecimal characters).
41
-
See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
44
+
See [file hashes](./hashing#file-hashes) for computing the file hash and [converting hashes to strings](./api#converting-hashes-to-strings).
42
45
- **Headers**:
43
46
  - `Range`: OPTIONAL. Format: `bytes={start}-{end}` (end is inclusive).
44
47
- **Minimum Token Scope**: `read`
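
Because `end` in the `Range` header is inclusive, requesting `length` bytes starting at `start` means `end = start + length - 1`. A small sketch, assuming a hypothetical helper named `range_header` (not part of any client library):

```python
def range_header(start: int, length: int) -> dict:
    """Build a Range header covering `length` bytes beginning at `start`.

    The end offset is inclusive, so it is one less than start + length.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1
    return {"Range": f"bytes={start}-{end}"}


first_hundred = range_header(0, 100)
```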
@@ -53,7 +56,7 @@ See [file hashes](../spec/hashing.md#file-hashes) for computing the file hash an
53
56
}
54
57
```
55
58
56
-
- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
59
+
- **Error Responses**: See [Error Cases](./api#error-cases)
57
60
  - `400 Bad Request`: Malformed `file_id` in the path. Fix the path before retrying.
58
61
  - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
59
62
  - `404 Not Found`: The file does not exist. Not retryable.
See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructionresponse-structure) for more details in the download protocol specification.
73
+
See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification.
@@ -77,11 +80,11 @@ See [QueryReconstructionResponse](../spec/download_protocol.md#queryreconstructi
77
80
- **Parameters**:
78
81
  - `prefix`: The only acceptable prefix for the Global Deduplication API is `default-merkledb`.
79
82
  - `hash`: Chunk hash in hex format (64 lowercase hexadecimal characters).
80
-
See [Chunk Hashes](../spec/hashing.md#chunk-hashes) to compute the chunk hash and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
83
+
See [Chunk Hashes](./hashing#chunk-hashes) to compute the chunk hash and [converting hashes to strings](./api#converting-hashes-to-strings).
81
84
- **Minimum Token Scope**: `read`
82
85
- **Body**: None.
83
-
- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](../spec/shard.md#global-deduplication).
84
-
- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
86
+
- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](./shard#global-deduplication).
87
+
- **Error Responses**: See [Error Cases](./api#error-cases)
85
88
  - `400 Bad Request`: Malformed hash in the path. Fix the path before retrying.
86
89
  - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
87
90
  - `404 Not Found`: Chunk not already tracked by global deduplication. Not retryable.
@@ -103,10 +106,10 @@ An example shard response body can be found in [Xet reference files](https://hug
103
106
- **Parameters**:
104
107
  - `prefix`: The only acceptable prefix for the Xorb upload API is `default`.
105
108
  - `hash`: Xorb hash in hex format (64 lowercase hexadecimal characters).
106
-
See [Xorb Hashes](../spec/hashing.md#xorb-hashes) to compute the hash, and [converting hashes to strings](../spec/api.md#converting-hashes-to-strings).
109
+
See [Xorb Hashes](./hashing#xorb-hashes) to compute the hash, and [converting hashes to strings](./api#converting-hashes-to-strings).
@@ -117,7 +120,7 @@ See [xorb format serialization](../spec/xorb.md).
117
120
118
121
- Note: `was_inserted` is `false` if the Xorb already exists; this is not an error.
119
122
120
-
- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
123
+
- **Error Responses**: See [Error Cases](./api#error-cases)
121
124
  - `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized.
122
125
  - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
123
126
  - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token.
@@ -139,7 +142,7 @@ Uploads file reconstructions and new xorb listing, serialized into the shard for
139
142
- **Method**: `POST`
140
143
- **Minimum Token Scope**: `write`
141
144
- **Body**: Serialized Shard data as bytes (`application/octet-stream`).
142
-
See [Shard format guide](../spec/shard.md#shard-upload).
145
+
See [Shard format guide](./shard#shard-upload).
143
146
- **Response**: JSON (`UploadShardResponse`)
144
147
145
148
```json
@@ -154,7 +157,7 @@ See [Shard format guide](../spec/shard.md#shard-upload).
154
157
155
158
The value of `result` does not carry any meaning. If the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded.
156
159
157
-
- **Error Responses**: See [Error Cases](../spec/api.md#error-cases)
160
+
- **Error Responses**: See [Error Cases](./api#error-cases)
158
161
  - `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification.
159
162
    - Can mean that a referenced Xorb doesn't exist or that the shard is too large.
160
163
  - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
docs/source/auth.md
+5 -4 (5 additions & 4 deletions)
@@ -1,6 +1,6 @@
1
1
# Authentication and Authorization
2
2
3
-
To invoke any of the APIs mentioned in this specification, a client MUST first acquire a token (and the url) to authenticate against the server that serves these APIs.
3
+
To invoke any of the APIs mentioned in this specification, a client MUST first acquire a token (and the URL) to authenticate against the server that serves these APIs.
4
4
5
5
The Xet protocol server uses bearer authentication via a token generated by the Hugging Face Hub (<https://huggingface.co>).
- `repo_type`: Type of repository - `model`, `dataset`, or `space`
22
22
- `repo_id`: Repository identifier in format `namespace/repo-name`
23
23
- `token_type`: Either `read` or `write`.
24
24
- `revision`: Git revision (branch, tag, or commit hash; default to using `main` if no specific ref is required)
25
25
26
-
To understand the distinction between `token_type` values, read on in this document to [Token Scope](../spec/auth.md#token-scope).
26
+
To understand the distinction between `token_type` values, read on in this document to [Token Scope](./auth#token-scope).
27
27
28
28
**Example URLs:**
29
29
@@ -101,6 +101,7 @@ Here's a basic implementation flow:
101
101
4. **Token refresh (when needed):**
102
102
Use the same API to generate a new token.
103
103
104
+
> [!NOTE]
104
105
> In `xet-core` we SHOULD add 30 seconds of buffer time before the provided `expiration` time to refresh the token.
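
A minimal sketch of the refresh check described in the note. It assumes `expiration` is a Unix timestamp in seconds, which is an assumption not confirmed by this excerpt; the function name `needs_refresh` is hypothetical.

```python
import time
from typing import Optional

REFRESH_BUFFER_SECONDS = 30  # buffer suggested in the note above


def needs_refresh(expiration: float, now: Optional[float] = None,
                  buffer_seconds: float = REFRESH_BUFFER_SECONDS) -> bool:
    """Return True once the current time is within the buffer of expiry.

    `expiration` is assumed to be a Unix timestamp in seconds.
    """
    if now is None:
        now = time.time()
    return now >= expiration - buffer_seconds
```

Passing `now` explicitly keeps the check easy to test. A client would call it before each request and request a new token from the same API when it returns `True`.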
105
106
106
107
## Token Scope
@@ -109,7 +110,7 @@ Xet tokens can have either a `read` or a `write` scope.
109
110
`write` scope supersedes `read` scope, and all `read` scope APIs can be invoked when using a `write` scope token.
110
111
The type of token issued is determined by the `token_type` URI path component when requesting the token from the Hugging Face Hub (see above).
111
112
112
-
Revise API specification for what scope level is necessary to invoke each API (briefly, only `POST /shard` and `POST /xorb/*` API's require `write` scope).
113
+
Check the API specification for the scope level necessary to invoke each API (briefly, only the `POST /shard` and `POST /xorb/*` APIs require `write` scope).
113
114
114
115
The scope of the Xet tokens is limited to the repository and ref for which they were issued. To upload or download from different repositories or refs (different branches) clients MUST be issued different tokens.
- start_offset: start offset of the current chunk, initialized to 0
26
26
27
-
### Per-byte update rule (Gearhash)
27
+
### Per-byte Update Rule (Gearhash)
28
28
29
29
For each input byte `b`, update the hash with 64-bit wrapping arithmetic:
30
30
31
31
```text
32
32
h = (h << 1) + TABLE[b]
33
33
```
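
Python integers do not overflow, so a sketch of the update rule in Python has to apply the 64-bit wrap explicitly. The table passed in below is a stand-in for illustration only; the real `TABLE` values come from the reference implementation and are not reproduced here.

```python
MASK64 = (1 << 64) - 1  # keeps results within 64 bits


def gear_update(h: int, b: int, table) -> int:
    """Apply one Gearhash step: h = (h << 1) + TABLE[b], wrapping at 64 bits."""
    return ((h << 1) + table[b]) & MASK64


# Stand-in table for demonstration only (NOT the real Gearhash table).
demo_table = list(range(256))
wrapped = gear_update(1 << 63, 0, demo_table)  # the top bit is shifted out
```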
34
34
35
-
### Boundary test and size constraints
35
+
### Boundary Test and Size Constraints
36
36
37
37
At each position after updating `h`, let `size = current_offset - start_offset + 1`.
38
38
@@ -81,39 +81,39 @@ if start_offset < len(data):
81
81
82
82
### Boundary probability and mask selection
83
83
84
-
Given that MASK has 16 one-bits, for a random 64-bit hash h, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.
84
+
Given that MASK has 16 one-bits, for a random 64-bit hash `h`, the chance that all those 16 bits are zero is 1 / 2^16. On average, that means you’ll see a match about once every 64 KiB.
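
As a worked version of this estimate, under the idealised assumption that the hash at each position is an independent, uniformly random 64-bit value (and ignoring the minimum and maximum chunk size constraints), the gap between matches is geometrically distributed:

```latex
P\big((h \mathbin{\&} \mathrm{MASK}) = 0\big) = 2^{-16},
\qquad
\mathbb{E}[\text{gap}] = \frac{1}{2^{-16}} = 2^{16}\ \text{bytes} = 65536\ \text{bytes} = 64\ \text{KiB}
```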
85
85
86
86
### Properties
87
87
88
88
- Deterministic boundaries: same content → same chunks
89
89
- Locality: small edits only affect nearby boundaries
90
90
- Linear time and constant memory: single 64-bit state and counters
91
91
92
-
### Intuition and rationale
92
+
### Intuition and Rationale
93
93
94
94
- The table `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
95
95
- The left shift `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
96
96
- Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
97
97
98
-
### Implementation notes
98
+
### Implementation Notes
99
99
100
100
- Only reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
101
101
- Apply the mask test only once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
102
102
- Implementations MUST force a boundary at `MAX_CHUNK_SIZE` even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
103
103
- Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation [rust-gearhash].
104
104
105
-
### Edge cases
105
+
### Edge Cases
106
106
107
107
- Tiny files: if `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
108
108
- Long runs without a match: if no position matches `(h & MASK) == 0` before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
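
The rules above (mask test only from the minimum size, forced boundary at the maximum size, reset of `h` only at a boundary, trailing data emitted as a final chunk) can be sketched as follows. This is an illustrative sketch, not the reference implementation: the function name `chunk_data` is hypothetical, and the table, mask, and size limits are passed in as parameters because their actual values are not reproduced in this excerpt.

```python
MASK64 = (1 << 64) - 1


def chunk_data(data: bytes, table, mask: int, min_size: int, max_size: int):
    """Split `data` into content-defined chunks using a Gearhash-style scan."""
    chunks = []
    h = 0
    start_offset = 0
    for current_offset, b in enumerate(data):
        # Per-byte update with 64-bit wrapping arithmetic.
        h = ((h << 1) + table[b]) & MASK64
        size = current_offset - start_offset + 1
        if size < min_size:
            continue  # never cut before the minimum chunk size
        if size >= max_size or (h & mask) == 0:
            chunks.append(data[start_offset:current_offset + 1])
            start_offset = current_offset + 1
            h = 0  # reset only when a boundary is emitted
    if start_offset < len(data):
        chunks.append(data[start_offset:])  # trailing bytes form the last chunk
    return chunks


# Toy parameters for demonstration only (not the real constants).
zero_table = [0] * 256  # h stays 0, so every mask test matches
ones_table = [1] * 256  # h is always odd, so mask=1 never matches
always_match = chunk_data(b"abcdefghij", zero_table, 0xFFFF, 4, 8)
never_match = chunk_data(b"abcdefghij", ones_table, 1, 4, 8)
tiny = chunk_data(b"abc", zero_table, 0xFFFF, 4, 8)
```

With the toy tables, the first call cuts at every minimum-size boundary, the second only at the forced maximum, and the third returns the whole input as one chunk.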
109
109
110
-
### Portability and determinism
110
+
### Portability and Determinism
111
111
112
112
- With a fixed `TABLE[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.
113
113
- Endianness does not affect behavior because updates are byte-wise and use scalar 64-bit operations.
114
114
- SIMD-accelerated implementations (when available) are optimizations only; they produce the same boundaries as the scalar path [rust-gearhash].
Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
119
119
Cut-point skipping lets an implementation intentionally skip testing some data, which accelerates scanning without affecting correctness.
@@ -141,7 +141,7 @@ The [xet-team/xet-spec-reference-files](https://huggingface.co/datasets/xet-team
141
141
142
142
In the same repository, the file [Electric_Vehicle_Population_Data_20250917.csv.chunks](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks)
143
143
lists the chunks produced from [Electric_Vehicle_Population_Data_20250917.csv](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv).
144
-
Each line in the file is a 64 hexadecimal hash of the chunk, followed by a space and then the number of bytes in that chunk.
144
+
Each line in the file contains the hash of the chunk as a 64-character hexadecimal string, followed by a space and then the number of bytes in that chunk.
145
145
146
146
Implementors should use the chunk lengths to verify that their chunking implementation produces the right chunk boundaries for this file.
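
A sketch of reading one line of such a `.chunks` file, assuming the format described above (hash string, one space, byte count). The function name `parse_chunk_line` is hypothetical, the lowercase-hex check is an assumption based on the hash string format used elsewhere in the specification, and the sample line is made up rather than taken from the reference file.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789abcdef")


def parse_chunk_line(line: str) -> Tuple[str, int]:
    """Parse '<64 hex chars> <byte count>' into (hash_hex, size)."""
    hash_hex, size_text = line.strip().split(" ")
    if len(hash_hex) != 64 or not set(hash_hex) <= HEX_DIGITS:
        raise ValueError("expected a 64-character lowercase hex hash")
    return hash_hex, int(size_text)


# Made-up sample line for illustration.
sample = parse_chunk_line("ab" * 32 + " 12345\n")
```

Comparing the parsed sizes, in order, against the lengths produced by a chunking implementation checks the boundaries without needing the hashing step.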
**Xorbs** are containers that aggregate multiple chunks for efficient storage and transfer:
30
+
**Xorbs** are objects that aggregate multiple chunks for efficient storage and transfer:
31
31
32
32
- **Maximum size**: 64MB
33
33
- **Maximum chunks**: 8,192 chunks per xorb
@@ -96,7 +96,7 @@ Xet employs a three-tiered deduplication strategy to maximize efficiency while m
96
96
97
97
#### Level 3: Global Deduplication API
98
98
99
-
**Scope**: Entire Xet ecosystem
99
+
**Scope**: Entire Xet system
100
100
**Mechanism**: Global deduplication service with HMAC protection
101
101
**Purpose**: Discover deduplication opportunities across all users and repositories
102
102
@@ -143,11 +143,11 @@ They MAY know this chunk hash because they own this data, the match has made the
143
143
### Chunk Hash Computation
144
144
145
145
Each chunk has its content hashed using a cryptographic hash function (Blake3-based MerkleHash) to create a unique identifier for content addressing.
146
-
[See section about hashing](../spec/hashing.md#chunk-hashes).
146
+
[See section about hashing](./hashing#chunk-hashes).
147
147
148
148
### Xorb Formation
149
149
150
-
When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](../xorb.md)
150
+
When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](./xorb)
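
The finalize-on-overflow rule can be sketched as a simple greedy grouping over chunk sizes. This is an illustration under assumptions: the function name `group_into_xorbs` is hypothetical, the 64MB limit is interpreted here as 64 MiB, and real xorb size accounting may differ from a plain sum of chunk lengths.

```python
MAX_XORB_BYTES = 64 * 1024 * 1024  # assumption: "64MB" read as 64 MiB
MAX_XORB_CHUNKS = 8192


def group_into_xorbs(chunk_sizes, max_bytes=MAX_XORB_BYTES,
                     max_chunks=MAX_XORB_CHUNKS):
    """Group chunk sizes into xorbs, finalizing before a limit is exceeded."""
    xorbs = []
    current = []
    current_bytes = 0
    for size in chunk_sizes:
        would_overflow = (current_bytes + size > max_bytes
                          or len(current) + 1 > max_chunks)
        if current and would_overflow:
            xorbs.append(current)  # finalize the current xorb
            current = []
            current_bytes = 0
        current.append(size)
        current_bytes += size
    if current:
        xorbs.append(current)  # finalize the last, partially filled xorb
    return xorbs
```

For example, with toy limits `max_bytes=8` and `max_chunks=10`, the sizes `[4, 4, 4]` are grouped as `[[4, 4], [4]]`.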
151
151
152
152
### File Reconstruction Information
153
153
@@ -164,7 +164,7 @@ This information allows the system to reconstruct files by:
164
164
2. Extracting the specific chunk ranges from each xorb
165
165
3. Concatenating chunks in the correct order
166
166
167
-
[See section about file reconstruction](../file_reconstruction.md).
167
+
[See section about file reconstruction](./file-reconstruction).