-See [QueryReconstructionResponse](./download-protocol.md#queryreconstructionresponse-structure) for more details in the download protocol specification.
+See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification.
@@ -117,7 +117,7 @@ See [xorb format serialization](./xorb.md).
 - Note: `was_inserted` is `false` if the Xorb already exists; this is not an error.

-- **Error Responses**: See [Error Cases](./api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized.
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
 - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token.
@@ -139,7 +139,7 @@ Uploads file reconstructions and new xorb listing, serialized into the shard for
 - **Method**: `POST`
 - **Minimum Token Scope**: `write`
 - **Body**: Serialized Shard data as bytes (`application/octet-stream`).
-See [Shard format guide](./shard.md#shard-upload).
+See [Shard format guide](./shard#shard-upload).
 - **Response**: JSON (`UploadShardResponse`)
```json
@@ -154,7 +154,7 @@ See [Shard format guide](./shard.md#shard-upload).
 The value of `result` does not carry any meaning, if the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded.

-- **Error Responses**: See [Error Cases](./api.md#error-cases)
+- **Error Responses**: See [Error Cases](./api#error-cases)
 - `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification.
   - Can mean that a referenced Xorb doesn't exist or the shard is too large
 - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
docs/hub/xet/chunking.md
+1 -1 (1 addition & 1 deletion)
@@ -141,7 +141,7 @@ The [xet-team/xet-spec-reference-files](https://huggingface.co/datasets/xet-team
 In the same repository in file [Electric_Vehicle_Population_Data_20250917.csv.chunks](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.chunks)
 the chunks produced out of [Electric_Vehicle_Population_Data_20250917.csv](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv) are listed.
-Each line in the file is a 64 hexadecimal hash of the chunk, followed by a space and then the number of bytes in that chunk.
+Each line in the file is a 64 hexadecimal character string version of the hash of the chunk, followed by a space and then the number of bytes in that chunk.

 Implementors should use the chunk lengths to determine that they are producing the right chunk boundaries for this file with their chunking implementation.
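The check described in this hunk can be sketched as follows. This is a minimal illustration, not part of the spec: it assumes the `.chunks` listing is plain text with one `<64 hex characters> <byte length>` entry per line, and the function names `parse_chunks_listing` and `chunk_boundaries` are hypothetical.

```python
import string
from typing import List, Tuple

HEX_DIGITS = set(string.hexdigits)


def parse_chunks_listing(text: str) -> List[Tuple[str, int]]:
    """Parse lines of the form '<64 hex chars> <length>' into (hash, length) pairs."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (assumption, not stated in the source)
        parts = line.split(" ")
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<hash> <length>'")
        hash_hex, length_str = parts
        if len(hash_hex) != 64 or not set(hash_hex) <= HEX_DIGITS:
            raise ValueError(f"line {line_no}: hash is not 64 hexadecimal characters")
        entries.append((hash_hex, int(length_str)))
    return entries


def chunk_boundaries(entries: List[Tuple[str, int]]) -> List[int]:
    """Return the cumulative end offset of each chunk within the file."""
    boundaries = []
    offset = 0
    for _, length in entries:
        offset += length
        boundaries.append(offset)
    return boundaries
```

Comparing `chunk_boundaries(parse_chunks_listing(reference_text))` against the boundaries produced by a chunker under test shows the first position at which the two diverge.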
docs/hub/xet/deduplication.md
+4 -4 (4 additions & 4 deletions)
@@ -23,7 +23,7 @@ A **chunk** is a variable-sized content block derived from files using Content-D
 - **Size range**: 8KB to 128KB (minimum and maximum constraints)
 - **Identification**: Each chunk is uniquely identified by its cryptographic hash (MerkleHash)

-[Detailed chunking description](./chunking.md)
+[Detailed chunking description](./chunking)

 ### Xorbs (Extended Object Blocks)
@@ -143,11 +143,11 @@ They MAY know this chunk hash because they own this data, the match has made the
 ### Chunk Hash Computation

 Each chunk has its content hashed using a cryptographic hash function (Blake3-based MerkleHash) to create a unique identifier for content addressing.
-[See section about hashing](./hashing.md#chunk-hashes).
+[See section about hashing](./hashing#chunk-hashes).

 ### Xorb Formation

-When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](../xorb.md)
+When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded. [See section about xorb formation](../xorb)

 ### File Reconstruction Information
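The xorb formation rule quoted in this hunk can be sketched roughly as below. The limits are passed in as parameters because their actual values are defined in the xorb specification, not here; the function name `group_chunks_into_xorbs` is hypothetical, and the upload step is reduced to collecting finalized groups.

```python
from typing import List


def group_chunks_into_xorbs(
    chunk_sizes: List[int],
    max_xorb_bytes: int,   # assumed parameter; real limit comes from the xorb spec
    max_xorb_chunks: int,  # assumed parameter; real limit comes from the xorb spec
) -> List[List[int]]:
    """Group chunk sizes into xorbs, finalizing a xorb before a limit is exceeded."""
    xorbs: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in chunk_sizes:
        would_exceed = (
            current_bytes + size > max_xorb_bytes
            or len(current) + 1 > max_xorb_chunks
        )
        if current and would_exceed:
            xorbs.append(current)  # finalize (and, in a real client, upload)
            current = []
            current_bytes = 0
        current.append(size)
        current_bytes += size
    if current:
        xorbs.append(current)  # finalize the trailing partial xorb
    return xorbs
```

The sketch tracks only sizes; a real client would carry chunk bytes and hashes alongside them so the finalized xorb can be serialized.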
@@ -164,7 +164,7 @@ This information allows the system to reconstruct files by:
 2. Extracting the specific chunk ranges from each xorb
 3. Concatenating chunks in the correct order

-[See section about file reconstruction](./file-reconstruction.md).
+[See section about file reconstruction](./file-reconstruction).
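The extraction and concatenation steps listed in this hunk might look like the following sketch. It assumes xorbs are already downloaded and split into chunks, and that a term's chunk range is end-exclusive; the `Term` type and `reconstruct_file` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    xorb_id: str      # identifier of the xorb holding the chunks
    chunk_start: int  # index of the first chunk to take
    chunk_end: int    # end index, treated as exclusive (assumption)


def reconstruct_file(terms: List[Term], xorbs: Dict[str, List[bytes]]) -> bytes:
    """Concatenate the chunk ranges named by each term, in term order."""
    out = bytearray()
    for term in terms:
        chunks = xorbs[term.xorb_id]
        # Extract the contiguous chunk range from this xorb.
        for chunk in chunks[term.chunk_start:term.chunk_end]:
            out += chunk
    return bytes(out)
```

Because the same xorb can appear in several terms, the lookup is keyed by xorb identifier so each xorb only needs to be fetched once.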
docs/hub/xet/download-protocol.md
+16 -16 (16 additions & 16 deletions)
@@ -13,9 +13,9 @@ File download in the Xet protocol is a two-stage process:
 ### Single File Reconstruction

-To download a file given a file hash, first call the reconstruction API to get the file reconstruction. Follow the steps in [api.md](./api.md#1-get-file-reconstruction).
+To download a file given a file hash, first call the reconstruction API to get the file reconstruction. Follow the steps in [api](./api#1-get-file-reconstruction).

-Note that you will need at least a `read` scope auth token, [auth reference](./auth.md).
+Note that you will need at least a `read` scope auth token, [auth reference](./auth).

 > For large files it is RECOMMENDED to request the reconstruction in batches i.e. the first 10GB, download all the data, then the next 10GB and so on. Clients can use the `Range` header to specify a range of file data.
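The batching recommendation in this hunk could be sketched as below. This assumes the `Range` header follows the standard HTTP `bytes=<start>-<end>` form with an inclusive end; the exact form expected by the reconstruction API should be confirmed against the API documentation. The function names and the batch size are illustrative only.

```python
from typing import List, Tuple


def batch_ranges(file_size: int, batch_size: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into consecutive (start, inclusive_end) byte ranges."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ranges = []
    start = 0
    while start < file_size:
        end = min(start + batch_size, file_size) - 1  # inclusive end
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start: int, end: int) -> str:
    """Format a standard HTTP byte-range header value (inclusive end)."""
    return f"bytes={start}-{end}"
```

A client would request the reconstruction for one range, download all the data it refers to, and only then move on to the next range.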
docs/hub/xet/file-reconstruction.md
+4 -4 (4 additions & 4 deletions)
@@ -12,8 +12,8 @@ This document describes how a file can be represented and reconstructed from a c
 ## Core Idea

-After following the [chunking procedure](./chunking.md) a file can be represented as an ordering of chunks.
-Those chunks are then packed into [xorbs](./xorb.md) and given the set of xorbs we convert the file representation to "reconstruction" made up of "terms".
+After following the [chunking procedure](./chunking) a file can be represented as an ordering of chunks.
+Those chunks are then packed into [xorbs](./xorb) and given the set of xorbs we convert the file representation to "reconstruction" made up of "terms".
 When forming xorbs the ordering and grouping of chunks prioritizes contiguous runs of chunks that appear in a file such that when referencing a xorb we maximize the term range length.

 Any file’s raw bytes can be described as the concatenation of data produced by a sequence of terms.
@@ -105,7 +105,7 @@ A file’s reconstruction can be serialized into a shard as part of its file inf
 Conceptually, this section encodes the complete set of terms that describe the file.
 When stored this way, the representation is canonical and sufficient to reconstruct the full file solely from its referenced xorb ranges.

-Reference: [shard format file info](./shard.md#2-file-info-section)
+Reference: [shard format file info](./shard#2-file-info-section)

 ### Deserialization from the reconstruction API (JSON)
@@ -114,7 +114,7 @@ This response is represented by a structure named “QueryReconstructionResponse
 The `terms` list contains, for each term, the xorb identifier and the contiguous chunk index range to retrieve.
 Other fields may provide auxiliary details (such as offsets or fetch hints) that optimize retrieval without altering the meaning of the `terms` sequence.
docs/hub/xet/hashing.md
+1 -1 (1 addition & 1 deletion)
@@ -137,7 +137,7 @@ Reference files are provided in Hugging Face Dataset repository [xet-team/xet-sp
 In this repository there are a number of different samples implementors can use to verify hash computations.

 > Note that all hashes are represented as strings.
-To get the raw value of these hashes you must invert the endianness of each byte octet in the hash string, reversing the procedure described in [api.md](./api.md#converting-hashes-to-strings).
+To get the raw value of these hashes you must invert the endianness of each byte octet in the hash string, reversing the procedure described in [api](./api#converting-hashes-to-strings).
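The conversion mentioned in this hunk can be sketched as follows. The sketch reads "byte octet" as a group of 8 bytes within a 32-byte hash and reverses the byte order inside each group; that reading is an assumption here, and the authoritative procedure is the one in the referenced api section. The function names are hypothetical.

```python
GROUP = 8       # bytes per group whose order is reversed (assumption)
HASH_LEN = 32   # bytes in a raw hash, i.e. 64 hexadecimal characters


def hash_to_string(raw: bytes) -> str:
    """Encode a raw 32-byte hash as a 64-character hex string."""
    if len(raw) != HASH_LEN:
        raise ValueError("expected a 32-byte hash")
    # Reverse each 8-byte group, then hex-encode the groups in order.
    return "".join(raw[i:i + GROUP][::-1].hex() for i in range(0, HASH_LEN, GROUP))


def string_to_hash(text: str) -> bytes:
    """Decode a 64-character hex string back into the raw 32-byte hash."""
    if len(text) != 2 * HASH_LEN:
        raise ValueError("expected 64 hexadecimal characters")
    data = bytes.fromhex(text)
    # Reversing each group again undoes the encoding step.
    return b"".join(data[i:i + GROUP][::-1] for i in range(0, HASH_LEN, GROUP))
```

Since reversing a group twice restores it, the two functions are inverses of each other, which makes a round trip a convenient self-check before comparing against the reference files.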