xet protocol specification in hub docs #1952
Closed
9 commits
c10c984 very initial draft (assafvayner)
1d84b67 adding files for CI to get preview (assafvayner)
08f5bc5 move to hub subdir (assafvayner)
800c0c2 update toc (assafvayner)
b5b3bec re-work toc, don't like it off main page (assafvayner)
ef50638 link from storage-backend and iterate on title formatting (assafvayner)
e8c859c fix some local links (assafvayner)
df6b724 Jared suggestions (assafvayner)
4f40ed1 move to xet-specification (assafvayner)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -41,7 +41,7 @@ Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories d | |
|
|
||
| ### Using Xet Storage | ||
|
|
||
| To start using Xet Storage, you need a Xet-enabled repository and a Xet-aware version of the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) Python library. As of May 23rd, 2025, Xet-enabled repositories are the default [for all new users and organizations on the Hub](https://huggingface.co/changelog/xet-default-for-new-users). | ||
|
|
||
| > [!TIP] | ||
| > For user and organization profiles created before May 23rd, 2025, you can make Xet the default for all your repositories by [signing up here](https://huggingface.co/join/xet). You can apply for yourself or your entire organization (requires [admin permissions](https://huggingface.co/docs/hub/organizations-security)). Once approved, all existing repositories will be automatically migrated to Xet and future repositories will be Xet-enabled by default. | ||
|
|
@@ -54,7 +54,7 @@ To access a Xet-aware version of the `huggingface_hub`, simply install the lates | |
| pip install -U huggingface_hub | ||
| ``` | ||
|
|
||
| As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend. | ||
|
|
||
| If you use the `transformers` or `datasets` libraries, it's already using `huggingface_hub`. So long as the version of `huggingface_hub` >= 0.32.0, no further action needs to be taken. | ||
|
|
||
|
|
@@ -77,7 +77,7 @@ To see more detailed usage docs, refer to the `huggingface_hub` docs for: | |
| Xet integrates seamlessly with the Hub's current Python-based workflows. However, there are a few steps you may consider to get the most benefits from Xet storage: | ||
|
|
||
| - **Use `hf_xet`**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. | ||
| - **Utilize `hf_xet` environment variables**: The default installation of `hf_xet` is designed to support the broadest range of hardware. To take advantage of setups with more network bandwidth or processing power read up on `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) to further speed up downloads and uploads. | ||
| - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient. | ||
| - **Be Specific in .gitattributes**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. | ||
| - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. | ||
|
|
@@ -120,3 +120,9 @@ The legacy storage system on the Hub, Git LFS utilizes many of the same conventi | |
| The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned - incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). | ||
|
|
||
| This leads to a worse developer experience along with a proliferation of additional storage. | ||
|
|
||
| ## Open Source Xet Protocol | ||
|
||
|
|
||
| The Xet storage backend is built on an open source protocol that enables efficient, chunk-based storage and retrieval of large files. This protocol provides the foundation for the deduplication and performance benefits described throughout this documentation. | ||
|
|
||
| For detailed technical specifications about the Xet protocol, including API endpoints, authentication mechanisms, chunking algorithms, and file reconstruction processes, see the [Xet Protocol Specification](./xet/index). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| - local: index | ||
| title: Xet Protocol Specification | ||
|
|
||
| - title: Building a client library for xet storage | ||
| sections: | ||
| - local: upload-protocol | ||
| title: Upload Protocol | ||
| - local: download-protocol | ||
| title: Download Protocol | ||
| - local: api | ||
| title: CAS API | ||
| - local: auth | ||
| title: Authentication and Authorization | ||
| - local: file-id | ||
| title: Hugging Face Hub Files Conversion to Xet File IDs | ||
|
|
||
| - title: Overall Xet architecture | ||
| sections: | ||
| - local: chunking | ||
| title: Content-Defined Chunking | ||
| - local: hashing | ||
| title: Hashing Methods | ||
| - local: file-reconstruction | ||
| title: File Reconstruction | ||
| - local: xorb | ||
| title: Xorb Format | ||
| - local: shard | ||
| title: Shard Format | ||
| - local: deduplication | ||
| title: Deduplication |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| # CAS API Documentation | ||
|
|
||
| This document describes the HTTP API endpoints used by the CAS (Content Addressable Storage) client to interact with the remote CAS server. | ||
|
||
|
|
||
| ## Authentication | ||
|
|
||
| To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](./auth). | ||
|
|
||
| ## Converting Hashes to Strings | ||
|
|
||
| Sometimes hashes are used in API paths as hexadecimal strings (reconstruction, xorb upload, global dedupe API). | ||
|
|
||
| To convert a 32-byte hash to a 64-character hexadecimal string for use in an API path, follow the specific procedure below. Clients MUST NOT convert the bytes to hexadecimal directly in their original order. | ||
|
|
||
| ### Procedure | ||
|
|
||
| For every 8-byte region of the hash (indices 0-7, 8-15, 16-23, 24-31), reverse the order of the bytes within that region, then concatenate the regions back in their original order. | ||
|
|
||
| Put another way, treat each 8-byte part of the hash as a little-endian 64-bit unsigned integer, then concatenate the hexadecimal representations of the 4 numbers in order (each left-padded with zeros to 16 characters). | ||
|
|
||
| > [!NOTE] | ||
| > In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure. | ||
| ### Example | ||
|
|
||
| Suppose a hash value is: | ||
| `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]` | ||
|
|
||
| Then before converting to a string it will first have its bytes reordered to: | ||
| `[7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 23, 22, 21, 20, 19, 18, 17, 16, 31, 30, 29, 28, 27, 26, 25, 24]` | ||
|
|
||
| So the string value of the provided hash [0..32] is **NOT** `000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f`. | ||
| It is: `07060504030201000f0e0d0c0b0a090817161514131211101f1e1d1c1b1a1918`. | ||
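
The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation; the function names `hash_to_hex` and `hex_to_hash` are hypothetical and do not come from the specification.

```python
def hash_to_hex(hash_bytes: bytes) -> str:
    """Convert a 32-byte hash to the 64-character string used in API paths."""
    if len(hash_bytes) != 32:
        raise ValueError("expected a 32-byte hash")
    parts = []
    for start in range(0, 32, 8):
        # Read each 8-byte region as a little-endian unsigned 64-bit integer.
        value = int.from_bytes(hash_bytes[start:start + 8], "little")
        # Render it as 16 lowercase hex characters, left-padded with zeros.
        parts.append(f"{value:016x}")
    return "".join(parts)


def hex_to_hash(hex_string: str) -> bytes:
    """Inverse of hash_to_hex: recover the original 32 bytes."""
    if len(hex_string) != 64:
        raise ValueError("expected a 64-character hexadecimal string")
    regions = []
    for start in range(0, 64, 16):
        value = int(hex_string[start:start + 16], 16)
        regions.append(value.to_bytes(8, "little"))
    return b"".join(regions)
```

For the example hash, `hash_to_hex(bytes(range(32)))` begins with `0706050403020100`, the first region with its bytes reversed.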
|
|
||
| ## Endpoints | ||
|
|
||
| ### 1. Get File Reconstruction | ||
|
|
||
| - **Description**: Retrieves reconstruction information for a specific file. Supports byte ranges when the `Range` header is set. | ||
| - **Path**: `/v1/reconstructions/{file_id}` | ||
| - **Method**: `GET` | ||
| - **Parameters**: | ||
| - `file_id`: File hash in hex format (64 lowercase hexadecimal characters). | ||
| See [file hashes](./hashing#file-hashes) for computing the file hash and [converting hashes to strings](./api#converting-hashes-to-strings). | ||
| - **Headers**: | ||
| - `Range`: OPTIONAL. Format: `bytes={start}-{end}` (end is inclusive). | ||
| - **Minimum Token Scope**: `read` | ||
| - **Body**: None. | ||
| - **Response**: JSON (`QueryReconstructionResponse`) | ||
|
|
||
| ```json | ||
| { | ||
| "offset_into_first_range": 0, | ||
| "terms": [...], | ||
| "fetch_info": {...} | ||
| } | ||
| ``` | ||
|
|
||
| - **Error Responses**: See [Error Cases](./api#error-cases) | ||
| - `400 Bad Request`: Malformed `file_id` in the path. Fix the path before retrying. | ||
| - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header. | ||
| - `404 Not Found`: The file does not exist. Not retryable. | ||
| - `416 Range Not Satisfiable`: The requested byte range start exceeds the end of the file. Not retryable. | ||
|
|
||
| ```txt | ||
| GET /v1/reconstructions/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef | ||
| -H "Authorization: Bearer <token>" | ||
| OPTIONAL: -H "Range: bytes=0-100000" | ||
| ``` | ||
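
As a rough sketch, the request above could be assembled with Python's standard library as follows. The function name `build_reconstruction_request` and the base URL used in the usage note are placeholders, since the real base URL comes from the authentication flow. The request is only built here, not sent.

```python
import re
import urllib.request
from typing import Optional, Tuple

# 64 lowercase hexadecimal characters, as required for `file_id`.
HEX_HASH = re.compile(r"[0-9a-f]{64}")


def build_reconstruction_request(
    base_url: str,
    file_id: str,
    token: str,
    byte_range: Optional[Tuple[int, int]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a GET /v1/reconstructions/{file_id} request."""
    if not HEX_HASH.fullmatch(file_id):
        raise ValueError("file_id must be 64 lowercase hexadecimal characters")
    url = f"{base_url.rstrip('/')}/v1/reconstructions/{file_id}"
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Bearer {token}")
    if byte_range is not None:
        start, end = byte_range  # `end` is inclusive
        if start < 0 or end < start:
            raise ValueError("invalid byte range")
        request.add_header("Range", f"bytes={start}-{end}")
    return request
```

Calling `build_reconstruction_request("https://cas.example.invalid", file_id, token, (0, 100000))` yields a request carrying `Range: bytes=0-100000`; passing the result to `urllib.request.urlopen` would perform the call.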
|
|
||
| #### Example File Reconstruction Response Body | ||
|
|
||
| See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification. | ||
|
|
||
| ### 2. Query Chunk Deduplication (Global Deduplication) | ||
|
|
||
| - **Description**: Checks if a chunk exists in the CAS for deduplication purposes. | ||
| - **Path**: `/v1/chunks/{prefix}/{hash}` | ||
| - **Method**: `GET` | ||
| - **Parameters**: | ||
| - `prefix`: The only acceptable prefix for the Global Deduplication API is `default-merkledb`. | ||
| - `hash`: Chunk hash in hex format (64 lowercase hexadecimal characters). | ||
| See [Chunk Hashes](./hashing#chunk-hashes) to compute the chunk hash and [converting hashes to strings](./api#converting-hashes-to-strings). | ||
| - **Minimum Token Scope**: `read` | ||
| - **Body**: None. | ||
| - **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](./shard#global-deduplication). | ||
| - **Error Responses**: See [Error Cases](./api#error-cases) | ||
| - `400 Bad Request`: Malformed hash in the path. Fix the path before retrying. | ||
| - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header. | ||
| - `404 Not Found`: Chunk not already tracked by global deduplication. Not retryable. | ||
|
|
||
| ```txt | ||
| GET /v1/chunks/default-merkledb/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef | ||
| -H "Authorization: Bearer <token>" | ||
| ``` | ||
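
One way a client might build the path and interpret the documented status codes is sketched below. The helper names are hypothetical, and the choice of exception types is an assumption made for illustration; deserializing the returned shard bytes is out of scope here.

```python
import re
from typing import Optional

# The only prefix this section lists for the Global Deduplication API.
DEDUPE_PREFIX = "default-merkledb"
HEX_HASH = re.compile(r"[0-9a-f]{64}")


def dedupe_query_path(chunk_hash_hex: str) -> str:
    """Return the request path for a chunk deduplication query."""
    if not HEX_HASH.fullmatch(chunk_hash_hex):
        raise ValueError("hash must be 64 lowercase hexadecimal characters")
    return f"/v1/chunks/{DEDUPE_PREFIX}/{chunk_hash_hex}"


def interpret_dedupe_response(status: int, body: bytes) -> Optional[bytes]:
    """Map a response to shard bytes (hit) or None (chunk not tracked)."""
    if status == 200:
        return body  # shard-format bytes, to be deserialized by the caller
    if status == 404:
        return None  # not an error: the chunk is simply not tracked yet
    if status == 400:
        raise ValueError("malformed hash in the request path")
    if status == 401:
        raise PermissionError("token missing or expired; refresh and retry")
    raise RuntimeError(f"unexpected status {status}")
```

Treating `404` as a normal "no match" result rather than an exception keeps the upload path simple: the client just uploads the chunk as new data.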
|
|
||
| #### Example Shard Response Body | ||
|
|
||
| An example shard response body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.shard.dedupe). | ||
|
|
||
| ### 3. Upload Xorb | ||
|
|
||
| - **Description**: Uploads a serialized Xorb to the server. This is the call that transfers the actual file data, in serialized Xorb format. | ||
| - **Path**: `/v1/xorbs/{prefix}/{hash}` | ||
| - **Method**: `POST` | ||
| - **Parameters**: | ||
| - `prefix`: The only acceptable prefix for the Xorb upload API is `default`. | ||
| - `hash`: Xorb hash in hex format (64 lowercase hexadecimal characters). | ||
| See [Xorb Hashes](./hashing#xorb-hashes) to compute the hash, and [converting hashes to strings](./api#converting-hashes-to-strings). | ||
| - **Minimum Token Scope**: `write` | ||
| - **Body**: Serialized Xorb bytes (`application/octet-stream`). | ||
| See [xorb format serialization](./xorb). | ||
| - **Response**: JSON (`UploadXorbResponse`) | ||
|
|
||
| ```json | ||
| { | ||
| "was_inserted": true | ||
| } | ||
| ``` | ||
|
|
||
| - Note: `was_inserted` is `false` if the Xorb already exists; this is not an error. | ||
|
|
||
| - **Error Responses**: See [Error Cases](./api#error-cases) | ||
| - `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized. | ||
| - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header. | ||
| - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token. | ||
|
|
||
| ```txt | ||
| POST /v1/xorbs/default/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef | ||
| -H "Authorization: Bearer <token>" | ||
| ``` | ||
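
A minimal sketch of the upload call and of reading `was_inserted` follows. The function names are hypothetical, the base URL is whatever the authentication flow returns, and setting an explicit `Content-Type` header is an assumption based on the body type listed above.

```python
import json
import urllib.request


def build_xorb_upload_request(
    base_url: str, xorb_hash_hex: str, xorb_bytes: bytes, token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/xorbs/default/{hash} request."""
    url = f"{base_url.rstrip('/')}/v1/xorbs/default/{xorb_hash_hex}"
    request = urllib.request.Request(url, data=xorb_bytes, method="POST")
    request.add_header("Authorization", f"Bearer {token}")
    # Assumption: the serialized Xorb is labeled as raw bytes.
    request.add_header("Content-Type", "application/octet-stream")
    return request


def parse_upload_xorb_response(body: bytes) -> bool:
    """Return True if the Xorb was newly inserted, False if it already existed."""
    return bool(json.loads(body)["was_inserted"])
```

Both values of `was_inserted` indicate success, so a caller would typically use the flag only for logging or statistics.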
|
|
||
| #### Example Xorb Request Body | ||
|
|
||
| An example xorb request body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb). | ||
|
|
||
| ### 4. Upload Shard | ||
|
|
||
| - **Description**: Uploads a Shard to the CAS. | ||
| Uploads file reconstructions and new xorb listing, serialized into the shard format; marks the files as uploaded. | ||
| - **Path**: `/v1/shards` | ||
| - **Method**: `POST` | ||
| - **Minimum Token Scope**: `write` | ||
| - **Body**: Serialized Shard data as bytes (`application/octet-stream`). | ||
| See [Shard format guide](./shard#shard-upload). | ||
| - **Response**: JSON (`UploadShardResponse`) | ||
|
|
||
| ```json | ||
| { | ||
| "result": 0 | ||
| } | ||
| ``` | ||
|
|
||
| - Where `result` is: | ||
| - `0`: The Shard already exists. | ||
| - `1`: `SyncPerformed` — the Shard was registered. | ||
|
|
||
| The value of `result` does not carry any meaning for the client. If the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded. | ||
|
|
||
| - **Error Responses**: See [Error Cases](./api#error-cases) | ||
| - `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification. | ||
| - This can mean that a referenced Xorb doesn't exist or that the shard is too large. | ||
| - `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header. | ||
| - `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). | ||
|
|
||
| ```txt | ||
| POST /v1/shards | ||
| -H "Authorization: Bearer <token>" | ||
| ``` | ||
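
Because success is determined by the status code alone, a client can treat `result` as purely informational. The sketch below illustrates that. The enum member name `ALREADY_EXISTS` is hypothetical (the section names only `SyncPerformed`), as is the function name.

```python
import json
from enum import IntEnum
from typing import Optional, Tuple


class UploadShardResult(IntEnum):
    ALREADY_EXISTS = 0  # hypothetical name for "the Shard already exists"
    SYNC_PERFORMED = 1  # the Shard was registered


def interpret_shard_upload(
    status: int, body: bytes
) -> Tuple[bool, Optional[UploadShardResult]]:
    """Return (succeeded, result); only the status code decides success."""
    if status != 200:
        return False, None
    try:
        result = UploadShardResult(json.loads(body).get("result"))
    except (ValueError, AttributeError):
        # Unknown or missing value: still a success, nothing to report.
        result = None
    return True, result
```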
|
|
||
| #### Example Shard Request Body | ||
|
|
||
| An example shard request body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.shard.verification-no-footer). | ||
|
|
||
| ## Error Cases | ||
|
|
||
| ### Non-Retryable Errors | ||
|
|
||
| - **400 Bad Request**: Returned when the request parameters are invalid (for example, invalid Xorb/Shard on upload APIs). | ||
| - **401 Unauthorized**: Refresh the token to continue making requests, or provide a token in the `Authorization` header. | ||
| - **403 Forbidden**: Token provided but does not have a wide enough scope (for example, a `read` token was provided for an API requiring `write` scope). | ||
| - **404 Not Found**: Occurs on `GET` APIs where the resource (Xorb, file) does not exist. | ||
| - **416 Range Not Satisfiable**: Reconstruction API only; returned when byte range requests are invalid. Specifically, the requested start range is greater than or equal to the length of the file. | ||
|
|
||
| ### Retryable Errors | ||
|
|
||
| - **Connection Errors**: Often caused by network issues. Retry if intermittent. | ||
| Clients SHOULD ensure no firewall blocks requests and SHOULD NOT use DNS overrides. | ||
| - **429 Rate Limiting**: Lower your request rate using a backoff strategy, then wait and retry. | ||
| Assume all APIs are rate limited. | ||
| - **500 Internal Server Error**: The server experienced an intermittent issue; clients SHOULD retry their requests. | ||
| - **503 Service Unavailable**: Service is temporarily unable to process requests; wait and retry. | ||
| - **504 Gateway Timeout**: Service took too long to respond; wait and retry. | ||
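
The retry guidance above could be implemented along these lines. This is a sketch under stated assumptions: the specification does not prescribe a backoff algorithm, so the exponential backoff with full jitter, the default parameters, and all function names here are illustrative choices. `send` is a caller-supplied function that performs one request and returns `(status, body)`.

```python
import random
import time
from typing import Callable, Tuple

# Status codes this section lists as retryable.
RETRYABLE_STATUSES = frozenset({429, 500, 503, 504})


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter (parameters are assumptions)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        try:
            status, body = send()
        except ConnectionError:
            status, body = None, b""  # connection errors are retried
        if status is not None and not is_retryable(status):
            return status, body  # success or a non-retryable error
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt))
    raise RuntimeError("request did not succeed after retries")
```

Non-retryable statuses such as `400` or `404` are returned to the caller immediately, so the caller can fix the request or refresh the token instead of waiting.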
Earlier section (not sure how to get comment over to area not in diff, sigh) - Security Model: I think you can deep link into protocol to refer to areas of the protocol that speak to the privacy preserving global dedup.
There's 2 things to link to in regards to "security model"
A. in deduplication.md the very short section titled `#### HMAC Security Mechanism` is relevant to what you described.
B. the auth.md file as a whole if it's desired to show how the hub is the source of truth on auth for CAS