Return a specific location to use ModelScope as the origin #51

@simpx


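While investigating #50, I found that the huggingface cli actually decides where the real data is downloaded from based on the location and link fields in the response headers of the URL.

When xet is in use, it takes the information from link and downloads the data with xet's own protocol.

When xet is not in use, it takes the information from location and downloads over standard HTTP.

For example, in the response below, the file can be downloaded with xet based on the link information, or over HTTP from the URL in location that starts with "https://cas-bridge.xethub.hf-mirror.com/xet-".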

(pyenv) 12:12:48-simpx@galaxy~$ curl -I https://hf-mirror.com/Qwen/Qwen3-0.6B/resolve/6130ef31402718485ca4d80a6234f70d9a4cf362/model.safetensors
HTTP/2 302
...
link: <http://hf-mirror.com/api/models/Qwen/Qwen3-0.6B/xet-read-token/6130ef31402718485ca4d80a6234f70d9a4cf362>; rel="xet-auth", <https://cas-server.xethub.hf.co/reconstruction/6f87d3bff66602a9611e541180dec950a6b6a9068ca274ee43ac8f83550f0223>; rel="xet-reconstruction-info"
location: YuY28veGV0LWJyaWRnZS11cy82ODBkYTcxODIzMzgzNDg5MGFhMDFmNTEvNmY4N2QzYmZmNjY2MDJhOTYxMWU1NDExODBkZWM5NTBhNmI2YTkwNjhjYTI3NGVlNDNhYzhmODM1NTBmMDIyMyoifV19&Signature=l6dO5iiLtbISab2D%7ERjg3BCNE7uuW4FLC2z2nvBLC%7EXFqC2GavfwVk5qIoUPcPsr5mIkeIbhwAl-r8BAWN7jMzmokyKyEfaxmVD4yuClLWSLMDiQrlnPZy8hcTf3jzILlW%7EMQwE8k7k0B81z2tGhBLRNp9cc-AHUvebDBL9LzowX8zJFgKfuP2acAnz7Gsn8NUxWB2MeHevDaznbEKC9vXn5QQF9wQeFwtnkeTSEgxbhsSIWxSpK8LumIEiyyL0eRDtygrnU6rjs8GH4P-xoqp9nLAPX-dAw6oJqV1VjVlCIKmXpp7RvCapgxJgjK3VVvVdzwSdzjLhbbq0UC-pgTQ__&Key-Pair-Id=K2L8F4GPSG1IFChttps://cas-bridge.xethub.hf-mirror.com/xet-bridge-us/680da718233834890aa01f51/6f87d3bff66602a9611e541180dec950a6b6a9068ca274ee43ac8f83550f0223?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250510%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250510T041327Z&X-Amz-Expires=3600&X-Amz-Signature=006cefbc3aadef1ed58411822413179e39147c5fdda4dd80c9a9bc0d1ecf1203&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1746854007&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0Njg1NDAwN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82ODBkYTcxODIzMzgzNDg5MGFhMDFmNTEvNmY4N2QzYmZmNjY2MDJhOTYxMWU1NDExODBkZWM5NTBhNmI2YTkwNjhjYTI3NGVlNDNhYzhmODM1NTBmMDIyMyoifV19&Signature=l6dO5iiLtbISab2D%7ERjg3BCNE7uuW4FLC2z2nvBLC%7EXFqC2GavfwVk5qIoUPcPsr5mIkeIbhwAl-r8BAWN7jMzmokyKyEfaxmVD4yuClLWSLMDiQrlnPZy8hcTf3jzILlW%7EMQwE8k7k0B81z2tGhBLRNp9cc-AHUvebDBL9LzowX8zJFgKfuP2acAnz7Gsn8NUxWB2MeHevDaznbEKC9vXn5QQF9wQeFwtnkeTSEgxbhsSIWxSpK8LumIEiyyL0eRDtygrnU6rjs8GH4P-xoqp9nLAPX-dAw6oJqV1VjVlCIKmXpp7RvCapgxJgjK3VVvVdzwSdzjLhbbq0UC-pgTQ__&Key-Pair-Id=K2L8F4GPSG1IFC
referrer-policy: strict-origin-when-cross-origin
server: hf-mirror
...
content-length: 1317
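
To make the choice between the two headers concrete, here is a minimal sketch in Python. It is not the actual huggingface_hub code: the function names and the xet_enabled flag are hypothetical, and it only assumes the header shapes visible in the curl output above.

```python
import re


def parse_link_header(value: str) -> dict[str, str]:
    """Map each rel name to its URL for a header like '<url>; rel="name", ...'."""
    pairs = re.findall(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value)
    return {rel: url for url, rel in pairs}


def pick_download_route(headers: dict[str, str], xet_enabled: bool) -> tuple[str, str]:
    """Return ("xet", url) when xet info is usable, else ("http", location)."""
    links = parse_link_header(headers.get("link", ""))
    # Hypothetical rule: use xet only if the client supports it AND the
    # server advertised reconstruction info in the link header.
    if xet_enabled and "xet-reconstruction-info" in links:
        return ("xet", links["xet-reconstruction-info"])
    # Otherwise fall back to a plain HTTP download of the redirect target.
    return ("http", headers["location"])
```

With the link header removed from the response, a function like this can only ever take the plain HTTP branch, which is the behavior the proposal relies on.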

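So hf-mirror could modify the response here and use another site (for example ModelScope) as the origin. It only needs to ensure two things:

  1. Remove the link field, so that the cli does not automatically switch to the xet protocol
  2. Change the address in the location field to point directly at the ModelScope address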
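
The two points above could look roughly like this on the server side. This is a sketch only: rewrite_redirect_headers is a hypothetical helper, headers are modeled as a plain dict, and a real proxy may need to adjust other headers that are not discussed in this issue.

```python
def rewrite_redirect_headers(headers: dict[str, str], origin_url: str) -> dict[str, str]:
    """Return a copy of the 302 headers that points at origin_url.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    rewritten = {
        name: value
        for name, value in headers.items()
        # Point 1: drop link so the client has no xet information.
        # The old location is dropped too, so it can be replaced below.
        if name.lower() not in ("link", "location")
    }
    # Point 2: redirect straight to the alternative origin.
    rewritten["location"] = origin_url
    return rewritten
```

The input dict is left unmodified, so the original upstream headers remain available, for example for logging or for falling back to the normal redirect.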

I don't have a local server, so I hacked the cli code to simulate "the server returns a specific location".
It turns out this works. The hack is as follows:

@@ -145,7 +145,7 @@ def are_symlinks_supported(cache_dir: Union[str, Path, None] = None) -> bool:
    return _are_symlinks_supported_in_dir[cache_dir]


-@dataclass(frozen=True)
+@dataclass(frozen=False)
class HfFileMetadata:
    """Data structure containing information about a file versioned on the Hub.
@@ -1457,7 +1457,7 @@ def get_hf_file_metadata(
    hf_raise_for_status(r)

    # Return
-    return HfFileMetadata(
+    meta = HfFileMetadata(
        commit_hash=r.headers.get(constants.HUGGINGFACE_HEADER_X_REPO_COMMIT),
        # We favor a custom header indicating the etag of the linked resource, and
        # we fallback to the regular etag header.
@@ -1471,6 +1471,10 @@ def get_hf_file_metadata(
        ),
        xet_file_data=parse_xet_file_data_from_response(r),  # type: ignore
    )
+    if meta.etag == 'f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b':
+        meta.xet_file_data = None
+        meta.location = 'https://modelscope.cn/models/Qwen/Qwen3-0.6B/resolve/master/model.safetensors'
+    return meta


def _get_metadata_or_catch_error(

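The etag here is the sha256 of the model file, taken from https://modelscope.cn/models/Qwen/Qwen3-0.6B/file/view/master/model.safetensors?status=2

Normally huggingface-cli would receive an hf-mirror address; here I replaced it with the ModelScope one.

Because the etag matches, there is no need to worry about the downloaded data being wrong.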
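
If one wants to double-check this independently after a download, the file's sha256 can be compared with the etag. The sketch below assumes, as in this case, that the etag is the hex sha256 of the file; the helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_etag(path: str, etag: str) -> bool:
    """Compare the file hash with an etag, ignoring surrounding quotes and case."""
    return sha256_of_file(path) == etag.strip('"').lower()
```

For the file in this issue, the call would be matches_etag("model.safetensors", "f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b").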

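Final proposal

  1. hf-mirror.com could consider maintaining a meta table that stores URLs from trusted sources in the form {etag: url}
  2. When a client requests a "/resolve" address and the etag matches, rewrite location and remove link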
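
A rough sketch of the lookup in step 2 follows. Everything here is an assumption for illustration: TRUSTED_ORIGINS, normalize_etag and lookup_origin are hypothetical names, the table is an in-memory dict, and the single entry is the one used in the hack above.

```python
from typing import Optional

# Hypothetical meta table: sha256 etag -> URL at a trusted source.
TRUSTED_ORIGINS = {
    "f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b":
        "https://modelscope.cn/models/Qwen/Qwen3-0.6B/resolve/master/model.safetensors",
}


def normalize_etag(etag: str) -> str:
    """Strip an optional weak-validator prefix (W/) and surrounding quotes."""
    etag = etag.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


def lookup_origin(path: str, etag: str) -> Optional[str]:
    """Return a trusted URL for a /resolve/ request whose etag is known, else None."""
    if "/resolve/" not in path:
        return None  # only file downloads are candidates for redirection
    return TRUSTED_ORIGINS.get(normalize_etag(etag))
```

When lookup_origin returns None, the server would keep its normal behavior. Keying on the etag rather than on the repo path means a file is redirected only when its content hash is known to match.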

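Doing this might relieve a large amount of traffic pressure.

Of course, the terms of service of the trusted sources also need to be considered, to make sure that using them as origins in this way is compliant.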
