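While investigating #50, I found that the huggingface CLI actually determines the real download address from the location and link fields in the response headers of the URL.
When xet is used, it takes the information from link and downloads the data with xet's own protocol.
When xet is not used, it takes the information from location and downloads over standard HTTP.
For example, in the case below you can either download via xet using the link information, or extract the URL starting with "https://cas-bridge.xethub.hf-mirror.com/xet-" from the latter part of location and download from it.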
(pyenv) 12:12:48-simpx@galaxy~$ curl -I https://hf-mirror.com/Qwen/Qwen3-0.6B/resolve/6130ef31402718485ca4d80a6234f70d9a4cf362/model.safetensors
HTTP/2 302
...
link: <http://hf-mirror.com/api/models/Qwen/Qwen3-0.6B/xet-read-token/6130ef31402718485ca4d80a6234f70d9a4cf362>; rel="xet-auth", <https://cas-server.xethub.hf.co/reconstruction/6f87d3bff66602a9611e541180dec950a6b6a9068ca274ee43ac8f83550f0223>; rel="xet-reconstruction-info"
location: YuY28veGV0LWJyaWRnZS11cy82ODBkYTcxODIzMzgzNDg5MGFhMDFmNTEvNmY4N2QzYmZmNjY2MDJhOTYxMWU1NDExODBkZWM5NTBhNmI2YTkwNjhjYTI3NGVlNDNhYzhmODM1NTBmMDIyMyoifV19&Signature=l6dO5iiLtbISab2D%7ERjg3BCNE7uuW4FLC2z2nvBLC%7EXFqC2GavfwVk5qIoUPcPsr5mIkeIbhwAl-r8BAWN7jMzmokyKyEfaxmVD4yuClLWSLMDiQrlnPZy8hcTf3jzILlW%7EMQwE8k7k0B81z2tGhBLRNp9cc-AHUvebDBL9LzowX8zJFgKfuP2acAnz7Gsn8NUxWB2MeHevDaznbEKC9vXn5QQF9wQeFwtnkeTSEgxbhsSIWxSpK8LumIEiyyL0eRDtygrnU6rjs8GH4P-xoqp9nLAPX-dAw6oJqV1VjVlCIKmXpp7RvCapgxJgjK3VVvVdzwSdzjLhbbq0UC-pgTQ__&Key-Pair-Id=K2L8F4GPSG1IFChttps://cas-bridge.xethub.hf-mirror.com/xet-bridge-us/680da718233834890aa01f51/6f87d3bff66602a9611e541180dec950a6b6a9068ca274ee43ac8f83550f0223?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250510%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250510T041327Z&X-Amz-Expires=3600&X-Amz-Signature=006cefbc3aadef1ed58411822413179e39147c5fdda4dd80c9a9bc0d1ecf1203&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1746854007&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0Njg1NDAwN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82ODBkYTcxODIzMzgzNDg5MGFhMDFmNTEvNmY4N2QzYmZmNjY2MDJhOTYxMWU1NDExODBkZWM5NTBhNmI2YTkwNjhjYTI3NGVlNDNhYzhmODM1NTBmMDIyMyoifV19&Signature=l6dO5iiLtbISab2D%7ERjg3BCNE7uuW4FLC2z2nvBLC%7EXFqC2GavfwVk5qIoUPcPsr5mIkeIbhwAl-r8BAWN7jMzmokyKyEfaxmVD4yuClLWSLMDiQrlnPZy8hcTf3jzILlW%7EMQwE8k7k0B81z2tGhBLRNp9cc-AHUvebDBL9LzowX8zJFgKfuP2acAnz7Gsn8NUxWB2MeHevDaznbEKC9vXn5QQF9wQeFwtnkeTSEgxbhsSIWxSpK8LumIEiyyL0eRDtygrnU6rjs8GH4P-xoqp9nLAPX-dAw6oJqV1VjVlCIKmXpp7RvCapgxJgjK3VVvVdzwSdzjLhbbq0UC-pgTQ__&Key-Pair-Id=K2L8F4GPSG1IFC
referrer-policy: strict-origin-when-cross-origin
server: hf-mirror
...
content-length: 1317
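The choice between the two routes can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not the actual huggingface_hub code: the function names `parse_link_header` and `pick_download_route`, the `xet_enabled` flag, and the placeholder location URL are all assumptions made for illustration.

```python
import re

# Matches one `<url>; rel="name"` entry of a Link header.
_LINK_ENTRY = re.compile(r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?')


def parse_link_header(value: str) -> dict[str, str]:
    """Map each rel name found in a Link header value to its URL."""
    return {rel: url for url, rel in _LINK_ENTRY.findall(value)}


def pick_download_route(headers: dict[str, str], xet_enabled: bool) -> tuple[str, str]:
    """Return ("xet", reconstruction_url) or ("http", location_url)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    links = parse_link_header(lowered.get("link", ""))
    if xet_enabled and "xet-reconstruction-info" in links:
        return "xet", links["xet-reconstruction-info"]
    return "http", lowered["location"]


# Headers modelled on the curl output above; the location is a placeholder.
example_headers = {
    "link": (
        "<http://hf-mirror.com/api/models/Qwen/Qwen3-0.6B/xet-read-token/"
        '6130ef31402718485ca4d80a6234f70d9a4cf362>; rel="xet-auth", '
        "<https://cas-server.xethub.hf.co/reconstruction/"
        "6f87d3bff66602a9611e541180dec950a6b6a9068ca274ee43ac8f83550f0223>; "
        'rel="xet-reconstruction-info"'
    ),
    "location": "https://mirror.example/model.safetensors",
}

print(pick_download_route(example_headers, xet_enabled=True))
print(pick_download_route(example_headers, xet_enabled=False))
```

If the link header is absent, the sketch falls back to location even when xet is enabled, which is the property the rest of this issue relies on.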
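So hf-mirror could modify the response here and use another site (for example modelscope) as the origin. It only needs to ensure two things:
- Remove the `link` field, so the CLI does not automatically use the xet protocol
- Change the address in the `location` field to point directly at the modelscope address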
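On the server side, the two changes amount to a small header rewrite. The following is a minimal sketch under the assumption that the mirror can treat the response headers as a plain dictionary before sending them; `rewrite_redirect_headers` is a hypothetical helper, not part of hf-mirror.

```python
def rewrite_redirect_headers(headers: dict[str, str], new_location: str) -> dict[str, str]:
    """Return a copy of the headers without Link and with Location replaced.

    Header names are compared case-insensitively; all other headers are
    passed through unchanged.
    """
    rewritten = {
        name: value
        for name, value in headers.items()
        if name.lower() not in ("link", "location")
    }
    rewritten["location"] = new_location
    return rewritten


original = {
    "link": '<https://cas-server.xethub.hf.co/reconstruction/abc>; rel="xet-reconstruction-info"',
    "location": "https://cas-bridge.xethub.hf-mirror.com/xet-bridge-us/placeholder",
    "server": "hf-mirror",
}
print(rewrite_redirect_headers(
    original,
    "https://modelscope.cn/models/Qwen/Qwen3-0.6B/resolve/master/model.safetensors",
))
```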
I don't have a local server, so I hacked the CLI code to simulate "the server returns a specific location".
The result is that it works. The hack is as follows:
@@ -145,7 +145,7 @@ def are_symlinks_supported(cache_dir: Union[str, Path, None] = None) -> bool:
     return _are_symlinks_supported_in_dir[cache_dir]
-@dataclass(frozen=True)
+@dataclass(frozen=False)
 class HfFileMetadata:
     """Data structure containing information about a file versioned on the Hub.
@@ -1457,7 +1457,7 @@ def get_hf_file_metadata(
     hf_raise_for_status(r)
     # Return
-    return HfFileMetadata(
+    meta = HfFileMetadata(  #
         commit_hash=r.headers.get(constants.HUGGINGFACE_HEADER_X_REPO_COMMIT),
         # We favor a custom header indicating the etag of the linked resource, and
         # we fallback to the regular etag header.
@@ -1471,6 +1471,9 @@ def get_hf_file_metadata(
         ),
         xet_file_data=parse_xet_file_data_from_response(r),  # type: ignore
     )
+    if meta.etag == 'f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b':
+        meta.xet_file_data=None
+        meta.location='https://modelscope.cn/models/Qwen/Qwen3-0.6B/resolve/master/model.safetensors'
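 def _get_metadata_or_catch_error(
The etag here was obtained from https://modelscope.cn/models/Qwen/Qwen3-0.6B/file/view/master/model.safetensors?status=2; it is the sha256 of the model file.
huggingface-cli would normally have received the hf-mirror address, which I changed to the modelscope one here.
Because the etag matches, there is no need to worry about the downloaded data being wrong.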
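Since the etag in this case is the sha256 of the file, the downloaded data can also be checked directly. A minimal sketch, assuming the file has already been saved locally; `sha256_of_file` and `matches_etag` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex sha256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_etag(path: str, etag: str) -> bool:
    """Compare a file's sha256 with an etag, ignoring surrounding quotes and case."""
    return sha256_of_file(path) == etag.strip().strip('"').lower()
```

For the file in this issue, the call would be `matches_etag("model.safetensors", "f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b")`, where the local path is an assumption.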
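Final proposal
- hf-mirror.com could consider maintaining a meta table that stores URLs from trusted sources as `{etag: url}`
- When a client requests a "/resolve" address and the etag matches, rewrite `location` and remove `link`
This might relieve a large amount of traffic pressure.
Of course, the user agreements of the trusted sources also need to be considered, to make sure that using them as origins in this way is compliant.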
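The table lookup itself could look roughly like the sketch below. Everything in it is illustrative: `TRUSTED_SOURCES`, `normalize_etag`, and `lookup_trusted_url` are hypothetical names, and the use of an `X-Linked-Etag` header with a fallback to `ETag` is an assumption based on the comment in the patched code about preferring the etag of the linked resource.

```python
from typing import Optional

# Hypothetical meta table: sha256 etag -> URL on a trusted source.
TRUSTED_SOURCES = {
    "f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b":
        "https://modelscope.cn/models/Qwen/Qwen3-0.6B/resolve/master/model.safetensors",
}


def normalize_etag(raw: str) -> str:
    """Strip the weak-validator prefix and surrounding quotes from an etag."""
    etag = raw.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


def lookup_trusted_url(
    headers: dict[str, str],
    table: dict[str, str] = TRUSTED_SOURCES,
) -> Optional[str]:
    """Return the trusted URL for this response's etag, or None if there is no match."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Assumption: prefer the linked resource's etag, then the regular etag.
    raw = lowered.get("x-linked-etag") or lowered.get("etag")
    if raw is None:
        return None
    return table.get(normalize_etag(raw))
```

When the lookup returns a URL, the mirror would rewrite the redirect as proposed; when it returns None, the response would be passed through unchanged, so files without a trusted copy behave exactly as before.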