Commit 9a0497e

Wauplin, pcuenca, and LysandreJik authored
Download file to specific destination (#1360)
* Add local_dir option to hf_hub_download/snapshot_download + docstring + tests
* add snapshot_download test + fix is subdirectory
* return local_dir in snapshot_download + fix
* adapt tests
* refactor snapshot download test -remove duplicate tests
* small
* remove unrelated image
* updated download.mdx
* doc
* Fix CI on windows
* better fix for windows
* warn about symlinks in doc
* docs
* fix readlink test
* fix windows ci
* Apply suggestions from code review
  Co-authored-by: Pedro Cuenca <[email protected]>
* Add auto local_dir_use_symlink
* doc update
* fix tests
* style
* broken doc
* test files are overwritten if any
* fix test
* overview
* Update src/huggingface_hub/file_download.py
  Co-authored-by: Lysandre Debut <[email protected]>

---------

Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
1 parent 0cc3822 commit 9a0497e

File tree

9 files changed

+524
-319
lines changed

docs/source/guides/download.mdx

Lines changed: 96 additions & 55 deletions
Original file line numberDiff line numberDiff line change
@@ -5,100 +5,99 @@ stored on the Hub. You can use these functions independently or integrate them i
55
own library, making it more convenient for your users to interact with the Hub. This
66
guide will show you how to:
77

8-
* Download and store a file from the Hub.
9-
* Download all the files in a repository.
8+
* Download and cache a single file.
9+
* Download and cache an entire repository.
10+
* Download files to a local folder.
1011

11-
## Download and store a file from the Hub
12+
## Download a single file
1213

1314
The [`hf_hub_download`] function is the main function for downloading files from the Hub.
15+
It downloads the remote file, caches it on disk (in a version-aware way), and returns its local file path.
1416

15-
It downloads the remote file, stores it on disk (in a version-aware way), and returns its local file path.
17+
<Tip>
1618

17-
Use the `repo_id` and `filename` parameters to specify which file to download:
19+
The returned filepath is a pointer to the HF local cache. Therefore, it is important to not modify the file to avoid
20+
having a corrupted cache. If you are interested in getting to know more about how files are cached, please refer to our
21+
[caching guide](./manage-cache).
22+
23+
</Tip>
24+
25+
### From latest version
26+
27+
Select the file to download using the `repo_id`, `repo_type` and `filename` parameters. By default, the file will
28+
be considered as being part of a `model` repo.
1829

1930
```python
2031
>>> from huggingface_hub import hf_hub_download
2132
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
2233
'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'
23-
```
24-
25-
<div class="flex justify-center">
26-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo.png"/>
27-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo-dark.png"/>
28-
</div>
2934

30-
Specify a particular file version by providing the file revision, which can be the
31-
branch name, a tag, or a commit hash. When using the commit hash, it must be the
32-
full-length hash instead of a 7-character commit hash:
33-
34-
```python
35-
>>> hf_hub_download(
36-
... repo_id="lysandre/arxiv-nlp",
37-
... filename="config.json",
38-
... revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a",
39-
... )
40-
'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/877b84a8f93f2d619faa2a6e514a32beef88ab0a/config.json'
35+
# Download from a dataset
36+
>>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
37+
'/root/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34/fleurs.py'
4138
```
4239

43-
To specify a file revision with the branch name:
44-
45-
```python
46-
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="main")
47-
```
40+
### From specific version
4841

49-
To specify a file revision with a tag identifier. For example, if you want `v1.0` of the
50-
`config.json` file:
42+
By default, the latest version from the `main` branch is downloaded. However, in some cases you want to download a file
43+
at a particular version (e.g. from a specific branch, a PR, a tag or a commit hash).
44+
To do so, use the `revision` parameter:
5145

5246
```python
47+
# Download from the `v1.0` tag
5348
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")
54-
```
5549

56-
To download from a `dataset` or a `space`, specify the `repo_type`. By default, file will
57-
be considered as being part of a `model` repo.
50+
# Download from the `test-branch` branch
51+
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")
5852

59-
```python
60-
>>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
53+
# Download from Pull Request #3
54+
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")
55+
56+
# Download from a specific commit hash
57+
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")
6158
```
6259

63-
## Construct a download URL
60+
**Note:** When using the commit hash, it must be the full-length hash instead of a 7-character commit hash.
61+
62+
### Construct a download URL
6463

6564
In case you want to construct the URL used to download a file from a repo, you can use [`hf_hub_url`] which returns a URL.
6665
Note that it is used internally by [`hf_hub_download`].
6766

6867
## Download an entire repository
6968

70-
[`snapshot_download`] downloads an entire repository at a given revision. Like
71-
[`hf_hub_download`], all downloaded files are cached on your local disk.
69+
[`snapshot_download`] downloads an entire repository at a given revision. It uses [`hf_hub_download`] internally, which
70+
means all downloaded files are also cached on your local disk. Downloads are made concurrently to speed up the process.
7271

73-
Download a whole repository as shown in the following:
72+
To download a whole repository, just pass the `repo_id` and `repo_type`:
7473

7574
```python
7675
>>> from huggingface_hub import snapshot_download
7776
>>> snapshot_download(repo_id="lysandre/arxiv-nlp")
78-
'/home/lysandre/.cache/huggingface/hub/lysandre__arxiv-nlp.894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
77+
'/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
78+
79+
# Or from a dataset
80+
>>> snapshot_download(repo_id="google/fleurs", repo_type="dataset")
81+
'/home/lysandre/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34'
7982
```
8083

81-
[`snapshot_download`] downloads the latest revision by default. If you want a specific
82-
repository revision, use the `revision` parameter:
84+
[`snapshot_download`] downloads the latest revision by default. If you want a specific repository revision, use the
85+
`revision` parameter:
8386

8487
```python
8588
>>> from huggingface_hub import snapshot_download
86-
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="main")
89+
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")
8790
```
8891

89-
In general, it is usually better to download files with [`hf_hub_download`] - if you
90-
already know the file names you need.
91-
[`snapshot_download`] is helpful when you are unaware of which files to download.
92+
### Filter files to download
9293

93-
However, you don't always want to download the contents of an entire repository with
94-
[`snapshot_download`]. Even if you don't know the file name, you can download specific
95-
files if you know the file type with `allow_patterns` and `ignore_patterns`. Use the
96-
`allow_patterns` and `ignore_patterns` arguments to specify which files to download. These
97-
parameters accept either a single pattern or a list of patterns.
94+
[`snapshot_download`] provides an easy way to download a repository. However, you don't always want to download the
95+
entire content of a repository. For example, you might want to prevent downloading all `.bin` files if you know you'll
96+
only use the `.safetensors` weights. You can do that using `allow_patterns` and `ignore_patterns` parameters.
9897

99-
Patterns are Standard Wildcards (globbing patterns) as documented
100-
[here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern
101-
matching is based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).
98+
These parameters accept either a single pattern or a list of patterns. Patterns are Standard Wildcards (globbing
99+
patterns) as documented [here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern matching is
100+
based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).
102101

103102
For example, you can use `allow_patterns` to only download JSON configuration files:
104103

@@ -115,5 +114,47 @@ following example ignores the `.msgpack` and `.h5` file extensions:
115114
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])
116115
```
117116

118-
Passing a pattern can be especially useful when repositories contain files that are never
119-
expected to be downloaded by [`snapshot_download`].
117+
Finally, you can combine both to precisely filter your download. Here is an example to download all JSON and Markdown
118+
files except `vocab.json`.
119+
120+
```python
121+
>>> from huggingface_hub import snapshot_download
122+
>>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
123+
```
124+
125+
## Download file(s) to local folder
126+
127+
The recommended (and default) way to download files from the Hub is to use the [cache-system](./manage-cache).
128+
You can define your cache location by setting `cache_dir` parameter (both in [`hf_hub_download`] and [`snapshot_download`]).
129+
130+
However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow
131+
closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:
132+
- `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the
133+
repo. For example if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be
134+
`"path/to/folder/data/train.csv"`.
135+
- `local_dir_use_symlinks` defines how the file must be saved in your local folder.
136+
- The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks
137+
optimize both bandwidth and disk usage. However, manually editing a symlinked file might corrupt the cache, hence
138+
the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
139+
environment variable.
140+
- If `local_dir_use_symlinks=True` is set, all files are symlinked for optimal disk usage. This is
141+
for example useful when downloading a huge dataset with thousands of small files.
142+
- Finally, if you don't want symlinks at all you can disable them (`local_dir_use_symlinks=False`). The cache directory
143+
will still be used to check whether the file is already cached or not. If already cached, the file is **duplicated**
144+
from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be
145+
downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it
146+
will be **re-downloaded**.
147+
148+
Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.
149+
150+
<!-- Generated with https://www.tablesgenerator.com/markdown_tables -->
151+
| Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
152+
|---|:---:|:---:|:---:|:---:|:---:|:---:|
153+
| `local_dir=None` | | symlink in cache | | ❌<br>_(save would corrupt the cache)_ | | |
154+
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks="auto"` | | file or symlink in folder | | _(for small files)_ <br> ⚠️ _(for big files do not resolve path before saving)_ | | |
155+
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=True` | | symlink in folder | | ⚠️<br>_(do not resolve path before saving)_ | | |
156+
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | No | file in folder | | | ❌<br>_(if re-run, file is re-downloaded)_ | ⚠️<br>_(multiple copies if run in multiple folders)_ |
157+
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | Yes | file in folder | | | ⚠️<br>_(file has to be cached first)_ | ❌<br>_(file is duplicated)_ |
158+
159+
**Note:** if you are on a Windows machine, you need to enable developer mode or run `huggingface_hub` as admin to enable
160+
symlinks. Check out the [cache limitations](../guides/manage-cache#limitations) section for more details.
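
The `local_dir` rules described in the guide above can be summarized in a small decision sketch. This is not the library's implementation: `local_target` and `placement` are hypothetical helpers written only to illustrate the documented behavior, and the handling of a file whose size is exactly the threshold is an assumption.

```python
from pathlib import Path
from typing import Union

# 5MB default, as stated in the guide (configurable through
# HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD).
DEFAULT_THRESHOLD = 5 * 1024 * 1024


def local_target(local_dir: Union[str, Path], filename: str) -> Path:
    """Hypothetical helper: the repo file structure is kept under `local_dir`."""
    return Path(local_dir).joinpath(*filename.split("/"))


def placement(
    file_size: int,
    use_symlinks: Union[bool, str] = "auto",
    cached: bool = True,
    threshold: int = DEFAULT_THRESHOLD,
) -> str:
    """Hypothetical helper: how a file would end up in the local folder."""
    if use_symlinks == "auto":
        # Small files are duplicated so they can be edited safely;
        # bigger files are symlinked to save disk space.
        # Assumption: a file of exactly `threshold` bytes counts as big.
        return "copy" if file_size < threshold else "symlink"
    if use_symlinks:
        # Every file is symlinked, whatever its size.
        return "symlink"
    # No symlinks: duplicate from the cache when possible, otherwise
    # download straight into the local folder without caching.
    return "copy" if cached else "download"


print(local_target("path/to/folder", "data/train.csv").as_posix())
print(placement(10 * 1024 * 1024))
```

With the default `"auto"` setting, a 10MB weights file would be symlinked while a small `config.json` would be copied, which is why the small file can be edited without touching the cache.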

docs/source/guides/overview.mdx

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,7 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
5454
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
5555
href="./community">
5656
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
57-
Community
57+
Community Tab
5858
</div><p class="text-gray-700">
5959
How to interact with the Community tab (Discussions and Pull Requests)?
6060
</p>
@@ -87,5 +87,14 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
8787
</p>
8888
</a>
8989

90+
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
91+
href="./integrations">
92+
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
93+
Integrate a library
94+
</div><p class="text-gray-700">
95+
What does it mean to integrate a library with the Hub? And how to do it?
96+
</p>
97+
</a>
98+
9099
</div>
91100
</div>

docs/source/package_reference/environment_variables.mdx

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,13 @@ Defaults to `"warning"`.
6363

6464
For more details, see [logging reference](../package_reference/utilities#huggingface_hub.utils.logging.get_verbosity).
6565

66+
### HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD
67+
68+
Integer value defining the size (in bytes) below which a file is considered "small". When downloading files to a local directory,
69+
small files will be duplicated to ease user experience while bigger files are symlinked to save disk usage.
70+
71+
For more details, see the [download guide](../guides/download#download-files-to-local-folder).
72+
6673
## Boolean values
6774

6875
The following environment variables expect a boolean value. The variable will be considered

src/huggingface_hub/_snapshot_download.py

Lines changed: 39 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
from .hf_api import HfApi
1616
from .utils import filter_repo_objects, logging, validate_hf_hub_args
1717
from .utils import tqdm as hf_tqdm
18+
from .utils._typing import Literal
1819

1920

2021
logger = logging.get_logger(__name__)
@@ -27,6 +28,8 @@ def snapshot_download(
2728
revision: Optional[str] = None,
2829
repo_type: Optional[str] = None,
2930
cache_dir: Union[str, Path, None] = None,
31+
local_dir: Union[str, Path, None] = None,
32+
local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
3033
library_name: Optional[str] = None,
3134
library_version: Optional[str] = None,
3235
user_agent: Optional[Union[Dict, str]] = None,
@@ -40,15 +43,30 @@ def snapshot_download(
4043
max_workers: int = 8,
4144
tqdm_class: Optional[base_tqdm] = None,
4245
) -> str:
43-
"""Download all files of a repo.
44-
45-
Downloads a whole snapshot of a repo's files at the specified revision. This
46-
is useful when you want all files from a repo, because you don't know which
47-
ones you will need a priori. All files are nested inside a folder in order
48-
to keep their actual filename relative to that folder.
49-
50-
An alternative would be to just clone a repo but this would require that the
51-
user always has git and git-lfs installed, and properly configured.
46+
"""Download repo files.
47+
48+
Download a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from
49+
a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order
50+
to keep their actual filename relative to that folder. You can also filter which files to download using
51+
`allow_patterns` and `ignore_patterns`.
52+
53+
If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
54+
how you want to move those files:
55+
- If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
56+
files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
57+
is to be able to manually edit and save small files without corrupting the cache while saving disk space for
58+
binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
59+
environment variable.
60+
- If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
61+
This is optimal in terms of disk usage but files must not be manually edited.
62+
- If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
63+
local dir. This means disk usage is not optimized.
64+
- Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
65+
files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
66+
they will be re-downloaded entirely.
67+
68+
An alternative would be to clone the repo but this requires git and git-lfs to be installed and properly
69+
configured. It is also not possible to filter which files to download when cloning a repository using git.
5270
5371
Args:
5472
repo_id (`str`):
@@ -61,6 +79,14 @@ def snapshot_download(
6179
`None` or `"model"` if downloading from a model. Default is `None`.
6280
cache_dir (`str`, `Path`, *optional*):
6381
Path to the folder where cached files are stored.
82+
local_dir (`str` or `Path`, *optional*):
83+
If provided, the downloaded files will be placed under this directory, either as symlinks (default) or
84+
regular files (see description for more details).
85+
local_dir_use_symlinks (`"auto"` or `bool`, defaults to `"auto"`):
86+
To be used with `local_dir`. If set to "auto", the cache directory will be used and the file will be either
87+
duplicated or symlinked to the local directory depending on its size. If set to `True`, a symlink will be
88+
created, no matter the file size. If set to `False`, the file will either be duplicated from cache (if
89+
already exists) or downloaded from the Hub and not cached. See description for more details.
6490
library_name (`str`, *optional*):
6591
The name of the library to which the object corresponds.
6692
library_version (`str`, *optional*):
@@ -189,6 +215,8 @@ def _inner_hf_hub_download(repo_file: str):
189215
repo_type=repo_type,
190216
revision=commit_hash,
191217
cache_dir=cache_dir,
218+
local_dir=local_dir,
219+
local_dir_use_symlinks=local_dir_use_symlinks,
192220
library_name=library_name,
193221
library_version=library_version,
194222
user_agent=user_agent,
@@ -213,4 +241,6 @@ def _inner_hf_hub_download(repo_file: str):
213241
tqdm_class=tqdm_class or hf_tqdm,
214242
)
215243

244+
if local_dir is not None:
245+
return str(os.path.realpath(local_dir))
216246
return snapshot_folder
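
The docstring above mentions filtering with `allow_patterns` and `ignore_patterns`, and the guide states that matching is based on `fnmatch`. The sketch below illustrates that idea under those assumptions only. `filter_paths` is a hypothetical stand-in, not the library's `filter_repo_objects`, whose exact behavior is not shown in this diff.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional, Union

Patterns = Optional[Union[str, List[str]]]


def _as_list(patterns: Patterns) -> Optional[List[str]]:
    """Accept either a single pattern or a list of patterns."""
    if patterns is None:
        return None
    if isinstance(patterns, str):
        return [patterns]
    return list(patterns)


def filter_paths(
    paths: Iterable[str],
    allow_patterns: Patterns = None,
    ignore_patterns: Patterns = None,
) -> List[str]:
    """Hypothetical helper: keep allowed paths, then drop ignored ones."""
    allow = _as_list(allow_patterns)
    ignore = _as_list(ignore_patterns)
    kept = []
    for path in paths:
        # Skip paths that match none of the allow patterns (if any were given).
        if allow is not None and not any(fnmatch(path, p) for p in allow):
            continue
        # Skip paths that match at least one ignore pattern.
        if ignore is not None and any(fnmatch(path, p) for p in ignore):
            continue
        kept.append(path)
    return kept


repo_files = ["README.md", "config.json", "vocab.json", "pytorch_model.bin"]
print(filter_paths(repo_files, allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json"))
```

The file list here is made up for illustration. The call mirrors the guide's example of downloading all JSON and Markdown files except `vocab.json`.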

src/huggingface_hub/constants.py

Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,10 +15,10 @@ def _is_true(value: Optional[str]) -> bool:
1515
return value.upper() in ENV_VARS_TRUE_VALUES
1616

1717

18-
def _is_true_or_auto(value: Optional[str]) -> bool:
18+
def _as_int(value: Optional[str]) -> Optional[int]:
1919
if value is None:
20-
return False
21-
return value.upper() in ENV_VARS_TRUE_AND_AUTO_VALUES
20+
return None
21+
return int(value)
2222

2323

2424
# Constants for file downloads
@@ -118,3 +118,11 @@ def _is_true_or_auto(value: Optional[str]) -> bool:
118118
# - https://pypi.org/project/hf-transfer/
119119
# - https://github.com/huggingface/hf_transfer (private)
120120
HF_HUB_ENABLE_HF_TRANSFER: bool = _is_true(os.environ.get("HF_HUB_ENABLE_HF_TRANSFER"))
121+
122+
123+
# Used if download to `local_dir` and `local_dir_use_symlinks="auto"`
124+
# Files smaller than 5MB are copy-pasted while bigger files are symlinked. The idea is to save disk-usage by symlinking
125+
# huge files (i.e. LFS files most of the time) while allowing small files to be manually edited in local folder.
126+
HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD: int = (
127+
_as_int(os.environ.get("HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD")) or 5 * 1024 * 1024
128+
)
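
A minimal standalone sketch of the parsing pattern used for the threshold above, so it can be tried without importing `huggingface_hub`. `read_threshold` is a hypothetical wrapper introduced for illustration; `as_int` mirrors the `_as_int` helper from the diff. One consequence of the `or` fallback is that a value of `0` is falsy and therefore falls back to the 5MB default.

```python
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 5 * 1024 * 1024  # 5MB, same default as in the diff
ENV_NAME = "HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD"


def as_int(value: Optional[str]) -> Optional[int]:
    """Mirror of `_as_int`: None stays None, anything else goes through int()."""
    if value is None:
        return None
    return int(value)  # raises ValueError for non-integer strings


def read_threshold(environ: Mapping[str, str]) -> int:
    """Hypothetical wrapper: read the threshold from an environment mapping."""
    # `or` applies the default when the variable is unset (None)
    # and also when it parses to 0, since 0 is falsy.
    return as_int(environ.get(ENV_NAME)) or DEFAULT_THRESHOLD


print(read_threshold({}))
print(read_threshold({ENV_NAME: "1048576"}))
print(read_threshold({ENV_NAME: "0"}))
```

Because the constant in `constants.py` is evaluated at module level, the environment variable has to be set before the module is first imported for it to take effect.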
