@@ -5,100 +5,99 @@ stored on the Hub. You can use these functions independently or integrate them i
55own library, making it more convenient for your users to interact with the Hub. This
66guide will show you how to:
77
8- * Download and store a file from the Hub.
9- * Download all the files in a repository.
8+ * Download and cache a single file.
9+ * Download and cache an entire repository.
10+ * Download files to a local folder.
1011
11- ## Download and store a file from the Hub
12+ ## Download a single file
1213
1314The [ ` hf_hub_download ` ] function is the main function for downloading files from the Hub.
15+ It downloads the remote file, caches it on disk (in a version-aware way), and returns its local file path.
1416
15- It downloads the remote file, stores it on disk (in a version-aware way), and returns its local file path.
17+ < Tip >
1618
17- Use the ` repo_id ` and ` filename ` parameters to specify which file to download:
19+ The returned filepath points to a file inside the HF local cache. Do not modify that file, or you risk corrupting the
20+ cache. To learn more about how files are cached, please refer to our
21+ [caching guide](./manage-cache).
22+
23+ </Tip >
24+
25+ ### From latest version
26+
27+ Select the file to download using the `repo_id`, `repo_type` and `filename` parameters. By default, the file is
28+ assumed to belong to a `model` repo.
1829
1930``` python
2031>>> from huggingface_hub import hf_hub_download
2132>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
2233'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'
23- ```
24-
25- <div class = " flex justify-center" >
26- <img class = " block dark:hidden" src = " https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo.png" />
27- <img class = " hidden dark:block" src = " https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo-dark.png" />
28- </div >
2934
30- Specify a particular file version by providing the file revision, which can be the
31- branch name, a tag, or a commit hash. When using the commit hash, it must be the
32- full-length hash instead of a 7-character commit hash:
33-
34- ``` python
35- >>> hf_hub_download(
36- ...     repo_id="lysandre/arxiv-nlp",
37- ...     filename="config.json",
38- ...     revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a",
39- ... )
40- '/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/877b84a8f93f2d619faa2a6e514a32beef88ab0a/config.json'
35+ # Download from a dataset
36+ >>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
37+ '/root/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34/fleurs.py'
4138```
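
The paths returned above follow a visible pattern: a folder named after the repo type and repo ID, then `snapshots/<commit hash>/<filename>`. The sketch below reproduces that layout using only what the example outputs in this guide show. The helper names `repo_folder_name` and `cached_file_path` are hypothetical and chosen for illustration; in practice, rely on the path returned by [`hf_hub_download`] rather than building it yourself.

```python
from pathlib import PurePosixPath


def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Folder name as seen in the example outputs, e.g. 'models--lysandre--arxiv-nlp'."""
    # The repo type is pluralized and the "/" in the repo ID becomes "--".
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


def cached_file_path(
    cache_dir: str,
    repo_id: str,
    filename: str,
    commit_hash: str,
    repo_type: str = "model",
) -> str:
    """Rebuild a snapshot path (illustrative only, inferred from the outputs above)."""
    folder = repo_folder_name(repo_id, repo_type)
    return str(PurePosixPath(cache_dir) / folder / "snapshots" / commit_hash / filename)


print(
    cached_file_path(
        "/root/.cache/huggingface/hub",
        "google/fleurs",
        "fleurs.py",
        "199e4ae37915137c555b1765c01477c216287d34",
        repo_type="dataset",
    )
)
```

Because the commit hash is part of the path, two revisions of the same file live side by side, which is what makes the cache version-aware.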
4239
43- To specify a file revision with the branch name:
44-
45- ``` python
46- >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="main")
47- ```
40+ ### From specific version
4841
49- To specify a file revision with a tag identifier. For example, if you want ` v1.0 ` of the
50- ` config.json ` file:
42+ By default, the latest version from the ` main ` branch is downloaded. However, in some cases you want to download a file
43+ at a particular version (e.g. from a specific branch, a PR, a tag or a commit hash).
44+ To do so, use the ` revision ` parameter:
5145
5246``` python
47+ # Download from the `v1.0` tag
5348>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")
54- ```
5549
56- To download from a ` dataset ` or a ` space ` , specify the ` repo_type ` . By default, file will
57- be considered as being part of a ` model ` repo.
50+ # Download from the `test-branch` branch
51+ >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")
5852
59- ``` python
60- >>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
53+ # Download from Pull Request #3
54+ >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")
55+
56+ # Download from a specific commit hash
57+ >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")
6158```
6259
63- ## Construct a download URL
60+ **Note:** When using a commit hash, it must be the full 40-character hash, not the abbreviated 7-character form.
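
If revisions come from user input, it can help to tell a full commit hash apart from an abbreviated one before calling [`hf_hub_download`]. The following is a minimal sketch of such a pre-check. The helper `is_full_commit_hash` is hypothetical and is not claimed to be how the library validates revisions; it only encodes the fact that a full Git commit hash is 40 lowercase hexadecimal characters.

```python
import re

# A full Git (SHA-1) commit hash: exactly 40 lowercase hexadecimal characters.
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(revision: str) -> bool:
    """Return True only for a full-length commit hash (hypothetical helper)."""
    return _FULL_COMMIT_HASH.fullmatch(revision) is not None


print(is_full_commit_hash("877b84a8f93f2d619faa2a6e514a32beef88ab0a"))
print(is_full_commit_hash("877b84a"))
```

Branch names, tags and `refs/pr/<n>` references do not match this pattern, so a `False` result only means "not a full commit hash", not "invalid revision".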
61+
62+ ### Construct a download URL
6463
6564To construct the URL used to download a file from a repo, use [`hf_hub_url`], which returns the URL.
6665Note that [`hf_hub_download`] uses it internally.
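
To make the shape of such a URL concrete, here is a minimal standard-library sketch. It assumes the `resolve` URL layout visible elsewhere in this guide (for example `https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo.png`). The helper `build_resolve_url` is hypothetical, and the percent-encoding of the revision is an assumption; for real use, call [`hf_hub_url`] instead.

```python
from urllib.parse import quote

ENDPOINT = "https://huggingface.co"  # assumed default endpoint


def build_resolve_url(
    repo_id: str,
    filename: str,
    repo_type: str = "model",
    revision: str = "main",
) -> str:
    """Build a download URL following the `resolve` layout (illustrative sketch)."""
    # Model repos sit at the root; other repo types get a plural prefix such as "datasets/".
    prefix = "" if repo_type == "model" else f"{repo_type}s/"
    # Assumption: the revision is fully percent-encoded so that "refs/pr/3" stays one path segment.
    encoded_revision = quote(revision, safe="")
    return f"{ENDPOINT}/{prefix}{repo_id}/resolve/{encoded_revision}/{quote(filename)}"


print(build_resolve_url("lysandre/arxiv-nlp", "config.json"))
print(build_resolve_url("google/fleurs", "fleurs.py", repo_type="dataset"))
```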
6766
6867## Download an entire repository
6968
70- [ ` snapshot_download ` ] downloads an entire repository at a given revision. Like
71- [ ` hf_hub_download ` ] , all downloaded files are cached on your local disk.
69+ [`snapshot_download`] downloads an entire repository at a given revision. It uses [`hf_hub_download`] internally, which
70+ means all downloaded files are also cached on your local disk. Downloads are made concurrently to speed up the process.
7271
73- Download a whole repository as shown in the following :
72+ To download a whole repository, just pass the ` repo_id ` and ` repo_type ` :
7473
7574``` python
7675>>> from huggingface_hub import snapshot_download
7776>>> snapshot_download(repo_id="lysandre/arxiv-nlp")
78- '/home/lysandre/.cache/huggingface/hub/lysandre__arxiv-nlp.894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
77+ '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
78+
79+ # Or from a dataset
80+ >>> snapshot_download(repo_id="google/fleurs", repo_type="dataset")
81+ '/home/lysandre/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34'
7982```
8083
81- [ ` snapshot_download ` ] downloads the latest revision by default. If you want a specific
82- repository revision, use the ` revision ` parameter:
84+ [ ` snapshot_download ` ] downloads the latest revision by default. If you want a specific repository revision, use the
85+ ` revision ` parameter:
8386
8487``` python
8588>>> from huggingface_hub import snapshot_download
86- >>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="main")
89+ >>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")
8790```
8891
89- In general, it is usually better to download files with [ ` hf_hub_download ` ] - if you
90- already know the file names you need.
91- [ ` snapshot_download ` ] is helpful when you are unaware of which files to download.
92+ ### Filter files to download
9293
93- However, you don't always want to download the contents of an entire repository with
94- [ ` snapshot_download ` ] . Even if you don't know the file name, you can download specific
95- files if you know the file type with ` allow_patterns ` and ` ignore_patterns ` . Use the
96- ` allow_patterns ` and ` ignore_patterns ` arguments to specify which files to download. These
97- parameters accept either a single pattern or a list of patterns.
94+ [`snapshot_download`] provides an easy way to download a repository. However, you don't always want to download the
95+ entire content of a repository. For example, you might want to skip all `.bin` files if you know you'll
96+ only use the `.safetensors` weights. You can do that using the `allow_patterns` and `ignore_patterns` parameters.
9897
99- Patterns are Standard Wildcards (globbing patterns) as documented
100- [ here] ( https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm ) . The pattern
101- matching is based on [ ` fnmatch ` ] ( https://docs.python.org/3/library/fnmatch.html ) .
98+ These parameters accept either a single pattern or a list of patterns. Patterns are Standard Wildcards (globbing
99+ patterns) as documented [ here] ( https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm ) . The pattern matching is
100+ based on [ ` fnmatch ` ] ( https://docs.python.org/3/library/fnmatch.html ) .
102101
103102For example, you can use ` allow_patterns ` to only download JSON configuration files:
104103
@@ -115,5 +114,47 @@ following example ignores the `.msgpack` and `.h5` file extensions:
115114>>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])
116115```
117116
118- Passing a pattern can be especially useful when repositories contain files that are never
119- expected to be downloaded by [ ` snapshot_download ` ] .
117+ Finally, you can combine both to precisely filter your download. Here is an example that downloads all JSON and
118+ Markdown files except `vocab.json`.
119+
120+ ``` python
121+ >>> from huggingface_hub import snapshot_download
122+ >>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
123+ ```
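
To see how the two parameters interact without downloading anything, the filtering can be sketched with the standard-library `fnmatch` module that the pattern matching is based on. The helper `filter_files` below is hypothetical, and the file list is made up for illustration. The sketch assumes a file is kept when it matches at least one allow pattern (if any are given) and no ignore pattern.

```python
from fnmatch import fnmatch


def _as_list(patterns):
    # Both parameters accept either a single pattern or a list of patterns.
    return [patterns] if isinstance(patterns, str) else patterns


def filter_files(filenames, allow_patterns=None, ignore_patterns=None):
    """Keep files that match an allow pattern and no ignore pattern (illustrative sketch)."""
    allow = _as_list(allow_patterns)
    ignore = _as_list(ignore_patterns)
    kept = []
    for name in filenames:
        if allow is not None and not any(fnmatch(name, pattern) for pattern in allow):
            continue  # not explicitly allowed
        if ignore is not None and any(fnmatch(name, pattern) for pattern in ignore):
            continue  # explicitly ignored
        kept.append(name)
    return kept


# Made-up file listing, for illustration only.
files = ["README.md", "config.json", "vocab.json", "pytorch_model.bin"]
print(filter_files(files, allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json"))
```

With this listing, `vocab.json` passes the allow step but is then dropped by the ignore step, and `pytorch_model.bin` never matches an allow pattern.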
124+
125+ ## Download file(s) to local folder
126+
127+ The recommended (and default) way to download files from the Hub is to use the [cache-system](./manage-cache).
128+ You can define your cache location by setting the `cache_dir` parameter (both in [`hf_hub_download`] and [`snapshot_download`]).
129+
130+ However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow
131+ closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:
132+ - `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the
133+   repo. For example, if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be
134+   `"path/to/folder/data/train.csv"`.
135+ - `local_dir_use_symlinks` defines how the file must be saved in your local folder.
136+   - The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks
137+     optimize both bandwidth and disk usage. However, manually editing a symlinked file might corrupt the cache, hence
138+     the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
139+     environment variable.
140+   - If `local_dir_use_symlinks=True` is set, all files are symlinked to minimize disk usage. This is
141+     for example useful when downloading a huge dataset with thousands of small files.
142+   - Finally, if you don't want symlinks at all you can disable them (`local_dir_use_symlinks=False`). The cache directory
143+     will still be used to check whether the file is already cached or not. If already cached, the file is **duplicated**
144+     from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be
145+     downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it
146+     will be **re-downloaded**.
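
The rules in the list above can be condensed into a small decision sketch. Both helpers (`local_path` and `placement`) are hypothetical and only mirror the behavior described in this guide; they do not call the library. Two details are assumptions: "5MB" is taken to mean 5 * 1024 * 1024 bytes, and a file of exactly the threshold size is treated as "big".

```python
from pathlib import PurePosixPath

# Assumption: the documented "5MB" default expressed in bytes.
DEFAULT_THRESHOLD = 5 * 1024 * 1024


def local_path(local_dir: str, filename: str) -> str:
    """The repo's file structure is preserved under `local_dir`."""
    return str(PurePosixPath(local_dir) / filename)


def placement(file_size: int, local_dir_use_symlinks="auto", threshold: int = DEFAULT_THRESHOLD) -> str:
    """Return 'file' or 'symlink' according to the rules described above (sketch)."""
    if local_dir_use_symlinks == "auto":
        # Small files are duplicated; bigger files are symlinked to the cache.
        return "file" if file_size < threshold else "symlink"
    # True: always symlink. False: always a regular file in the folder.
    return "symlink" if local_dir_use_symlinks else "file"


print(local_path("path/to/folder", "data/train.csv"))
print(placement(1024))              # small file under "auto"
print(placement(50 * 1024 * 1024))  # big file under "auto"
```

In a real setup the threshold would come from the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD` environment variable rather than a function argument.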
147+
148+ Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.
149+
150+ <!-- Generated with https://www.tablesgenerator.com/markdown_tables -->
151+ | Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
152+ |---|:---:|:---:|:---:|:---:|:---:|:---:|
153+ | `local_dir=None` | | symlink in cache | ✅ | ❌<br>_(save would corrupt the cache)_ | ✅ | ✅ |
154+ | `local_dir="path/to/folder"`<br>`local_dir_use_symlinks="auto"` | | file or symlink in folder | ✅ | ✅ _(for small files)_ <br> ⚠️ _(for big files do not resolve path before saving)_ | ✅ | ✅ |
155+ | `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=True` | | symlink in folder | ✅ | ⚠️<br>_(do not resolve path before saving)_ | ✅ | ✅ |
156+ | `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | No | file in folder | ✅ | ✅ | ❌<br>_(if re-run, file is re-downloaded)_ | ⚠️<br>_(multiple copies if run in multiple folders)_ |
157+ | `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | Yes | file in folder | ✅ | ✅ | ⚠️<br>_(file has to be cached first)_ | ❌<br>_(file is duplicated)_ |
158+
159+ **Note:** if you are on a Windows machine, you need to enable developer mode or run `huggingface_hub` as admin to enable
160+ symlinks. Check out the [cache limitations](../guides/manage-cache#limitations) section for more details.