huggingface
diff --git a/docs/source/guides/download.mdx b/docs/source/guides/download.mdx
Lines changed: 96 additions & 55 deletions
diff --git a/docs/source/guides/overview.mdx b/docs/source/guides/overview.mdx
Lines changed: 10 additions & 1 deletion
diff --git a/docs/source/package_reference/environment_variables.mdx b/docs/source/package_reference/environment_variables.mdx
Lines changed: 7 additions & 0 deletions
diff --git a/src/huggingface_hub/_snapshot_download.py b/src/huggingface_hub/_snapshot_download.py
Lines changed: 39 additions & 9 deletions
diff --git a/src/huggingface_hub/constants.py b/src/huggingface_hub/constants.py
Lines changed: 11 additions & 3 deletions
@@ -5,100 +5,99 @@ stored on the Hub. You can use these functions independently or integrate them i
 own library, making it more convenient for your users to interact with the Hub. This
 guide will show you how to:
 
-* Download and store a file from the Hub.
-* Download all the files in a repository.
+* Download and cache a single file.
+* Download and cache an entire repository.
+* Download files to a local folder. 
 
-## Download and store a file from the Hub
+## Download a single file
 
 The [`hf_hub_download`] function is the main function for downloading files from the Hub.
+It downloads the remote file, caches it on disk (in a version-aware way), and returns its local file path.
 
-It downloads the remote file, stores it on disk (in a version-aware way), and returns its local file path.
+<Tip>
 
-Use the `repo_id` and `filename` parameters to specify which file to download:
+The returned filepath is a pointer to the HF local cache. Therefore, it is important to not modify the file to avoid
+having a corrupted cache. If you are interested in getting to know more about how files are cached, please refer to our
+[caching guide](./manage-cache).
+
+</Tip>
+
+### From latest version
+
+Select the file to download using the `repo_id`, `repo_type` and `filename` parameters. By default, the file will
+be considered as being part of a `model` repo.
 
 ```python
 >>> from huggingface_hub import hf_hub_download
 >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
 '/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'
-```
-
-<div class="flex justify-center">
-<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo.png"/>
-<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/repo-dark.png"/>
-</div>
 
-Specify a particular file version by providing the file revision, which can be the
-branch name, a tag, or a commit hash. When using the commit hash, it must be the
-full-length hash instead of a 7-character commit hash:
-
-```python
->>> hf_hub_download(
-...    repo_id="lysandre/arxiv-nlp",
-...    filename="config.json",
-...    revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a",
-... )
-'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/877b84a8f93f2d619faa2a6e514a32beef88ab0a/config.json'
+# Download from a dataset
+>>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
+'/root/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34/fleurs.py'
 ```
 
-To specify a file revision with the branch name:
-
-```python
->>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="main")
-```
+### From specific version
 
-To specify a file revision with a tag identifier. For example, if you want `v1.0` of the
-`config.json` file:
+By default, the latest version from the `main` branch is downloaded. However, in some cases you want to download a file
+at a particular version (e.g. from a specific branch, a PR, a tag or a commit hash).
+To do so, use the `revision` parameter:
 
 ```python
+# Download from the `v1.0` tag
 >>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")
-```
 
-To download from a `dataset` or a `space`, specify the `repo_type`. By default, file will
-be considered as being part of a `model` repo.
+# Download from the `test-branch` branch
+>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")
 
-```python
->>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
+# Download from Pull Request #3
+>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")
+
+# Download from a specific commit hash
+>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")
 ```
 
-## Construct a download URL
+**Note:** When using the commit hash, it must be the full-length hash instead of a 7-character commit hash.
+
+### Construct a download URL
 
 In case you want to construct the URL used to download a file from a repo, you can use [`hf_hub_url`] which returns a URL.
 Note that it is used internally by [`hf_hub_download`].
 
 ## Download an entire repository
 
-[`snapshot_download`] downloads an entire repository at a given revision. Like
-[`hf_hub_download`], all downloaded files are cached on your local disk.
+[`snapshot_download`] downloads an entire repository at a given revision. It uses [`hf_hub_download`] internally, which
+means all downloaded files are also cached on your local disk. Downloads are made concurrently to speed up the process.
 
-Download a whole repository as shown in the following:
+To download a whole repository, just pass the `repo_id` and `repo_type`:
 
 ```python
 >>> from huggingface_hub import snapshot_download
 >>> snapshot_download(repo_id="lysandre/arxiv-nlp")
-'/home/lysandre/.cache/huggingface/hub/lysandre__arxiv-nlp.894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
+'/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade'
+
+# Or from a dataset
+>>> snapshot_download(repo_id="google/fleurs", repo_type="dataset")
+'/home/lysandre/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34'
 ```
 
-[`snapshot_download`] downloads the latest revision by default. If you want a specific
-repository revision, use the `revision` parameter:
+[`snapshot_download`] downloads the latest revision by default. If you want a specific repository revision, use the
+`revision` parameter:
 
 ```python
 >>> from huggingface_hub import snapshot_download
->>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="main")
+>>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")
 ```
 
-In general, it is usually better to download files with [`hf_hub_download`] - if you
-already know the file names you need.
-[`snapshot_download`] is helpful when you are unaware of which files to download.
+### Filter files to download
 
-However, you don't always want to download the contents of an entire repository with
-[`snapshot_download`]. Even if you don't know the file name, you can download specific
-files if you know the file type with `allow_patterns` and `ignore_patterns`. Use the
-`allow_patterns` and `ignore_patterns` arguments to specify which files to download. These
-parameters accept either a single pattern or a list of patterns.
+[`snapshot_download`] provides an easy way to download a repository. However, you don't always want to download the
+entire content of a repository. For example, you might want to prevent downloading all `.bin` files if you know you'll
+only use the `.safetensors` weights. You can do that using `allow_patterns` and `ignore_patterns` parameters.
 
-Patterns are Standard Wildcards (globbing patterns) as documented
-[here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern
-matching is based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).
+These parameters accept either a single pattern or a list of patterns. Patterns are Standard Wildcards (globbing
+patterns) as documented [here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern matching is
+based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).
 
 For example, you can use `allow_patterns` to only download JSON configuration files:
 
@@ -115,5 +114,47 @@ following example ignores the `.msgpack` and `.h5` file extensions:
 >>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])
 ```
 
-Passing a pattern can be especially useful when repositories contain files that are never
-expected to be downloaded by [`snapshot_download`].
+Finally, you can combine both to precisely filter your download. Here is an example that downloads all JSON and Markdown
+files except `vocab.json`.
+
+```python
+>>> from huggingface_hub import snapshot_download
+>>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
+```
+
+## Download file(s) to local folder
+
+The recommended (and default) way to download files from the Hub is to use the [cache-system](./manage-cache).
+You can define your cache location by setting `cache_dir` parameter (both in [`hf_hub_download`] and [`snapshot_download`]).
+
+However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow
+closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:
+- `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the
+repo. For example if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be
+`"path/to/folder/data/train.csv"`.
+- `local_dir_use_symlinks` defines how the file must be saved in your local folder.
+  - The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks
+    optimize both bandwidth and disk usage. However, manually editing a symlinked file might corrupt the cache, hence
+    the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
+    environment variable.
+  - If `local_dir_use_symlinks=True` is set, all files are symlinked to minimize disk usage. This is useful, for
+    example, when downloading a huge dataset with thousands of small files.
+  - Finally, if you don't want symlinks at all you can disable them (`local_dir_use_symlinks=False`). The cache directory
+    will still be used to check whether the file is already cached or not. If already cached, the file is **duplicated**
+    from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be
+    downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it
+    will be **re-downloaded**.
+
+Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.
+
+<!-- Generated with https://www.tablesgenerator.com/markdown_tables -->
+| Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
+|---|:---:|:---:|:---:|:---:|:---:|:---:|
+| `local_dir=None` |  | symlink in cache | ✅ | ❌<br>_(save would corrupt the cache)_ | ✅ | ✅ |
+| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks="auto"` |  | file or symlink in folder | ✅ | ✅ _(for small files)_ <br> ⚠️ _(for big files do not resolve path before saving)_ | ✅ | ✅ |
+| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=True` |  | symlink in folder | ✅ | ⚠️<br>_(do not resolve path before saving)_ | ✅ | ✅ |
+| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | No | file in folder | ✅ | ✅ | ❌<br>_(if re-run, file is re-downloaded)_ | ⚠️<br>_(multiple copies if run in multiple folders)_ |
+| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | Yes | file in folder | ✅ | ✅ | ⚠️<br>_(file has to be cached first)_ | ❌<br>_(file is duplicated)_ |
+
+**Note:** if you are on a Windows machine, you need to enable developer mode or run `huggingface_hub` as admin to enable
+symlinks. Check out the [cache limitations](../guides/manage-cache#limitations) section for more details.
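
Review note: the way `allow_patterns` and `ignore_patterns` combine in the guide above can be illustrated with a minimal sketch built on the standard `fnmatch` module. The helper name `filter_paths` and the file list are hypothetical and are not taken from the library; the sketch only assumes that a file is kept when it matches at least one allow pattern and no ignore pattern.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional, Union

Patterns = Optional[Union[str, List[str]]]


def filter_paths(
    paths: Iterable[str],
    allow_patterns: Patterns = None,
    ignore_patterns: Patterns = None,
) -> List[str]:
    """Hypothetical helper: keep paths matching any allow pattern and no ignore pattern."""
    # Accept either a single pattern or a list of patterns, as the guide describes.
    if isinstance(allow_patterns, str):
        allow_patterns = [allow_patterns]
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]

    kept = []
    for path in paths:
        # Skip files that match none of the allow patterns (when any are given).
        if allow_patterns is not None and not any(fnmatch(path, p) for p in allow_patterns):
            continue
        # Skip files that match at least one ignore pattern.
        if ignore_patterns is not None and any(fnmatch(path, p) for p in ignore_patterns):
            continue
        kept.append(path)
    return kept


# Made-up file listing, mirroring the "all JSON and Markdown files except vocab.json" example.
repo_files = ["README.md", "config.json", "vocab.json", "pytorch_model.bin"]
selected = filter_paths(repo_files, allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
print(selected)
```

Note that with `fnmatch`, `*` also matches `/`, so a pattern such as `*.json` would match files in subfolders as well.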
@@ -54,7 +54,7 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
     <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
        href="./community">
       <div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
-        Community
+        Community Tab
       </div><p class="text-gray-700">
         How to interact with the Community tab (Discussions and Pull Requests)?
       </p>
@@ -87,5 +87,14 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
       </p>
     </a>
 
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
+       href="./integrations">
+      <div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
+        Integrate a library
+      </div><p class="text-gray-700">
+        What does it mean to integrate a library with the Hub? And how to do it?
+      </p>
+    </a>
+
   </div>
 </div>
@@ -63,6 +63,13 @@ Defaults to `"warning"`.
 
 For more details, see [logging reference](../package_reference/utilities#huggingface_hub.utils.logging.get_verbosity).
 
+### HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD
+
+Integer value to define under which size a file is considered as "small". When downloading files to a local directory,
+small files will be duplicated to ease user experience while bigger files are symlinked to save disk usage.
+
+For more details, see the [download guide](../guides/download#download-files-to-local-folder).
+
 ## Boolean values
 
 The following environment variables expect a boolean value. The variable will be considered
 
@@ -15,6 +15,7 @@
 from .hf_api import HfApi
 from .utils import filter_repo_objects, logging, validate_hf_hub_args
 from .utils import tqdm as hf_tqdm
+from .utils._typing import Literal
 
 
 logger = logging.get_logger(__name__)
@@ -27,6 +28,8 @@ def snapshot_download(
     revision: Optional[str] = None,
     repo_type: Optional[str] = None,
     cache_dir: Union[str, Path, None] = None,
+    local_dir: Union[str, Path, None] = None,
+    local_dir_use_symlinks: Union[bool, Literal["auto"]] = "auto",
     library_name: Optional[str] = None,
     library_version: Optional[str] = None,
     user_agent: Optional[Union[Dict, str]] = None,
@@ -40,15 +43,30 @@ def snapshot_download(
     max_workers: int = 8,
     tqdm_class: Optional[base_tqdm] = None,
 ) -> str:
-    """Download all files of a repo.
-
-    Downloads a whole snapshot of a repo's files at the specified revision. This
-    is useful when you want all files from a repo, because you don't know which
-    ones you will need a priori. All files are nested inside a folder in order
-    to keep their actual filename relative to that folder.
-
-    An alternative would be to just clone a repo but this would require that the
-    user always has git and git-lfs installed, and properly configured.
+    """Download repo files.
+
+    Download a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from
+    a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order
+    to keep their actual filename relative to that folder. You can also filter which files to download using
+    `allow_patterns` and `ignore_patterns`.
+
+    If `local_dir` is provided, the file structure from the repo will be replicated in this location. You can configure
+    how you want to move those files:
+      - If `local_dir_use_symlinks="auto"` (default), files are downloaded and stored in the cache directory as blob
+        files. Small files (<5MB) are duplicated in `local_dir` while a symlink is created for bigger files. The goal
+        is to be able to manually edit and save small files without corrupting the cache while saving disk space for
+        binary files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
+        environment variable.
+      - If `local_dir_use_symlinks=True`, files are downloaded, stored in the cache directory and symlinked in `local_dir`.
+        This is optimal in terms of disk usage but files must not be manually edited.
+      - If `local_dir_use_symlinks=False` and the blob files exist in the cache directory, they are duplicated in the
+        local dir. This means disk usage is not optimized.
+      - Finally, if `local_dir_use_symlinks=False` and the blob files do not exist in the cache directory, then the
+        files are downloaded and directly placed under `local_dir`. This means if you need to download them again later,
+        they will be re-downloaded entirely.
+
+    An alternative would be to clone the repo but this requires git and git-lfs to be installed and properly
+    configured. It is also not possible to filter which files to download when cloning a repository using git.
 
     Args:
         repo_id (`str`):
@@ -61,6 +79,14 @@ def snapshot_download(
             `None` or `"model"` if downloading from a model. Default is `None`.
         cache_dir (`str`, `Path`, *optional*):
             Path to the folder where cached files are stored.
+        local_dir (`str` or `Path`, *optional*):
+            If provided, the downloaded files will be placed under this directory, either as symlinks (default) or
+            regular files (see description for more details).
+        local_dir_use_symlinks (`"auto"` or `bool`, defaults to `"auto"`):
+            To be used with `local_dir`. If set to "auto", the cache directory will be used and the file will be either
+            duplicated or symlinked to the local directory depending on its size. If set to `True`, a symlink will be
+            created, no matter the file size. If set to `False`, the file will either be duplicated from cache (if
+            already exists) or downloaded from the Hub and not cached. See description for more details.
         library_name (`str`, *optional*):
             The name of the library to which the object corresponds.
         library_version (`str`, *optional*):
@@ -189,6 +215,8 @@ def _inner_hf_hub_download(repo_file: str):
             repo_type=repo_type,
             revision=commit_hash,
             cache_dir=cache_dir,
+            local_dir=local_dir,
+            local_dir_use_symlinks=local_dir_use_symlinks,
             library_name=library_name,
             library_version=library_version,
             user_agent=user_agent,
@@ -213,4 +241,6 @@ def _inner_hf_hub_download(repo_file: str):
             tqdm_class=tqdm_class or hf_tqdm,
         )
 
+    if local_dir is not None:
+        return str(os.path.realpath(local_dir))
     return snapshot_folder
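
Review note: the `local_dir_use_symlinks` modes described in the docstring above come down to a small decision per file. The sketch below is one possible reading of that description, not the library's implementation. The helper name `placement` is hypothetical, and treating a file of exactly the threshold size as "big" is an assumption (the docstring only says small files are `<5MB`).

```python
from typing import Union

# 5MB, the default threshold stated in the docstring.
DEFAULT_THRESHOLD = 5 * 1024 * 1024


def placement(
    file_size: int,
    local_dir_use_symlinks: Union[bool, str] = "auto",
    threshold: int = DEFAULT_THRESHOLD,
) -> str:
    """Hypothetical helper: say how a file would end up in `local_dir`."""
    if local_dir_use_symlinks == "auto":
        # Small files are duplicated so they can be edited safely; big files are symlinked.
        # Assumption: a file of exactly `threshold` bytes counts as big.
        return "regular file" if file_size < threshold else "symlink"
    # True: always symlink to the cached blob. False: always a regular file
    # (copied from the cache if present, otherwise downloaded straight to the folder).
    return "symlink" if local_dir_use_symlinks else "regular file"


print(placement(1024))                     # small file, "auto" mode
print(placement(10 * 1024 * 1024))         # big file, "auto" mode
print(placement(1024, True))               # symlinks forced
print(placement(10 * 1024 * 1024, False))  # symlinks disabled
```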
@@ -15,10 +15,10 @@ def _is_true(value: Optional[str]) -> bool:
     return value.upper() in ENV_VARS_TRUE_VALUES
 
 
-def _is_true_or_auto(value: Optional[str]) -> bool:
+def _as_int(value: Optional[str]) -> Optional[int]:
     if value is None:
-        return False
-    return value.upper() in ENV_VARS_TRUE_AND_AUTO_VALUES
+        return None
+    return int(value)
 
 
 # Constants for file downloads
@@ -118,3 +118,11 @@ def _is_true_or_auto(value: Optional[str]) -> bool:
 # - https://pypi.org/project/hf-transfer/
 # - https://github.com/huggingface/hf_transfer (private)
 HF_HUB_ENABLE_HF_TRANSFER: bool = _is_true(os.environ.get("HF_HUB_ENABLE_HF_TRANSFER"))
+
+
+# Used if download to `local_dir` and `local_dir_use_symlinks="auto"`
+# Files smaller than 5MB are copy-pasted while bigger files are symlinked. The idea is to save disk-usage by symlinking
+# huge files (i.e. LFS files most of the time) while allowing small files to be manually edited in local folder.
+HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD: int = (
+    _as_int(os.environ.get("HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD")) or 5 * 1024 * 1024
+)
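
Review note: the `_as_int(...) or 5 * 1024 * 1024` expression above has one behavior worth spelling out. The sketch below reproduces `_as_int` from this diff and wraps the lookup in a hypothetical `resolve_threshold` helper that takes a mapping instead of reading `os.environ`, so that it can be exercised in isolation. Because `0` is falsy in Python, a value of `"0"` falls back to the 5MB default rather than meaning "always symlink".

```python
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 5 * 1024 * 1024  # 5MB default from the diff


def _as_int(value: Optional[str]) -> Optional[int]:
    # Same logic as the helper added in constants.py.
    if value is None:
        return None
    return int(value)


def resolve_threshold(environ: Mapping[str, str]) -> int:
    """Hypothetical wrapper around the expression used in constants.py."""
    # `or` falls back to the default when the variable is unset (None) or parses to 0.
    return _as_int(environ.get("HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD")) or DEFAULT_THRESHOLD


unset = resolve_threshold({})
one_mb = resolve_threshold({"HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD": "1048576"})
zero = resolve_threshold({"HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD": "0"})
print(unset, one_mb, zero)
```

A non-integer value such as `"5MB"` would raise `ValueError` from `int()`, so the variable is expected to hold a plain number of bytes.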