Commit 7d7cce8

Allowlist and denylist when uploading a folder (#994)
* Allowlist and denylist when uploading a folder
* renaming in documentation
* wrong arg name
* add test case
* documentation
* do not re-run tests when updating doc
* doc
* Deprecate allow_regex and ignore_regex usage
* FIX merge problem
* move paths utils to private module + indentation in doc
1 parent 17cf79a commit 7d7cce8

13 files changed: 414 additions and 71 deletions.

.github/workflows/python-tests.yml

Lines changed: 4 additions & 0 deletions
@@ -5,8 +5,12 @@ on:
     branches:
       - main
       - ci_*
+    paths-ignore:
+      - "docs/**"
   pull_request:
     types: [assigned, opened, synchronize, reopened]
+    paths-ignore:
+      - "docs/**"

 env:
   HUGGINGFACE_CO_STAGING: yes

docs/source/how-to-downstream.mdx

Lines changed: 11 additions & 11 deletions
@@ -85,28 +85,28 @@ already know the file names you need.

 However, you don't always want to download the contents of an entire repository with
 [`snapshot_download`]. Even if you don't know the file name, you can download specific
-files if you know the file type with `allow_regex` and `ignore_regex`. Use the
-`allow_regex` and `ignore_regex` arguments to specify which files to download. These
-parameters accept either a single regex or a list of regexes.
+files if you know the file type with `allow_patterns` and `ignore_patterns`. Use the
+`allow_patterns` and `ignore_patterns` arguments to specify which files to download. These
+parameters accept either a single pattern or a list of patterns.

-The regex matching is based on
-[`fnmatch`](https://docs.python.org/3/library/fnmatch.html), which provides support for
-Unix shell-style wildcards.
+Patterns are Standard Wildcards (globbing patterns) as documented
+[here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern
+matching is based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).

-For example, you can use `allow_regex` to only download JSON configuration files:
+For example, you can use `allow_patterns` to only download JSON configuration files:

 ```python
 >>> from huggingface_hub import snapshot_download
->>> snapshot_download(repo_id="lysandre/arxiv-nlp", allow_regex="*.json")
+>>> snapshot_download(repo_id="lysandre/arxiv-nlp", allow_patterns="*.json")
 ```

-On the other hand, `ignore_regex` can exclude certain files from being downloaded. The
+On the other hand, `ignore_patterns` can exclude certain files from being downloaded. The
 following example ignores the `.msgpack` and `.h5` file extensions:

 ```python
 >>> from huggingface_hub import snapshot_download
->>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_regex=["*.msgpack", "*.h5"])
+>>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])
 ```

-Passing a regex can be especially useful when repositories contain files that are never
+Passing a pattern can be especially useful when repositories contain files that are never
 expected to be downloaded by [`snapshot_download`].
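
The wildcard semantics described in this guide can be tried locally with Python's standard `fnmatch` module, which the guide names as the matching backend. The snippet below is only an illustrative sketch: the file names are made up, and `fnmatchcase` is used instead of `fnmatch` so that the result does not depend on the platform's case rules. One thing it shows is that, unlike shell globbing, `*` in `fnmatch` also matches the `/` separator.

```python
from fnmatch import fnmatchcase

# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer/vocab.json",
    "model.h5",
    "flax_model.msgpack",
    "README.md",
]

# Allowlist: keep only the files matching the pattern.
json_only = [f for f in repo_files if fnmatchcase(f, "*.json")]

# Denylist: drop the files matching any of the patterns.
ignored = ["*.msgpack", "*.h5"]
without_weights = [
    f for f in repo_files if not any(fnmatchcase(f, p) for p in ignored)
]

print(json_only)        # "*.json" also matches the file in the sub-folder
print(without_weights)
```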

docs/source/how-to-upstream.mdx

Lines changed: 15 additions & 8 deletions
@@ -39,24 +39,31 @@ Specify the path of the file to upload, where you want to upload the file to in
 ```py
 >>> from huggingface_hub import HfApi
 >>> api = HfApi()
->>> api.upload_file(path_or_fileobj="/path/to/local/folder/README.md",
-...                 path_in_repo="README.md",
-...                 repo_id="username/test-dataset",
-...                 repo_type="dataset",
+>>> api.upload_file(
+...     path_or_fileobj="/path/to/local/folder/README.md",
+...     path_in_repo="README.md",
+...     repo_id="username/test-dataset",
+...     repo_type="dataset",
 ... )
 ```

 ### Upload a folder

 Use the [`upload_folder`] function to upload a local folder to an existing repository. Specify the path of the local folder to upload, where you want to upload the folder to in the repository, and the name of the repository you want to add the folder to. Depending on your repository type, you can optionally set the repository type as a `dataset`, `model`, or `space`.

+Use the `allow_patterns` and `ignore_patterns` arguments to specify which files to upload. These parameters accept either a single pattern or a list of patterns.
+Patterns are Standard Wildcards (globbing patterns) as documented [here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm).
+If both `allow_patterns` and `ignore_patterns` are provided, both constraints apply. By default, all files from the folder are uploaded.
+
 ```py
 >>> from huggingface_hub import HfApi
 >>> api = HfApi()
->>> api.upload_folder(folder_path="/path/to/local/folder",
-...                   path_in_repo="my-dataset/train",
-...                   repo_id="username/test-dataset",
-...                   repo_type="dataset",
+>>> api.upload_folder(
+...     folder_path="/path/to/local/folder",
+...     path_in_repo="my-dataset/train",
+...     repo_id="username/test-dataset",
+...     repo_type="dataset",
+...     ignore_patterns="**/logs/*.txt",
 ... )
 ```

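
What "both constraints apply" means can be sketched with the standard library. This is an assumption about the matching behaviour, based on the `fnmatch` backend named in the download guide, and not the library's own code; the paths are hypothetical. A file is kept only if it matches at least one allow pattern and no ignore pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical folder content, as paths relative to the uploaded folder.
paths = ["train/data.csv", "train/logs/run.txt", "logs/top.txt", "notes.txt"]

allow_patterns = ["*.txt"]
ignore_patterns = ["**/logs/*.txt"]

uploaded = [
    p
    for p in paths
    if any(fnmatchcase(p, pat) for pat in allow_patterns)
    and not any(fnmatchcase(p, pat) for pat in ignore_patterns)
]

print(uploaded)
```

Under plain `fnmatch` matching, `**/logs/*.txt` requires a `/` before `logs`, so a top-level `logs/top.txt` is not excluded in this sketch. Adding a second pattern such as `logs/*.txt` would cover that case.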

src/huggingface_hub/_snapshot_download.py

Lines changed: 24 additions & 37 deletions
@@ -1,43 +1,22 @@
 import os
-from fnmatch import fnmatch
 from pathlib import Path
 from typing import Dict, List, Optional, Union

 from .constants import DEFAULT_REVISION, HUGGINGFACE_HUB_CACHE, REPO_TYPES
 from .file_download import REGEX_COMMIT_HASH, hf_hub_download, repo_folder_name
 from .hf_api import HfApi, HfFolder
-from .utils import logging
+from .utils import filter_repo_objects, logging
+from .utils._deprecation import _deprecate_arguments


 logger = logging.get_logger(__name__)


-def _filter_repo_files(
-    *,
-    repo_files: List[str],
-    allow_regex: Optional[Union[List[str], str]] = None,
-    ignore_regex: Optional[Union[List[str], str]] = None,
-) -> List[str]:
-    allow_regex = [allow_regex] if isinstance(allow_regex, str) else allow_regex
-    ignore_regex = [ignore_regex] if isinstance(ignore_regex, str) else ignore_regex
-    filtered_files = []
-    for repo_file in repo_files:
-        # if there's an allowlist, skip download if file does not match any regex
-        if allow_regex is not None and not any(
-            fnmatch(repo_file, r) for r in allow_regex
-        ):
-            continue
-
-        # if there's a denylist, skip download if file does matches any regex
-        if ignore_regex is not None and any(
-            fnmatch(repo_file, r) for r in ignore_regex
-        ):
-            continue
-
-        filtered_files.append(repo_file)
-    return filtered_files
-
-
+@_deprecate_arguments(
+    version="0.12",
+    deprecated_args={"allow_regex", "ignore_regex"},
+    custom_message="Please use `allow_patterns` and `ignore_patterns` instead.",
+)
 def snapshot_download(
     repo_id: str,
     *,
@@ -54,6 +33,8 @@ def snapshot_download(
     local_files_only: Optional[bool] = False,
     allow_regex: Optional[Union[List[str], str]] = None,
     ignore_regex: Optional[Union[List[str], str]] = None,
+    allow_patterns: Optional[Union[List[str], str]] = None,
+    ignore_patterns: Optional[Union[List[str], str]] = None,
 ) -> str:
     """Download all files of a repo.

@@ -98,10 +79,10 @@ def snapshot_download(
         local_files_only (`bool`, *optional*, defaults to `False`):
             If `True`, avoid downloading the file and return the path to the
             local cached file if it exists.
-        allow_regex (`list of str`, `str`, *optional*):
-            If provided, only files matching this regex are downloaded.
-        ignore_regex (`list of str`, `str`, *optional*):
-            If provided, files matching this regex are not downloaded.
+        allow_patterns (`List[str]` or `str`, *optional*):
+            If provided, only files matching at least one pattern are downloaded.
+        ignore_patterns (`List[str]` or `str`, *optional*):
+            If provided, files matching any of the patterns are not downloaded.

     Returns:
         Local folder path (string) of repo snapshot
@@ -119,7 +100,6 @@ def snapshot_download(

     </Tip>
     """
-
     if cache_dir is None:
         cache_dir = HUGGINGFACE_HUB_CACHE
     if revision is None:
@@ -151,6 +131,13 @@ def snapshot_download(
         cache_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type)
     )

+    # TODO: remove these 4 lines in version 0.12
+    # Deprecated code to ensure backward compatibility.
+    if allow_regex is not None:
+        allow_patterns = allow_regex
+    if ignore_regex is not None:
+        ignore_patterns = ignore_regex
+
     # if we have no internet connection we will look for an
     # appropriate folder in the cache
     # If the specified revision is a commit hash, look inside "snapshots".
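
The decorator applied to `snapshot_download` and the fallback lines in this hunk work together: one warns about the old argument names, the other keeps old calls working until 0.12. A minimal sketch of that idea follows. The decorator `warn_on_deprecated_kwargs` and the function `download` are hypothetical stand-ins written for this illustration; they are not the library's `_deprecate_arguments` implementation, which this diff does not show.

```python
import warnings
from functools import wraps


def warn_on_deprecated_kwargs(version, deprecated_args, custom_message=""):
    """Hypothetical decorator: warn when a deprecated keyword argument is passed."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Only arguments that are explicitly set (not None) trigger the warning.
            used = sorted(n for n in deprecated_args if kwargs.get(n) is not None)
            if used:
                warnings.warn(
                    f"Deprecated argument(s) used in '{func.__name__}': "
                    f"{', '.join(used)}. Will not be supported from version "
                    f"'{version}'. {custom_message}",
                    FutureWarning,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@warn_on_deprecated_kwargs(
    version="0.12",
    deprecated_args={"allow_regex", "ignore_regex"},
    custom_message="Please use `allow_patterns` and `ignore_patterns` instead.",
)
def download(*, allow_regex=None, ignore_regex=None, allow_patterns=None, ignore_patterns=None):
    # Backward-compatible fallback, mirroring the lines added in this hunk.
    if allow_regex is not None:
        allow_patterns = allow_regex
    if ignore_regex is not None:
        ignore_patterns = ignore_regex
    return allow_patterns, ignore_patterns
```

With this fallback, a deprecated argument silently wins if both the old and the new name are passed, which is worth keeping in mind while the two coexist.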
@@ -181,10 +168,10 @@ def snapshot_download(
     repo_info = _api.repo_info(
         repo_id=repo_id, repo_type=repo_type, revision=revision, token=token
     )
-    filtered_repo_files = _filter_repo_files(
-        repo_files=[f.rfilename for f in repo_info.siblings],
-        allow_regex=allow_regex,
-        ignore_regex=ignore_regex,
+    filtered_repo_files = filter_repo_objects(
+        items=[f.rfilename for f in repo_info.siblings],
+        allow_patterns=allow_patterns,
+        ignore_patterns=ignore_patterns,
     )
     commit_hash = repo_info.sha
     snapshot_folder = os.path.join(storage_folder, "snapshots", commit_hash)

src/huggingface_hub/hf_api.py

Lines changed: 39 additions & 6 deletions
@@ -51,7 +51,7 @@
     REPO_TYPES_URL_PREFIXES,
     SPACES_SDK_TYPES,
 )
-from .utils import logging
+from .utils import filter_repo_objects, logging
 from .utils._deprecation import _deprecate_positional_args
 from .utils._errors import (
     _raise_convert_bad_request,
@@ -2215,6 +2215,8 @@ def upload_folder(
         revision: Optional[str] = None,
         create_pr: Optional[bool] = None,
         parent_commit: Optional[str] = None,
+        allow_patterns: Optional[Union[List[str], str]] = None,
+        ignore_patterns: Optional[Union[List[str], str]] = None,
     ):
         """
         Upload a local folder to the given repo. The upload is done
@@ -2224,6 +2226,13 @@ def upload_folder(
         The structure of the folder will be preserved. Files with the same name
         already present in the repository will be overwritten, others will be left untouched.

+        Use the `allow_patterns` and `ignore_patterns` arguments to specify which files
+        to upload. These parameters accept either a single pattern or a list of
+        patterns. Patterns are Standard Wildcards (globbing patterns) as documented
+        [here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). If both
+        `allow_patterns` and `ignore_patterns` are provided, both constraints apply. By
+        default, all files from the folder are uploaded.
+
         Uses `HfApi.create_commit` under the hood.

         Args:
@@ -2259,7 +2268,10 @@ def upload_folder(
                 If specified and `create_pr` is `True`, the pull request will be created from `parent_commit`.
                 Specifying `parent_commit` ensures the repo has not changed before committing the changes, and can be
                 especially useful if the repo is updated / committed to concurrently.
-
+            allow_patterns (`List[str]` or `str`, *optional*):
+                If provided, only files matching at least one pattern are uploaded.
+            ignore_patterns (`List[str]` or `str`, *optional*):
+                If provided, files matching any of the patterns are not uploaded.

         Returns:
             `str`: A URL to visualize the uploaded folder on the hub
@@ -2284,6 +2296,7 @@ def upload_folder(
         ...     repo_id="username/my-dataset",
         ...     repo_type="datasets",
         ...     token="my_token",
+        ...     ignore_patterns="**/logs/*.txt",
         ... )
         # "https://huggingface.co/datasets/username/my-dataset/tree/main/remote/experiment/checkpoints"

@@ -2312,7 +2325,12 @@ def upload_folder(
             else f"Upload {path_in_repo} with huggingface_hub"
         )

-        files_to_add = _prepare_upload_folder_commit(folder_path, path_in_repo)
+        files_to_add = _prepare_upload_folder_commit(
+            folder_path,
+            path_in_repo,
+            allow_patterns=allow_patterns,
+            ignore_patterns=ignore_patterns,
+        )

         pr_url = self.create_commit(
             repo_type=repo_type,
@@ -3232,9 +3250,16 @@ def delete_token(cls):


 def _prepare_upload_folder_commit(
-    folder_path: str, path_in_repo: str
+    folder_path: str,
+    path_in_repo: str,
+    allow_patterns: Optional[Union[List[str], str]] = None,
+    ignore_patterns: Optional[Union[List[str], str]] = None,
 ) -> List[CommitOperationAdd]:
-    """Generate the list of Add operations for a commit to upload a folder."""
+    """Generate the list of Add operations for a commit to upload a folder.
+
+    Files not matching the `allow_patterns` (allowlist) and `ignore_patterns` (denylist)
+    constraints are discarded.
+    """
     folder_path = os.path.normpath(os.path.expanduser(folder_path))
     if not os.path.isdir(folder_path):
         raise ValueError(f"Provided path: '{folder_path}' is not a directory")
@@ -3252,7 +3277,15 @@ def _prepare_upload_folder_commit(
                     ).replace(os.sep, "/"),
                 )
             )
-    return files_to_add
+
+    return list(
+        filter_repo_objects(
+            files_to_add,
+            allow_patterns=allow_patterns,
+            ignore_patterns=ignore_patterns,
+            key=lambda x: x.path_in_repo,
+        )
+    )


 def _parse_revision_from_pr_url(pr_url: str) -> str:
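
The `key` argument used in `_prepare_upload_folder_commit` lets the same filter work on objects rather than plain strings. The body of `filter_repo_objects` is not part of the files shown here, so the following is a sketch under assumptions: it reuses the logic of the removed `_filter_repo_files` helper, and `filter_objects_sketch` and `AddOperation` are hypothetical names standing in for `filter_repo_objects` and `CommitOperationAdd`.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar, Union

T = TypeVar("T")


def filter_objects_sketch(
    items: Iterable[T],
    *,
    allow_patterns: Optional[Union[List[str], str]] = None,
    ignore_patterns: Optional[Union[List[str], str]] = None,
    key: Optional[Callable[[T], str]] = None,
) -> Iterator[T]:
    """Hypothetical stand-in: yield items whose path passes both filters."""
    if isinstance(allow_patterns, str):
        allow_patterns = [allow_patterns]
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]
    for item in items:
        # `key` extracts the path to match; plain strings are matched directly.
        path = key(item) if key is not None else str(item)
        # Allowlist: skip the item if it matches none of the patterns.
        if allow_patterns is not None and not any(
            fnmatchcase(path, p) for p in allow_patterns
        ):
            continue
        # Denylist: skip the item if it matches any of the patterns.
        if ignore_patterns is not None and any(
            fnmatchcase(path, p) for p in ignore_patterns
        ):
            continue
        yield item


@dataclass
class AddOperation:
    """Hypothetical stand-in for `CommitOperationAdd`."""

    path_in_repo: str


operations = [AddOperation("ckpt/model.bin"), AddOperation("ckpt/logs/train.txt")]
kept = list(
    filter_objects_sketch(
        operations,
        ignore_patterns="**/logs/*.txt",
        key=lambda op: op.path_in_repo,
    )
)
print([op.path_in_repo for op in kept])
```

Writing the filter as a generator is consistent with the `list(...)` wrapper in the diff, although that is an inference from the call site rather than something the diff states.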

src/huggingface_hub/hub_mixin.py

Lines changed: 14 additions & 1 deletion
@@ -2,7 +2,7 @@
 import os
 import tempfile
 from pathlib import Path
-from typing import Dict, Optional, Union
+from typing import Dict, List, Optional, Union

 import requests
 from huggingface_hub import hf_api
@@ -268,6 +268,8 @@ def push_to_hub(
         token: Optional[str] = None,
         branch: Optional[str] = None,
         create_pr: Optional[bool] = None,
+        allow_patterns: Optional[Union[List[str], str]] = None,
+        ignore_patterns: Optional[Union[List[str], str]] = None,
         # TODO (release 0.12): signature must be the following
         # repo_id: str,
         # *,
@@ -278,10 +280,15 @@ def push_to_hub(
         # branch: Optional[str] = None,
         # create_pr: Optional[bool] = None,
         # config: Optional[dict] = None,
+        # allow_patterns: Optional[Union[List[str], str]] = None,
+        # ignore_patterns: Optional[Union[List[str], str]] = None,
     ) -> str:
         """
         Upload model checkpoint to the Hub.

+        Use `allow_patterns` and `ignore_patterns` to precisely filter which files
+        should be pushed to the hub. See [`upload_folder`] reference for more details.
+
         Parameters:
             repo_id (`str`, *optional*):
                 Repository name to which push.
@@ -304,6 +311,10 @@ def push_to_hub(
                 Defaults to `False`.
             config (`dict`, *optional*):
                 Configuration object to be saved alongside the model weights.
+            allow_patterns (`List[str]` or `str`, *optional*):
+                If provided, only files matching at least one pattern are pushed.
+            ignore_patterns (`List[str]` or `str`, *optional*):
+                If provided, files matching any of the patterns are not pushed.

         Returns:
             The url of the commit of your model in the given repository.
@@ -334,6 +345,8 @@ def push_to_hub(
                     commit_message=commit_message,
                     revision=branch,
                     create_pr=create_pr,
+                    allow_patterns=allow_patterns,
+                    ignore_patterns=ignore_patterns,
                 )

         # If the repo id is None, it means we use the deprecated version using Git
