The current `hf cache scan` and `hf cache delete` are not very satisfying. The scan output is usually unreadable as-is due to long lines that wrap in the terminal (mostly because of long local paths being printed, which add little value). The delete UX is annoying as well, with manual steps required to select which revisions/repos to delete. These problems have been highlighted multiple times in the past:
- Automatic pruning of previously downloaded files superceded by updated files #1013
- 🚩 Scan-cache tool: select columns, sort by size, csv-style output #1024
- Add options to the "delete-cache" command #1065
- `huggingface-cli delete-cache --disable-tui` improvements #1997
- Easier Way use delete-cache --disable-tui? #2935
The main proposition of this RFC is to articulate the CLI around 3 commands, mostly inspired by the well-established `docker` CLI:

- `hf cache ls` (inspired by `docker image ls` and `docker container ls`)
- `hf cache rm` (inspired by `docker image rm` and `docker container rm`)
- `hf cache prune` (inspired by `docker image prune` and `docker container prune`)
The current `hf cache scan` and `hf cache delete` would be removed for the v1 release. We would also completely get rid of the optional dependency `InquirerPy` (and therefore of the `huggingface_hub[cli]` extra).
All the logic to achieve the features below already exists in `huggingface_hub`. It's just a matter of exposing it nicely in the UI.
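To illustrate, a minimal sketch of how `hf cache ls` could be built on the existing scanning logic. `scan_cache_dir()` and its `repos`/`revisions`/`refs` attributes are real `huggingface_hub` APIs; the `repo_row` formatting below is purely illustrative, not the proposed implementation.

```python
def repo_row(repo) -> str:
    """Build the proposed `<repo_type>/<repo_id>` ID plus size and refs."""
    refs = ", ".join(sorted({ref for rev in repo.revisions for ref in rev.refs}))
    return f"{repo.repo_type}/{repo.repo_id}\t{repo.size_on_disk_str}\t{refs}"

if __name__ == "__main__":
    # Reuse the existing cache-scanning API instead of reimplementing it.
    from huggingface_hub import scan_cache_dir

    for repo in sorted(scan_cache_dir().repos, key=lambda r: (r.repo_type, r.repo_id)):
        print(repo_row(repo))
```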
## `hf cache ls`

### List all cached repos

Roughly equivalent to the current `hf cache scan`:

```sh
hf cache ls
```

The output would be less verbose:
- ID combines repo type and repo id (useful for later reuse)
- nb files is removed
- path is removed
```
ID                                                                          SIZE    LAST_ACCESSED  LAST_MODIFIED  REFS
dataset/IPEC-COMMUNITY/fractal20220817_data_lerobot                         87.4M   5 months ago   5 months ago   main
dataset/Narsil/image_dummy                                                  292.4K  4 weeks ago    9 months ago   main
dataset/Wauplin/gpt-oss-rag-dev                                             84.0M   2 months ago   2 months ago   main
dataset/Wauplin/test-template                                               3.4K    5 weeks ago    2 months ago   main
dataset/__DUMMY_TRANSFORMERS_USER__/test-dataset-4ce69c-17585454341099      165.0   3 weeks ago    3 weeks ago    main
dataset/facebook/wiki_dpr                                                   8.6K    4 months ago   4 months ago   main
dataset/hf-internal-testing/_dummy                                          0.0     1 week ago     1 week ago
model/xai-org/grok-1                                                        18.9M   2 years ago    2 years ago    refs/pr/25, refs/pr/28
```
### List all revisions

Roughly equivalent to the current `hf cache scan --verbose`:

```sh
hf cache ls --revisions
```
### Filter by total size

```sh
hf cache ls --filter "size>1000000"
hf cache ls --filter "size<1MB"
```

Let's say that filters are case-insensitive.
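The case-insensitive size syntax could be parsed with a small helper. The accepted unit names and the bare-bytes fallback below are assumptions for illustration, not a spec:

```python
import re

# Assumed decimal units; a real implementation might also accept KiB/MiB.
_UNITS = {"": 1, "b": 1, "k": 10**3, "kb": 10**3, "m": 10**6, "mb": 10**6,
          "g": 10**9, "gb": 10**9, "t": 10**12, "tb": 10**12}

def parse_size(value: str) -> int:
    """Parse '1MB', '500k', or '1000000' into bytes, case-insensitively."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", value.strip().lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"invalid size: {value!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])
```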
### Filter by last modified or accessed

Docker is able to handle many time representations ("7d", "2024-05-01", a timestamp, ISO date format, timezone, etc.). In practice, it's already good if we can handle semantic terms like `10s`, `m`, `h`, `d`, `w`, `mo` and `y`, plus timestamps.

```sh
hf cache ls --filter "modified>7d"
hf cache ls --filter "accessed>1y"
```
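The semantic duration terms could be parsed like this. Treating a month as 30 days and a year as 365 days is an approximation made here for illustration; the exact semantics would need to be pinned down:

```python
import re
from datetime import timedelta

# Approximate span lengths in seconds (assumption: mo=30d, y=365d).
_SPANS = {"s": 1, "m": 60, "h": 3600, "d": 86400,
          "w": 7 * 86400, "mo": 30 * 86400, "y": 365 * 86400}

def parse_duration(value: str) -> timedelta:
    """Parse '7d' or '2mo' into a timedelta; a bare number means seconds."""
    # 'mo' must be tried before 'm' so '2mo' is not read as 2 minutes + junk.
    match = re.fullmatch(r"(\d+)\s*(s|mo|m|h|d|w|y)?", value.strip().lower())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    return timedelta(seconds=int(match.group(1)) * _SPANS[match.group(2) or "s"])
```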
### Filter by repo type

```sh
hf cache ls --filter "type=dataset"
```
### Combine filters

E.g. "give me all models of at least 1MB not accessed for a year":

```sh
hf cache ls --filter "type=model" --filter "size>1000000" --filter "accessed>1y"
```

Filters are processed as a logical AND. Let's not support "OR".
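The AND semantics could be as simple as turning each `--filter` into a predicate and listing a repo only if every predicate passes. The dict-based repo representation and the example predicates below are placeholders for the real parsed filters:

```python
from typing import Callable

def matches_all(repo: dict, predicates: list[Callable[[dict], bool]]) -> bool:
    """Logical AND across all --filter predicates (no OR support)."""
    return all(pred(repo) for pred in predicates)

# Example predicates over a repo reduced to a plain dict:
def is_model(repo: dict) -> bool:
    return repo["type"] == "model"

def is_large(repo: dict) -> bool:
    return repo["size"] > 1_000_000
```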
### Quiet mode: print only IDs

```sh
hf cache ls --filter "accessed>1y" -q
hf cache ls --filter "accessed>1y" --quiet
```

IDs would be something like `model/meta-llama/Llama-2-70b-hf`.
### Custom format

The default output format is a table, but one could want CSV or JSON. Docker handles custom templates, but we don't need that much flexibility.

```sh
hf cache ls --format json
hf cache ls --format csv
```
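A sketch of the `--format` rendering, assuming each repo has already been reduced to a flat dict of columns (the column names would mirror the table above):

```python
import csv
import io
import json

def render(rows: list[dict], fmt: str) -> str:
    """Render pre-flattened rows as JSON or CSV (the table path is elsewhere)."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```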
## `hf cache rm`

### Delete specific revision(s)

```sh
hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e
hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e 1bb3f918c345c9d351dd5434c6fda5153506f8c5
```

### Delete specific repo(s)

```sh
hf cache rm model/meta-llama/Llama-2-70b-hf
hf cache rm model/meta-llama/Llama-2-70b-hf dataset/facebook/wiki_dpr
```
### Delete repos based on a query

Same as for `docker`, we use the quiet mode. E.g. "delete all repos not accessed in the last year":

```sh
hf cache rm $(hf cache ls --filter "accessed>1y" -q)
```

or on Unix:

```sh
hf cache ls --filter "accessed>1y" -q | xargs hf cache rm
```
### Confirmation step / dry-run

It would be good to have a confirmation step by default, skippable with a flag:

```sh
hf cache rm ... -y
```

Alternatively (or in addition), we could have a dry-run mode:

```sh
hf cache rm ... --dry-run
```
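The confirmation flow could look like the sketch below, with the prompt interpretation kept as a pure function. The function names and the exact prompt wording are made up for illustration:

```python
def is_confirmed(answer: str) -> bool:
    """Interpret a y/N prompt answer; anything but y/yes means no."""
    return answer.strip().lower() in ("y", "yes")

def confirm_deletion(n_revisions: int, size_str: str, assume_yes: bool = False) -> bool:
    """Ask before deleting; `-y` maps to assume_yes=True and skips the prompt."""
    if assume_yes:
        return True
    answer = input(f"About to delete {n_revisions} revisions ({size_str}). Proceed? [y/N]: ")
    return is_confirmed(answer)
```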
## `hf cache prune`

### Delete all detached revisions

When downloading the same repo over time, the user might accumulate several revisions in the cache. Revisions can be linked to git refs (e.g. `main`, `refs/pr/2`, etc.) or be "detached". Pruning the cache deletes all revisions not explicitly bound to a ref.

In practice, if a user has always downloaded from `main`, all revisions will be deleted except the latest one.

```sh
hf cache prune
```

```
About to delete 18 unreferenced revisions (2.4 GB total)
Proceed? [y/N]:
```
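The selection rule is simple: a revision is prunable when it is bound to no ref. A sketch, modeling revisions as `(commit_hash, refs)` pairs in place of the real objects returned by `huggingface_hub`'s cache scan:

```python
def prunable_revisions(revisions: list[tuple[str, frozenset[str]]]) -> list[str]:
    """Return commit hashes of detached revisions (empty ref set)."""
    return [commit for commit, refs in revisions if not refs]
```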
### Confirmation step / dry-run

Same as for `hf cache rm`:

```sh
hf cache prune -y
hf cache prune --dry-run
```
## `hf cache info`

Not sure about this one, but it could be useful for high-level info / a summary:

```sh
hf cache info
# Total cached: 45.2GB across 127 repos
# Models: 38.1GB (84%)
# Datasets: 7.1GB (16%)
```
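The aggregation behind such a summary is a straightforward fold over the scan results. A sketch using `(repo_type, size_in_bytes)` pairs in place of the real scanned repos:

```python
from collections import defaultdict

def summarize(repos: list[tuple[str, int]]) -> dict:
    """Total cache size, repo count, and per-type byte breakdown."""
    by_type: dict[str, int] = defaultdict(int)
    for repo_type, size in repos:
        by_type[repo_type] += size
    return {"total": sum(by_type.values()), "count": len(repos), "by_type": dict(by_type)}
```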
## Not tackled in this RFC

- In Select which files to delete in `huggingface-cli delete-cache` #2219, we got the feature request to delete specific files from the cache, not entire revisions or repos. This is useful if one wants to keep e.g. safetensors files but delete .bin ones after a `snapshot_download`. It would be great to find a way to nicely integrate it in the proposed CLI. A solution would be to have `hf cache ls --files`, but I'm worried it could lead to too many rows being printed (imagine a dataset with 100k files being printed out...). Maybe if `--files` is passed, we error out when more than 1k files would be listed? (But again, a bit clunky.)
- From 🚩 Scan-cache tool: select columns, sort by size, csv-style output #1024: we don't plan to add sorting options for now (can be done once this is implemented).