
[RFC] Revamp hf cache #3432

@Wauplin

The current hf cache scan and hf cache delete commands are not very satisfying. The scan output is usually unreadable "as-is" because long lines wrap in the terminal (mostly due to long local paths being printed, which add little value). The delete UX is annoying as well, with manual steps required to select which revisions/repos to delete. These problems have been highlighted multiple times in the past.

The main proposal of this RFC is to organize the CLI around 3 commands, mostly inspired by the well-established docker CLI: hf cache ls, hf cache rm and hf cache prune.

The current hf cache scan and hf cache delete commands would be removed for the v1 release. We would also completely get rid of the optional dependency InquirerPy (and therefore of the huggingface_hub[cli] extra).

All the logic to achieve the features below already exists in huggingface_hub. It's just a matter of exposing it nicely in the UI.


hf cache ls

List all cached repos

Roughly equivalent to the current hf cache scan

hf cache ls

The output would be less verbose:

  • ID combines repo type and repo id (useful for later reuse)
  • nb files is removed
  • path is removed

ID                                                                        SIZE   LAST_ACCESSED LAST_MODIFIED REFS
dataset/IPEC-COMMUNITY/fractal20220817_data_lerobot                      87.4M  5 months ago  5 months ago  main
dataset/Narsil/image_dummy                                              292.4K  4 weeks ago   9 months ago  main
dataset/Wauplin/gpt-oss-rag-dev                                          84.0M  2 months ago  2 months ago  main
dataset/Wauplin/test-template                                             3.4K  5 weeks ago   2 months ago  main
dataset/__DUMMY_TRANSFORMERS_USER__/test-dataset-4ce69c-17585454341099   165.0  3 weeks ago   3 weeks ago   main
dataset/facebook/wiki_dpr                                                 8.6K  4 months ago  4 months ago  main
dataset/hf-internal-testing/_dummy                                         0.0  1 week ago    1 week ago
model/xai-org/grok-1                                                     18.9M  2 years ago   2 years ago   refs/pr/25, refs/pr/28
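
The SIZE column above uses short decimal suffixes (165.0, 3.4K, 87.4M). A minimal Python sketch of a formatter that produces this style, assuming power-of-1000 units; format_size is a hypothetical helper name used for illustration only:

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count with decimal (power-of-1000) suffixes."""
    value = float(num_bytes)
    for unit in ("", "K", "M", "G", "T"):
        if abs(value) < 1000.0:
            return f"{value:.1f}{unit}"
        value /= 1000.0
    # Anything larger than the listed units is reported in petabytes.
    return f"{value:.1f}P"


print(format_size(165))         # 165.0
print(format_size(3_400))       # 3.4K
print(format_size(87_400_000))  # 87.4M
```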

List all revisions

Roughly equivalent to the current hf cache scan --verbose

hf cache ls --revisions

Filter by total size

hf cache ls --filter "size>1000000"
hf cache ls --filter "size<1MB"

Let's say that filters are case insensitive.
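
As an illustration of the size filter, here is a small Python sketch that turns an expression like "size>1000000" or "size<1MB" into a predicate on a byte count. It assumes decimal units (1MB = 1,000,000 bytes) and case-insensitive matching; size_filter and the unit table are hypothetical names, not part of huggingface_hub:

```python
import operator
import re

# Assumption: decimal units, so 1MB == 1_000_000 bytes.
_UNITS = {"": 1, "B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}
_OPS = {">": operator.gt, "<": operator.lt}
_SIZE_FILTER_RE = re.compile(
    r"^\s*size\s*([<>])\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*$", re.IGNORECASE
)


def size_filter(expr: str):
    """Return a predicate taking a size in bytes, built from a filter string."""
    match = _SIZE_FILTER_RE.match(expr)
    if match is None:
        raise ValueError(f"Invalid size filter: {expr!r}")
    op, number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"Unknown size unit in filter: {expr!r}")
    threshold = int(float(number) * _UNITS[unit])
    compare = _OPS[op]
    return lambda size: compare(size, threshold)
```

For example, size_filter("size<1MB")(999_999) evaluates to True under these assumptions.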

Filter by last modified or accessed

Docker is able to handle many time representations ("7d", "2024-05-01", a timestamp, ISO date format, timezone, etc.). In practice it's already good enough if we can handle relative durations with the units s, m, h, d, w, mo and y (e.g. 10s), plus timestamps.

hf cache ls --filter "modified>7d"
hf cache ls --filter "accessed>1y"
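
A possible way to interpret the relative durations, sketched in Python. It assumes that "accessed>1y" means "last accessed more than one year ago", that a month is 30 days and a year is 365 days; parse_duration and is_older_than are hypothetical helper names:

```python
import re
import time

# Assumption: fixed-length months (30 days) and years (365 days).
_UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "mo": 30 * 86400,
    "y": 365 * 86400,
}
_DURATION_RE = re.compile(r"^\s*(\d+)\s*(mo|s|m|h|d|w|y)\s*$", re.IGNORECASE)


def parse_duration(text: str) -> int:
    """Convert a duration such as '7d' or '1y' into a number of seconds."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"Invalid duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit.lower()]


def is_older_than(timestamp: float, duration: str, now=None) -> bool:
    """True if `timestamp` lies more than `duration` before `now`."""
    now = time.time() if now is None else now
    return (now - timestamp) > parse_duration(duration)
```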

Filter by repo type

hf cache ls --filter "type=dataset"

Combine filters

e.g. "give me all models over 1MB and not accessed for a year"

hf cache ls --filter "type=model" --filter "size>1000000" --filter "accessed>1y"

Filters are processed as logical AND. Let's not support "OR".
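
The AND semantics can be sketched as one predicate per --filter value, all of which must hold. This Python example is only an illustration: the Repo dataclass is a hypothetical stand-in for a cache entry, and only the type and size keys (with raw byte values) are handled to keep it short:

```python
import operator
import re
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical stand-in for one cached repo."""
    repo_type: str
    repo_id: str
    size: int


_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}
_FILTER_RE = re.compile(r"^(\w+)\s*([<>=])\s*(.+)$")


def compile_filter(expr: str):
    """Turn a single 'key<op>value' expression into a predicate on Repo."""
    match = _FILTER_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"Invalid filter: {expr!r}")
    key, op, raw = match.groups()
    compare = _OPS[op]
    key = key.lower()
    if key == "type":
        expected = raw.strip().lower()
        return lambda repo: compare(repo.repo_type, expected)
    if key == "size":
        threshold = int(raw)
        return lambda repo: compare(repo.size, threshold)
    raise ValueError(f"Unsupported filter key: {key!r}")


def apply_filters(repos, exprs):
    """Keep only the repos that satisfy every filter (logical AND)."""
    predicates = [compile_filter(expr) for expr in exprs]
    return [repo for repo in repos if all(pred(repo) for pred in predicates)]
```

With no filters, all() over an empty list is True, so every repo is kept, which matches the unfiltered hf cache ls behavior.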

Quiet mode: print only ids

hf cache ls --filter "accessed>1y" -q
hf cache ls --filter "accessed>1y" --quiet

IDs would be something like this: model/meta-llama/Llama-2-70b-hf.

Custom format

Default output format is to print as a table. But one could want to have a CSV or JSON. Docker handles custom templates but we don't need that much flexibility.

hf cache ls --format json
hf cache ls --format csv
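
Both formats can be produced with the Python standard library alone. A minimal sketch, assuming each row is a plain dict; render and the column names are hypothetical and chosen for illustration:

```python
import csv
import io
import json


def render(rows, columns, fmt):
    """Serialize a list of dict rows as JSON or CSV, keeping only `columns`."""
    if fmt == "json":
        return json.dumps([{col: row[col] for col in columns} for row in rows], indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"Unsupported format: {fmt!r}")


rows = [{"id": "dataset/facebook/wiki_dpr", "size": 8600}]
print(render(rows, ["id", "size"], "csv"))
```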

hf cache rm

Delete specific revision(s)

hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e
hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e 1bb3f918c345c9d351dd5434c6fda5153506f8c5

Delete specific repo(s)

hf cache rm model/meta-llama/Llama-2-70b-hf
hf cache rm model/meta-llama/Llama-2-70b-hf dataset/facebook/wiki_dpr
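
Since hf cache rm accepts both revision hashes and repo IDs, the command has to tell them apart. One possible rule, sketched in Python: a 40-character lowercase hex string is treated as a revision, anything of the form type/repo_id as a repo. The accepted type list and the classify_target name are assumptions:

```python
import re

_COMMIT_HASH_RE = re.compile(r"^[0-9a-f]{40}$")
# Assumption: these are the repo type prefixes used in IDs.
_REPO_TYPES = ("model", "dataset", "space")


def classify_target(target: str):
    """Return (kind, repo_type, value) for one `hf cache rm` argument."""
    if _COMMIT_HASH_RE.match(target):
        return ("revision", None, target)
    repo_type, sep, repo_id = target.partition("/")
    if sep and repo_type in _REPO_TYPES and repo_id:
        return ("repo", repo_type, repo_id)
    raise ValueError(f"Not a revision hash or a repo ID: {target!r}")
```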

Delete repos based on a query

Same as with docker, we rely on the quiet mode to feed IDs to the command,
e.g. "delete all repos not accessed in the last year":

hf cache rm $(hf cache ls --filter "accessed>1y" -q)

or on unix:

hf cache ls --filter "accessed>1y" -q | xargs hf cache rm

Confirmation step / dry-run

It would be good to have a confirmation step by default.

hf cache rm ... -y

Alternatively (or in addition), we could have a dry-run mode:

hf cache rm ... --dry-run
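
How the two flags could interact, as a hedged Python sketch: --dry-run never deletes, -y skips the prompt, and otherwise the user must answer "y". The function name, its parameters and the prompt wording are assumptions made for the example:

```python
def run_rm(targets, delete, *, yes=False, dry_run=False, ask=input):
    """Delete `targets` via `delete(target)` and return what was deleted.

    `ask` is injectable so the prompt can be replaced in tests.
    """
    targets = list(targets)
    if dry_run:
        # Dry run: report nothing as deleted and never call `delete`.
        return []
    if not yes:
        answer = ask(f"About to delete {len(targets)} item(s). Proceed? [y/N]: ")
        if answer.strip().lower() not in ("y", "yes"):
            return []
    for target in targets:
        delete(target)
    return targets
```

Defaulting to "no" on an empty answer mirrors the [y/N] convention.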

hf cache prune

Delete all detached revisions

When downloading the same repo over time, the user might get several revisions in cache. Revisions can be linked to git refs (e.g. main, refs/pr/2, etc.) or "detached". Pruning the cache will delete all revisions not explicitly bound to a reference.

In practice, if a user has always downloaded from main, all revisions will be deleted except the last one.

hf cache prune
About to delete 18 unreferenced revisions (2.4 GB total)
Proceed? [y/N]:
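
The selection rule described above (keep revisions bound to at least one ref, drop the rest) can be sketched in Python as follows. Revision is a hypothetical stand-in for a cached revision, and the summary line simply mimics the example output:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical stand-in for one cached revision."""
    commit_hash: str
    size: int
    refs: frozenset = frozenset()


def select_detached(revisions):
    """Return the revisions that no ref (main, refs/pr/2, ...) points to."""
    return [rev for rev in revisions if not rev.refs]


def prune_summary(revisions):
    """Build the confirmation message for the detached revisions."""
    detached = select_detached(revisions)
    total = sum(rev.size for rev in detached)
    return (
        f"About to delete {len(detached)} unreferenced revisions "
        f"({total / 1e9:.1f} GB total)"
    )
```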

Confirmation step / dry-run

Same as for hf cache rm.

hf cache prune -y
hf cache prune --dry-run

hf cache info

Not sure about this one, but it can be useful for high-level info / a summary.

hf cache info
# Total cached: 45.2GB across 127 repos
# Models: 38.1GB (84%)
# Datasets: 7.1GB (16%)
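
The summary boils down to a per-type sum and a percentage of the total. A short Python sketch, assuming the input is a list of (repo_type, size_in_bytes) pairs; summarize is a hypothetical name:

```python
from collections import defaultdict


def summarize(repos):
    """Map each repo type to (total bytes, rounded percentage of the cache)."""
    per_type = defaultdict(int)
    for repo_type, size in repos:
        per_type[repo_type] += size
    total = sum(per_type.values())
    return {
        repo_type: (size, round(100 * size / total) if total else 0)
        for repo_type, size in per_type.items()
    }


print(summarize([("model", 38_100_000_000), ("dataset", 7_100_000_000)]))
```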

Not tackled in this RFC

  • In "Select which files to delete in huggingface-cli delete-cache" #2219, we got the feature request to delete specific files from the cache, not entire revisions or repos. This is useful if one wants to keep e.g. safetensors files but delete .bin ones after a snapshot_download. It would be great to find a way to nicely integrate it in the proposed CLI. A solution would be to have hf cache ls --files, but I'm worried it could lead to too many rows being printed out (imagine a 100k-file dataset being printed out...). Maybe if --files is passed, we could error out when more than 1k files would be listed? (but again, a bit clunky).
  • From "🚩 Scan-cache tool: select columns, sort by size, csv-style output" #1024: we don't plan to add sorting options for now (this can be done once this RFC is implemented).
