Commit 9662454

osanseviero, davanstrien, and julien-c authored
Add docs on download stats (#1108)
* Add docs on download stats
* Destroy the wall
* Apply suggestions from code review

  Co-authored-by: Daniel van Strien <[email protected]>
  Co-authored-by: Julien Chaumond <[email protected]>
* Move to their own sections
* Open source all the way
* Update order and add to index
* Update datasets-download-stats.md
* Update models-download-stats.md

---------

Co-authored-by: Daniel van Strien <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
1 parent 5e1389e commit 9662454

5 files changed, with 173 additions and 2 deletions

docs/hub/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -108,6 +108,8 @@
   title: Widget Examples
 - local: models-inference
   title: Inference API docs
+- local: models-download-stats
+  title: Models Download Stats
 - local: models-faq
   title: Frequently Asked Questions
 - local: models-advanced
@@ -149,6 +151,8 @@
 sections:
 - local: datasets-viewer-configure
   title: Configure the Dataset Viewer
+- local: datasets-download-stats
+  title: Datasets Download Stats
 - local: datasets-data-files-configuration
   title: Data files Configuration
 sections:

docs/hub/datasets-download-stats.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Datasets Download Stats
+
+## How are download stats generated for datasets?
+
+The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
+
+* The download count is the same regardless of whether the data is stored directly in the Hub repository or the repository has a script that loads the data from an external source.
+* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads are not included in the download count.
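
As a rough illustration of the counting rules described in this file, the sketch below tallies dataset downloads from a list of served-file records. The `FileRequest` record and its `via_load_dataset` and `from_hf_ci` flags are hypothetical: the documentation does not say how the Hub recognizes a `load_dataset` call or CI traffic, so this only models the stated outcome, not the actual server-side mechanism.

```python
from dataclasses import dataclass


@dataclass
class FileRequest:
    """Hypothetical record of one served file request (not a real Hub type)."""
    dataset: str
    via_load_dataset: bool  # assumption: the server can tell the request came from load_dataset
    from_hf_ci: bool        # assumption: the server can tell the request came from Hugging Face CI


def count_dataset_downloads(requests):
    """Count one download per load_dataset request, skipping CI traffic."""
    counts = {}
    for request in requests:
        # Manual downloads (wget, Hub UI) are not load_dataset calls, so they are skipped.
        if request.via_load_dataset and not request.from_hf_ci:
            counts[request.dataset] = counts.get(request.dataset, 0) + 1
    return counts
```

Under these assumptions, a `wget` download or a CI run leaves the count unchanged, while each `load_dataset` call adds one regardless of where the data itself is hosted.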

docs/hub/index.md

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-tasks">Tasks</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-widgets">Widgets</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-inference">Inference API</a>
+<a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-download-stats">Download Stats</a>
 </div>
 
 <div class="group flex flex-col space-y-2 rounded-xl border border-red-100 bg-gradient-to-br from-red-50 dark:bg-none px-6 py-4 transition-colors hover:shadow dark:border-red-700">
@@ -44,6 +45,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-downloading">Downloading Datasets</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-libraries">Libraries</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-viewer">Dataset Viewer</a>
+<a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-download-stats">Download Stats</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-data-files-configuration">Data files Configuration</a>
 </div>

docs/hub/models-download-stats.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
+# Models Download Stats
+
+## How are download stats generated for models?
+
+Counting the number of downloads for models is not a trivial task, as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models) and different formats depending on the library. To avoid double counting (e.g., counting a single download of a model as multiple downloads), the Hub counts downloads using a set of query files. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads.
+
+Every HTTP request to these files, including `GET` and `HEAD`, is counted as a download. When no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on the library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+
+## Which are the query files for different libraries?
+
+By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, and `meta.yaml`. The following libraries have specific query files:
+
```json
14+
{
15+
"adapter-transformers": {
16+
filter: [
17+
{
18+
term: { path: "adapter_config.json" },
19+
},
20+
],
21+
},
22+
"asteroid": {
23+
filter: [
24+
{
25+
term: { path: "pytorch_model.bin" },
26+
},
27+
],
28+
},
29+
"flair": {
30+
filter: [
31+
{
32+
term: { path: "pytorch_model.bin" },
33+
},
34+
],
35+
},
36+
"keras": {
37+
filter: [
38+
{
39+
term: { path: "saved_model.pb" },
40+
},
41+
],
42+
},
43+
"ml-agents": {
44+
filter: [
45+
{
46+
wildcard: { path: "*.onnx" },
47+
},
48+
],
49+
},
50+
"nemo": {
51+
filter: [
52+
{
53+
wildcard: { path: "*.nemo" },
54+
},
55+
],
56+
},
57+
"open_clip": {
58+
filter: [
59+
{
60+
wildcard: { path: "*pytorch_model.bin" },
61+
},
62+
],
63+
},
64+
"sample-factory": {
65+
filter: [
66+
{
67+
term: { path: "cfg.json" },
68+
},
69+
],
70+
},
71+
"paddlenlp": {
72+
filter: [
73+
{
74+
term: { path: "model_config.json" },
75+
},
76+
],
77+
},
78+
"speechbrain": {
79+
filter: [
80+
{
81+
term: { path: "hyperparams.yaml" },
82+
},
83+
],
84+
},
85+
"sklearn": {
86+
filter: [
87+
{
88+
term: { path: "sklearn_model.joblib" },
89+
},
90+
],
91+
},
92+
"spacy": {
93+
filter: [
94+
{
95+
wildcard: { path: "*.whl" },
96+
},
97+
],
98+
},
99+
"stanza": {
100+
filter: [
101+
{
102+
term: { path: "models/default.zip" },
103+
},
104+
],
105+
},
106+
"stable-baselines3": {
107+
filter: [
108+
{
109+
wildcard: { path: "*.zip" },
110+
},
111+
],
112+
},
113+
"timm": {
114+
filter: [
115+
{
116+
terms: { path: ["pytorch_model.bin", "model.safetensors"] },
117+
},
118+
],
119+
},
120+
"diffusers": {
121+
/// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib
122+
must_not: [
123+
{
124+
wildcard: { path: "*/*.safetensors" },
125+
},
126+
{
127+
wildcard: { path: "*/*.bin" },
128+
},
129+
],
130+
/// Include documents that match at least one of the following rules
131+
should: [
132+
/// Downloaded from diffusers lib
133+
{
134+
term: { path: "model_index.json" },
135+
},
136+
/// Direct downloads (LoRa, Auto1111 and others)
137+
{
138+
wildcard: { path: "*.safetensors" },
139+
},
140+
{
141+
wildcard: { path: "*.ckpt" },
142+
},
143+
{
144+
wildcard: { path: "*.bin" },
145+
},
146+
],
147+
minimum_should_match: 1,
148+
},
149+
"peft": {
150+
filter: [
151+
{
152+
term: { path: "adapter_config.json" },
153+
},
154+
],
155+
}
156+
}
157+
```
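
To make the rules added in this file easier to read, here is a minimal Python sketch of how such a rule set could decide whether a request counts as a download. It is an illustration, not the Hub's implementation: `QUERY_RULES` is a trimmed copy of the table above, the function names are hypothetical, and it assumes that `*` in a `wildcard` matches any run of characters, including `/`. The default list follows the four files named in the "Which are the query files" section.

```python
from fnmatch import fnmatchcase

# Trimmed, illustrative copy of the per-library rules from the docs above.
QUERY_RULES = {
    "peft": {"filter": [{"term": {"path": "adapter_config.json"}}]},
    "timm": {"filter": [{"terms": {"path": ["pytorch_model.bin", "model.safetensors"]}}]},
    "nemo": {"filter": [{"wildcard": {"path": "*.nemo"}}]},
    "diffusers": {
        "must_not": [
            {"wildcard": {"path": "*/*.safetensors"}},
            {"wildcard": {"path": "*/*.bin"}},
        ],
        "should": [
            {"term": {"path": "model_index.json"}},
            {"wildcard": {"path": "*.safetensors"}},
            {"wildcard": {"path": "*.ckpt"}},
            {"wildcard": {"path": "*.bin"}},
        ],
        "minimum_should_match": 1,
    },
}

# Default query files used when the library has no specific rule (per the docs above).
DEFAULT_QUERY_FILES = ["config.json", "config.yaml", "hyperparams.yaml", "meta.yaml"]


def clause_matches(clause, path):
    """Evaluate a single term / terms / wildcard clause against a file path."""
    if "term" in clause:
        return path == clause["term"]["path"]
    if "terms" in clause:
        return path in clause["terms"]["path"]
    if "wildcard" in clause:
        # Assumption: '*' matches any characters, including '/'.
        return fnmatchcase(path, clause["wildcard"]["path"])
    raise ValueError(f"unsupported clause: {clause!r}")


def is_query_file(library, path):
    """Return True if a request for `path` should count as a download."""
    rules = QUERY_RULES.get(library)
    if rules is None:
        return path in DEFAULT_QUERY_FILES
    if any(clause_matches(c, path) for c in rules.get("must_not", [])):
        return False
    if not all(clause_matches(c, path) for c in rules.get("filter", [])):
        return False
    should = rules.get("should", [])
    if should:
        hits = sum(clause_matches(c, path) for c in should)
        return hits >= rules.get("minimum_should_match", 1)
    return True


def count_downloads(library, requests):
    """Count GET and HEAD requests (method, path) that hit a query file."""
    return sum(
        1
        for method, path in requests
        if method in {"GET", "HEAD"} and is_query_file(library, path)
    )
```

With these assumptions, a `diffusers` pipeline load that fetches `model_index.json` plus nested weights such as `unet/diffusion_pytorch_model.safetensors` counts once, because the `must_not` wildcards exclude the nested files, while a direct download of a top-level `.safetensors` file still counts.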

docs/hub/models-faq.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Frequently Asked Questions
+# Models Frequently Asked Questions
 
 ## How can I see what dataset was used to train the model?
 
@@ -42,4 +42,4 @@ If the model card includes a link to a paper on arXiv, the Hugging Face Hub will
 <img class="hidden dark:block" width="300" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-arxiv-dark.png"/>
 </div>
 
-Read more about paper pages [here](./paper-pages).
+Read more about paper pages [here](./paper-pages).
