Add docs on download stats (#1108)

osanseviero · davanstrien · julien-c · web-flow · commit 96624542423a · 2023-11-20T22:59:09.000+01:00
* Add docs on download stats

* Destroy the wall

* Apply suggestions from code review

Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Move to their own sections

* Open source all the way

* Update order and add to index

* Update datasets-download-stats.md

* Update models-download-stats.md

---------

Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
@@ -108,6 +108,8 @@
         title: Widget Examples
   - local: models-inference
     title: Inference API docs
+  - local: models-download-stats
+    title: Models Download Stats
   - local: models-faq
     title: Frequently Asked Questions
   - local: models-advanced
@@ -149,6 +151,8 @@
     sections:
       - local: datasets-viewer-configure
         title: Configure the Dataset Viewer
+  - local: datasets-download-stats
+    title: Datasets Download Stats
   - local: datasets-data-files-configuration
     title: Data files Configuration
     sections:
diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
@@ -0,0 +1,8 @@
+# Datasets Download Stats
+
+## How are download stats generated for datasets?
+
+The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
+
+* The download count is the same whether the data is stored directly in the Hub repo or the repository has a script that loads the data from an external source.
+* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
diff --git a/docs/hub/index.md b/docs/hub/index.md
@@ -31,6 +31,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-tasks">Tasks</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-widgets">Widgets</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-inference">Inference API</a>
+<a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./models-download-stats">Download Stats</a>
 </div>
 
 <div class="group flex flex-col space-y-2 rounded-xl border border-red-100 bg-gradient-to-br from-red-50 dark:bg-none px-6 py-4 transition-colors hover:shadow dark:border-red-700">
@@ -44,6 +45,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-downloading">Downloading Datasets</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-libraries">Libraries</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-viewer">Dataset Viewer</a>
+<a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-download-stats">Download Stats</a>
 <a class="!no-underline hover:opacity-60 transform transition-colors hover:translate-x-px" href="./datasets-data-files-configuration">Data files Configuration</a>
 </div>
 
diff --git a/docs/hub/models-download-stats.md b/docs/hub/models-download-stats.md
@@ -0,0 +1,157 @@
+# Models Download Stats
+
+## How are download stats generated for models?
+
+Counting model downloads is not trivial: a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting (e.g., counting a single download of a model as multiple downloads), the Hub counts downloads only through a set of query files. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads.
+
+Every HTTP request to these files, including `GET` and `HEAD`, is counted as a download. When no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on the library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+
+## Which are the query files for different libraries?
+
+By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, and `meta.yaml`. The following libraries have specific query files:
+
+```js
+{
+    "adapter-transformers": {
+        filter: [
+            {
+                term: { path: "adapter_config.json" },
+            },
+        ],
+    },
+    "asteroid": {
+        filter: [
+            {
+                term: { path: "pytorch_model.bin" },
+            },
+        ],
+    },
+    "flair": {
+        filter: [
+            {
+                term: { path: "pytorch_model.bin" },
+            },
+        ],
+    },
+    "keras": {
+        filter: [
+            {
+                term: { path: "saved_model.pb" },
+            },
+        ],
+    },
+    "ml-agents": {
+        filter: [
+            {
+                wildcard: { path: "*.onnx" },
+            },
+        ],
+    },
+    "nemo": {
+        filter: [
+            {
+                wildcard: { path: "*.nemo" },
+            },
+        ],
+    },
+    "open_clip": {
+        filter: [
+            {
+                wildcard: { path: "*pytorch_model.bin" },
+            },
+        ],
+    },
+    "sample-factory": {
+        filter: [
+            {
+                term: { path: "cfg.json" },
+            },
+        ],
+    },
+    "paddlenlp": {
+        filter: [
+            {
+                term: { path: "model_config.json" },
+            },
+        ],
+    },
+    "speechbrain": {
+        filter: [
+            {
+                term: { path: "hyperparams.yaml" },
+            },
+        ],
+    },
+    "sklearn": {
+        filter: [
+            {
+                term: { path: "sklearn_model.joblib" },
+            },
+        ],
+    },
+    "spacy": {
+        filter: [
+            {
+                wildcard: { path: "*.whl" },
+            },
+        ],
+    },
+    "stanza": {
+        filter: [
+            {
+                term: { path: "models/default.zip" },
+            },
+        ],
+    },
+    "stable-baselines3": {
+        filter: [
+            {
+                wildcard: { path: "*.zip" },
+            },
+        ],
+    },
+    "timm": {
+        filter: [
+            {
+                terms: { path: ["pytorch_model.bin", "model.safetensors"] },
+            },
+        ],
+    },
+    "diffusers": {
+        /// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib
+        must_not: [
+            {
+                wildcard: { path: "*/*.safetensors" },
+            },
+            {
+                wildcard: { path: "*/*.bin" },
+            },
+        ],
+        /// Include documents that match at least one of the following rules
+        should: [
+            /// Downloaded from diffusers lib
+            {
+                term: { path: "model_index.json" },
+            },
+            /// Direct downloads (LoRA, Auto1111 and others)
+            {
+                wildcard: { path: "*.safetensors" },
+            },
+            {
+                wildcard: { path: "*.ckpt" },
+            },
+            {
+                wildcard: { path: "*.bin" },
+            },
+        ],
+        minimum_should_match: 1,
+    },
+    "peft": {
+        filter: [
+            {
+                term: { path: "adapter_config.json" },
+            },
+        ],
+    }
+}
+```
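The rules above read like Elasticsearch-style boolean query clauses. As a rough illustration of how such rules could decide whether a requested file counts toward a library's downloads, here is a minimal Python sketch. It is not the Hub's implementation: the `RULES` dict copies only a few entries from the list, `clause_matches` and `is_query_file` are hypothetical helper names, and the sketch assumes that `path` is the file's path inside the repository, that `term`/`terms` mean exact string equality, and that `*` in a `wildcard` matches any run of characters, including `/`.

```python
import re

# A few of the rules listed above, rewritten as plain Python dicts.
RULES = {
    "peft": {"filter": [{"term": {"path": "adapter_config.json"}}]},
    "timm": {
        "filter": [{"terms": {"path": ["pytorch_model.bin", "model.safetensors"]}}]
    },
    "nemo": {"filter": [{"wildcard": {"path": "*.nemo"}}]},
    "diffusers": {
        "must_not": [
            {"wildcard": {"path": "*/*.safetensors"}},
            {"wildcard": {"path": "*/*.bin"}},
        ],
        "should": [
            {"term": {"path": "model_index.json"}},
            {"wildcard": {"path": "*.safetensors"}},
            {"wildcard": {"path": "*.ckpt"}},
            {"wildcard": {"path": "*.bin"}},
        ],
        "minimum_should_match": 1,
    },
}


def clause_matches(clause, path):
    """Evaluate one term / terms / wildcard clause against a file path."""
    if "term" in clause:
        # Assumption: exact string equality.
        return path == clause["term"]["path"]
    if "terms" in clause:
        # Assumption: the path must equal one of the listed values.
        return path in clause["terms"]["path"]
    if "wildcard" in clause:
        # Assumption: `*` matches any run of characters, `/` included.
        pattern = re.escape(clause["wildcard"]["path"]).replace(r"\*", ".*")
        return re.fullmatch(pattern, path) is not None
    raise ValueError(f"unsupported clause: {clause!r}")


def is_query_file(library, path):
    """Return True if `path` would count as a download under RULES[library]."""
    rule = RULES[library]
    # Any must_not hit excludes the file outright.
    if any(clause_matches(c, path) for c in rule.get("must_not", [])):
        return False
    # Every filter clause has to match.
    if not all(clause_matches(c, path) for c in rule.get("filter", [])):
        return False
    # At least `minimum_should_match` of the should clauses have to match.
    hits = sum(clause_matches(c, path) for c in rule.get("should", []))
    return hits >= rule.get("minimum_should_match", 0)


if __name__ == "__main__":
    for lib, path in [
        ("peft", "adapter_config.json"),
        ("diffusers", "model_index.json"),
        ("diffusers", "unet/diffusion_pytorch_model.safetensors"),
    ]:
        print(lib, path, is_query_file(lib, path))
```

Under these assumptions, a nested diffusers weight such as `unet/diffusion_pytorch_model.safetensors` is excluded by `must_not`, while `model_index.json` or a top-level `.safetensors` file satisfies `should`, so one pipeline download is not counted once per component.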
diff --git a/docs/hub/models-faq.md b/docs/hub/models-faq.md
@@ -1,4 +1,4 @@
-# Frequently Asked Questions
+# Models Frequently Asked Questions
 
 ## How can I see what dataset was used to train the model?
 
@@ -42,4 +42,4 @@ If the model card includes a link to a paper on arXiv, the Hugging Face Hub will
 <img class="hidden dark:block" width="300" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-arxiv-dark.png"/>
 </div>
 
-Read more about paper pages [here](./paper-pages).
+Read more about paper pages [here](./paper-pages).