Commit 04e84f5

Update datasets-download-stats.md
1 parent 781aced commit 04e84f5

1 file changed, +7 -3 lines changed

docs/hub/datasets-download-stats.md

Lines changed: 7 additions & 3 deletions
@@ -2,7 +2,11 @@

 ## How are download stats generated for datasets?

-The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
+Counting downloads for datasets is not trivial: a single dataset repository might contain multiple files across several subsets and splits (e.g. train/validation/test), and sometimes many files in a single split. To avoid double counting (e.g. counting a single download of a dataset as multiple downloads), the Hub counts every series of file requests made by a user within a 5-minute interval as one download. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for download. Every HTTP request to the files, including `GET` and `HEAD`, will be counted as a download.

-* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
-* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
+## Before September 2024
+
+The Hub used to provide download stats only for datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This meant that:
+
+* The download count was the same regardless of whether the data was directly stored on the Hub repo or the repository had a [script](/docs/datasets/dataset_script) to load the data from an external source.
+* If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count.
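
The current counting rule in the updated text (one download per user for each series of file requests within 5 minutes) can be illustrated with a small sketch. This is not the Hub's implementation: the function name `count_downloads`, the `(user_id, timestamp)` request format, and the choice of a fixed window that opens at a user's first request are all assumptions made for illustration, and the sketch covers requests to a single dataset only.

```python
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 5 * 60  # the 5-minute interval mentioned in the text


def count_downloads(requests: Iterable[Tuple[str, float]]) -> int:
    """Count downloads from (user_id, timestamp_in_seconds) file requests.

    Assumption (hypothetical, not confirmed by the source): a window opens
    at a user's first request and lasts WINDOW_SECONDS; further requests
    from the same user inside that window are not counted again.
    """
    window_start: Dict[str, float] = {}  # user_id -> start of current window
    downloads = 0
    # Process requests in chronological order.
    for user_id, ts in sorted(requests, key=lambda r: r[1]):
        start = window_start.get(user_id)
        if start is None or ts - start >= WINDOW_SECONDS:
            # First request from this user, or the previous window expired.
            window_start[user_id] = ts
            downloads += 1
    return downloads


requests = [
    ("alice", 0.0),    # first file of the dataset
    ("alice", 12.0),   # second file, same 5-minute window
    ("bob", 30.0),     # a different user
    ("alice", 400.0),  # more than 5 minutes after alice's window opened
]
print(count_downloads(requests))  # prints 3
```

In this example the two early requests from `alice` collapse into one download, `bob` adds a second, and the late request from `alice` opens a new window and adds a third. A sliding window (measured from the user's most recent request) would be an equally plausible reading of the text.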
