From 04e84f5e081b27222c21eb8808fc5f6edf03a999 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 22 Oct 2024 17:42:25 +0200
Subject: [PATCH 1/3] Update datasets-download-stats.md

---
 docs/hub/datasets-download-stats.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
index 4af7854f5..095a11d7a 100644
--- a/docs/hub/datasets-download-stats.md
+++ b/docs/hub/datasets-download-stats.md
@@ -2,7 +2,11 @@
 
 ## How are download stats generated for datasets?
 
-The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
+Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To avoid double counting downloads (e.g., counting a single download of a dataset as multiple downloads), the Hub counts as one download every series of files requests in an interval of 5 minutes per user. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads. Every HTTP request to the files, including `GET` and `HEAD`, will be counted as a download.
 
-* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
-* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
+## Before Setpember 2024
+
+The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub used to count every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This means that:
+
+* The download count was the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
+* If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count.

From edf34f191dabd36201d5d8bfb6cb8c796f69e131 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 22 Oct 2024 17:53:44 +0200
Subject: [PATCH 2/3] Update docs/hub/datasets-download-stats.md

Co-authored-by: Daniel van Strien
---
 docs/hub/datasets-download-stats.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
index 095a11d7a..183c52ebc 100644
--- a/docs/hub/datasets-download-stats.md
+++ b/docs/hub/datasets-download-stats.md
@@ -6,7 +6,7 @@ Counting the number of downloads for datasets is not a trivial task, as a single
 
 ## Before Setpember 2024
 
-The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub used to count every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This means that:
+The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub previously counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads. This means that:
 
 * The download count was the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
 * If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count.

From 572e6471abf2a3f5386bddfe26db6638d0aeca98 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 22 Oct 2024 17:54:56 +0200
Subject: [PATCH 3/3] Update docs/hub/datasets-download-stats.md

Co-authored-by: Daniel van Strien
---
 docs/hub/datasets-download-stats.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
index 183c52ebc..3f47b74b1 100644
--- a/docs/hub/datasets-download-stats.md
+++ b/docs/hub/datasets-download-stats.md
@@ -2,7 +2,7 @@
 
 ## How are download stats generated for datasets?
 
-Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To avoid double counting downloads (e.g., counting a single download of a dataset as multiple downloads), the Hub counts as one download every series of files requests in an interval of 5 minutes per user. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads. Every HTTP request to the files, including `GET` and `HEAD`, will be counted as a download.
+Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user within a 5-minute window as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.
 
 ## Before Setpember 2024
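
Note on the counting rule introduced by this series: the 5-minute grouping can be illustrated with a minimal Python sketch. This is not the Hub's server-side implementation, which the patch does not show. The function name `count_downloads`, the `(requester, dataset, timestamp)` tuple layout, and the choice of a fixed window that opens at the first request of a series are all assumptions made for illustration; in particular, how the Hub tells requesters apart is not stated in the patch.

```python
from collections import defaultdict

# Window length taken from the documentation text: 5 minutes.
WINDOW_SECONDS = 5 * 60


def count_downloads(requests, window_seconds=WINDOW_SECONDS):
    """Count dataset downloads from a log of file requests.

    `requests` is an iterable of (requester, dataset, timestamp_seconds)
    tuples. The layout and the `requester` key are hypothetical.
    Assumption: a window opens at the first request of a series and
    stays open for `window_seconds`; later requests by the same
    requester for the same dataset inside it add nothing to the count.
    """
    window_start = {}          # (requester, dataset) -> start of open window
    counts = defaultdict(int)  # dataset -> number of counted downloads

    # Process requests in chronological order.
    for requester, dataset, ts in sorted(requests, key=lambda r: r[2]):
        key = (requester, dataset)
        start = window_start.get(key)
        if start is None or ts - start >= window_seconds:
            # First request, or the previous window has expired:
            # open a new window and count one download.
            window_start[key] = ts
            counts[dataset] += 1

    return dict(counts)


# Hypothetical request log: requester "a" fetches three files within
# five minutes, then one more file after the window has expired.
example_log = [
    ("a", "squad", 0),
    ("a", "squad", 10),
    ("a", "squad", 299),
    ("a", "squad", 300),
    ("b", "squad", 5),
]
example_counts = count_downloads(example_log)
```

Under these assumptions, the three requests from `"a"` in the first five minutes count once, the request at 300 seconds opens a new window, and `"b"` is counted separately. A sliding window that resets on every request would be an equally plausible reading of the text and would give different counts for long-running downloads.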