Commit e219cd3

Browse files
gshenirwedge
andauthored
Include pre-release download when summarizing download counts (#25)
* wip
* lint
* fix start date
* add project print
* fix print
* update message
* update to use pyarrow dtypes
* fix string
* update to ubuntu-latest-largeA
* update to ubuntu
* fix engine
* docstring
* use category dtype
* remove pyarrow
* fix ns
* lint
* use pyarrow everywhere
* remove pyarrow dtypes
* add readme instructions
* fix manual
* cleanup
* fix manual
* fix manual
* fix max_days
* fix docs
* wip
* wip
* wip
* fix workflow
* fix workflow
* add message to workflow
* cleanup
* fix repo
* fix slack msg
* fix slack msg
* use extensions
* summarize fix"
* use uv
* fix uv
* use cache
* change token
* add unit tests
* add unit workflow
* add dry-run
* remove unused arg
* fix dry run
* use uv in lint
* add date
* cleanup readme
* Rename daily_collect.yaml to daily_collection.yaml
* Update daily_collection.yaml
* Update daily_summarize.yaml
* Update dryrun.yaml
* Update lint.yaml
* Update manual.yaml
* Update unit.yaml
* wip
* Address feedback 2
* add version parse
* use object dtype
* fix local write
* lint
* Update daily_summarize.yaml
* cleanup
* wip
* exclude pre-releases
* wip
* cleanup
* cleanup
* cleanup
* cleanup
* update workflow
* rename workflow
* fix unit tests
* define cache break
* force reinstall
* remove force install
* lint
* fix summarize config
* cleanup
* fix dry run
* Update dryrun.yaml
* Update dryrun.yaml
* fix dry run
* fix dry run
* fix dry run
* fix write
* remove breakpoint
* Update download_analytics/time_utils.py

Co-authored-by: Roy Wedge <[email protected]>

* fix tz
* update readme and add parameter
* cleanup
* cleanup
* cleanup
* cleanup
* img
* fix based on feedback
* fix workflow
* Update summarize.py
* Update README.md
* Update summarize.py

---------

Co-authored-by: Roy Wedge <[email protected]>
1 parent 90f4be5 commit e219cd3

File tree

3 files changed: +129 −54 lines changed

README.md

Lines changed: 81 additions & 48 deletions
@@ -1,63 +1,96 @@
 # Download Analytics
 
-The Download Analytics project allows you to extract download metrics from a Python library published on [PyPI](https://pypi.org/).
+The Download Analytics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).
 
-## Overview
+The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.
 
+## Overview
 The Download Analytics project is a collection of scripts and tools to extract information
-about OSS project downloads from diffierent sources and to analyze them to produce user
+about OSS project downloads from different sources and to analyze them to produce user
 engagement metrics.
 
-### Data sources
+### Data Sources
+Currently, the download data is collected from the following distributions:
+* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/),
+obtained from the public BigQuery dataset, equivalent to the information shown on
+[pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/).
+    - More information about the BigQuery dataset can be found in the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/).
 
-Currently the download data is collected from the following distributions:
+* [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
+    - The conda package download data is provided by Anaconda, Inc. It includes package download counts
+starting from January 2017. More information about this dataset can be found in the [official README.md](https://github.com/anaconda/anaconda-package-data/blob/master/README.md).
+    - Additional conda package downloads are retrieved using the public API provided by Anaconda. This allows for the retrieval of the current number of downloads for each file served.
+        - Anaconda API Endpoint: https://api.anaconda.org/package/{username}/{package_name}
+        - Replace `{username}` with the Anaconda channel (`conda-forge`)
+        - Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
+        - For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download record can be built.
+    - Both of these sources are used to track Anaconda downloads because the package data for Anaconda does not match the download count on the website, due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
 
-* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
-obtained from the public Big Query dataset, equivalent to the information shown on
-[pepy.tech](https://pepy.tech).
-* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
-`conda-forge` channel on `conda`.
-    - The conda package download data provided by Anaconda. It includes package download counts
-starting from January 2017. More information:
-        - https://github.com/anaconda/anaconda-package-data
-    - The conda package metadata data provided by Anaconda. There is a public API which allows for
-the retrieval of package information, including current number of downloads.
-        - https://api.anaconda.org/package/{username}/{package_name}
-        - Replace {username} with the Anaconda username (`conda-forge`) and {package_name} with
-the specific package name (`sdv`).
-
-In the future, we may also expand the source distributions to include:
-
-* [github](https://github.com/): Information about the project downloads from github releases.
-
-For more information about how to configure and use the software, or about the data that is being
-collected check the resources below.
-
-### Add new libraries
-In order add new libraries, it is important to follow these steps to ensure that data is backfilled.
-1. Update `config.yaml` with the new libraries (pypi project names only for now)
-2. Run the [Manual collection workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml) on your branch.
-    - Use workflow from **your branch name**.
-    - List all project names from config.yaml
-    - Remove `7` from max days to indicate you want all data
-    - Pass any extra arguments (for example `--dry-run` to test your changes)
-3. Let the workflow finish and check that pypi.csv contains the right data.
-4. Get your pull request reviewed and merged into `main`. The daily collection workflow will fill the data for the last 30 days and future days.
-    - Note: The collection script looks at timestamps and avoids adding overlapping data.
-
-### Metrics
-This library collects the number of downloads for your chosen software. You can break these up along several dimensions:
-
-- **By Month**: The number of downloads per month
-- **By Version**: The number of downloads per version of the software, as determine by the software maintainers
-- **By Python Version**: The number of downloads per minor Python version (eg. 3.8)
-- **And more!** See the resources below for more information.
+### Future Data Sources
+In the future, we may expand the source distributions to include:
+* [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.
+
+## Workflows
+
+### Daily Collection
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
+
+#### Metrics
+The PyPI download metrics are computed along several dimensions:
+
+- **By Month**: The number of downloads per month.
+- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
+- **By Python Version**: The number of downloads per minor Python version (e.g. 3.8).
+- **By Full Python Version**: The number of downloads per full Python version (e.g. 3.9.1).
+- **And more!**
+
+### Daily Summarize
+
+On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries.
+
+The summarized data is uploaded to a GitHub repo:
+- [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)
+
+#### SDV Calculation
+Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:
+
+1. Get download counts for `sdgym` and `sdv`.
+2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
+3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
+4. Adjust downloads for each dependency by subtracting the `sdv` download count.
+5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.
+
+This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
 
 ## Resources
-For more information about the configuration, workflows and metrics, see the resources below.
+For more information about the configuration, workflows, and metrics, see the resources below.
 | | Document | Description |
 | ------------- | ----------------------------------- | ----------- |
-| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the Github actions. |
-| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to Github Actions. |
+| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the GitHub actions. |
+| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to GitHub Actions. |
 | :keyboard: | [DEVELOPMENT](docs/DEVELOPMENT.md) | How to install and run the scripts locally. Overview of the project implementation. |
 | :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |
+
+
+---
+
+<div align="center">
+<a href="https://datacebo.com"><picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo-dark-mode.png">
+<img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo.png"></img>
+</picture></a>
+</div>
+<br/>
+<br/>
+
+[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](
+https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
+created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
+Today, DataCebo is the proud developer of SDV, the largest ecosystem for
+synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
+data, including:
+
+* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
+* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
+multi table and time series data.
+* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
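As a companion to the Anaconda API description added to the README above, here is a minimal sketch of how the per-file counts returned by that endpoint could be summed into a single package total. It is illustrative only: `get_conda_download_count` is a hypothetical helper (not part of this repo), and the `files`/`ndownloads` field names are assumptions about the anaconda.org response shape.

```python
import requests


def get_conda_download_count(channel, package):
    """Sum the current download counts across every file served for a package.

    Hypothetical helper; 'files' and 'ndownloads' are assumed response fields.
    """
    url = f'https://api.anaconda.org/package/{channel}/{package}'
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Assumed response shape: a 'files' list with one entry per artifact
    # (version/platform combination), each carrying an 'ndownloads' count.
    return sum(file_info.get('ndownloads', 0) for file_info in response.json()['files'])


# Example: current total downloads of sdv from the conda-forge channel.
print(get_conda_download_count('conda-forge', 'sdv'))
```

Snapshotting this total on a schedule is what allows the historical download record mentioned in the README to be built up over time.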

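The "SDV Calculation" steps above can also be read as a short worked example. The sketch below is a hypothetical rendering of that methodology, not the repo's actual implementation (which lives in `summarize.py` and may differ); in particular, it assumes the raw `sdv` count, not the adjusted one, is what gets subtracted from each dependency.

```python
def exclusive_sdv_counts(raw_counts):
    """Apply the exclusive download methodology described in the README.

    raw_counts maps each PyPI project name to its raw download count.
    """
    adjusted = {'sdgym': raw_counts['sdgym']}

    # sdgym installs sdv as a dependency, so its downloads are removed
    # from the sdv total to avoid double counting.
    adjusted['sdv'] = max(0, raw_counts['sdv'] - raw_counts['sdgym'])

    # Every sdv install also downloads the direct dependencies, so the raw
    # sdv count is subtracted from each of them, clamped at zero.
    for dep in ('rdt', 'copulas', 'ctgan', 'deepecho', 'sdmetrics'):
        adjusted[dep] = max(0, raw_counts[dep] - raw_counts['sdv'])

    return adjusted


counts = {'sdv': 900, 'sdgym': 100, 'rdt': 1200, 'copulas': 950,
          'ctgan': 1000, 'deepecho': 910, 'sdmetrics': 1100}
print(exclusive_sdv_counts(counts))
# {'sdgym': 100, 'sdv': 800, 'rdt': 300, 'copulas': 50,
#  'ctgan': 100, 'deepecho': 10, 'sdmetrics': 200}
```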
download_analytics/output.py

Lines changed: 0 additions & 3 deletions
@@ -5,7 +5,6 @@
 import pathlib
 
 import pandas as pd
-from packaging.version import parse
 
 from download_analytics import drive
 
@@ -178,8 +177,6 @@ def load_csv(csv_path, read_csv_kwargs=None):
             data = pd.read_csv(stream, **read_csv_kwargs)
         else:
             data = pd.read_csv(csv_path, **read_csv_kwargs)
-            if 'version' in data.columns:
-                data['version'] = data['version'].apply(parse)
     except FileNotFoundError:
         LOGGER.info('Failed to load CSV file %s: not found', csv_path)
         return None
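This change moves version parsing out of `load_csv` and into the summarize step (see `get_previous_pypi_downloads` below). For context, a small sketch of the `packaging.version` behavior the summarize code relies on once the column holds `Version` objects:

```python
from packaging.version import parse

# Version objects compare by release ordering, not lexicographically:
# as plain strings, '1.10.0' < '1.9.0' because '1' sorts before '9'.
assert parse('1.10.0') > parse('1.9.0')

# is_prerelease is the flag used by the new exclude_prereleases option.
assert parse('2.0.0rc1').is_prerelease
assert not parse('2.0.0').is_prerelease
```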

download_analytics/summarize.py

Lines changed: 48 additions & 3 deletions
@@ -4,7 +4,7 @@
 import os
 
 import pandas as pd
-from packaging.version import Version
+from packaging.version import Version, parse
 
 from download_analytics.output import append_row, create_spreadsheet, get_path, load_csv
 from download_analytics.time_utils import get_current_year, get_min_max_dt_in_year
@@ -42,7 +42,30 @@ def _calculate_projects_count(
     min_datetime=None,
     version=None,
     version_operator=None,
+    exclude_prereleases=False,
 ):
+    """Get the number of PyPI downloads for the specified project(s).
+
+    Args:
+        downloads (pd.DataFrame): PyPI download data. It must contain the project, version,
+            and timestamp columns. The version column must contain packaging Version objects.
+        projects (str, tuple(str), list[str]): The project name or list of project names to
+            filter the downloads for.
+        max_datetime (datetime): The maximum datetime to include downloads for (inclusive).
+            Downloads after this datetime will be excluded.
+        min_datetime (datetime): The minimum datetime to include downloads for (inclusive).
+            Downloads before this datetime will be excluded.
+        version (str): The version string to compare against when filtering by version.
+            Must be used in conjunction with version_operator.
+        version_operator (str): The comparison operator to use with version filtering.
+            Supported operators: '<=', '>', '>=', '<'. Must be used in conjunction with version.
+        exclude_prereleases (bool): If True, exclude pre-release versions from the count.
+            Defaults to False, which means downloads for pre-releases are included.
+
+    Returns:
+        int: The number of downloads matching the specified criteria.
+
+    """
     if isinstance(projects, str):
         projects = (projects,)
 
@@ -51,12 +74,23 @@ def _calculate_projects_count(
         project_downloads = project_downloads[project_downloads['version'] <= Version(version)]
     if version and version_operator and version_operator == '>':
         project_downloads = project_downloads[project_downloads['version'] > Version(version)]
+    if version and version_operator and version_operator == '>=':
+        project_downloads = project_downloads[project_downloads['version'] >= Version(version)]
+    if version and version_operator and version_operator == '<':
+        project_downloads = project_downloads[project_downloads['version'] < Version(version)]
 
     if max_datetime:
         project_downloads = project_downloads[project_downloads['timestamp'] <= max_datetime]
     if min_datetime:
         project_downloads = project_downloads[project_downloads['timestamp'] >= min_datetime]
 
+    if exclude_prereleases is True:
+        LOGGER.info(f'Excluding pre-release downloads for {projects}')
+        project_downloads = project_downloads[
+            ~project_downloads['version'].apply(lambda v: v.is_prerelease)
+        ]
+    else:
+        LOGGER.info(f'Including pre-release downloads for {projects}')
     return len(project_downloads)
 
 
@@ -77,7 +111,15 @@ def _sum_counts(base_count, dep_to_count, parent_to_count):
 
 
 def get_previous_pypi_downloads(input_file, output_folder):
-    """Read pypi.csv and return a DataFrame of the downloads."""
+    """Read pypi.csv and return a DataFrame of the downloads.
+
+    Args:
+        input_file (str): Location of the pypi.csv to use as the previous downloads.
+
+        output_folder (str): If input_file is None, this directory location must contain
+            a pypi.csv file to use.
+
+    """
     csv_path = input_file or get_path(output_folder, 'pypi.csv')
     read_csv_kwargs = {
         'parse_dates': ['timestamp'],
@@ -96,7 +138,10 @@ def get_previous_pypi_downloads(input_file, output_folder):
             'cpu': pd.CategoricalDtype(),
         },
     }
-    return load_csv(csv_path, read_csv_kwargs=read_csv_kwargs)
+    data = load_csv(csv_path, read_csv_kwargs=read_csv_kwargs)
+    LOGGER.info('Parsing version column to Version class objects')
+    data['version'] = data['version'].apply(parse)
+    return data
 
 
 def _ecosystem_count_by_year(downloads, base_project, dependency_projects, parent_projects):
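To make the new `exclude_prereleases` flag concrete, here is a hedged usage sketch. The positional argument order and the `project` column name are assumptions based on the docstring and the surrounding code, since the full signature and the project filter fall outside this diff:

```python
import pandas as pd
from packaging.version import parse

from download_analytics.summarize import _calculate_projects_count

# Three hypothetical download rows for sdv, one of them a pre-release.
downloads = pd.DataFrame({
    'project': ['sdv', 'sdv', 'sdv'],
    'version': [parse('1.0.0'), parse('1.1.0rc1'), parse('1.1.0')],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
})

# Default behavior counts every row, pre-releases included.
assert _calculate_projects_count(downloads, 'sdv') == 3

# With exclude_prereleases=True, the 1.1.0rc1 row is filtered out.
assert _calculate_projects_count(downloads, 'sdv', exclude_prereleases=True) == 2
```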
