Merged
111 commits
7564a82
wip
gsheni Jun 3, 2025
b4bcf87
lint
gsheni Jun 3, 2025
cfa02fa
fix start date
gsheni Jun 3, 2025
18bbb95
add project print
gsheni Jun 3, 2025
059e9ac
fix print
gsheni Jun 3, 2025
c735af7
update message
gsheni Jun 3, 2025
f1d9a31
update to use pyarrow dtypes
gsheni Jun 3, 2025
f73291f
fix string
gsheni Jun 3, 2025
20a0108
update to ubuntu-latest-largeA
gsheni Jun 3, 2025
ecdc74d
update to ubuntu
gsheni Jun 3, 2025
a694d4c
fix engine
gsheni Jun 3, 2025
420cf09
docstring
gsheni Jun 3, 2025
e6b751b
use category dtype
gsheni Jun 3, 2025
4924d5b
remove pyarrow
gsheni Jun 3, 2025
3c380e4
fix ns
gsheni Jun 3, 2025
21c4701
lint
gsheni Jun 3, 2025
b71f64b
use pyarrow everywhere
gsheni Jun 3, 2025
6384bc9
remove pyarrow dtypes
gsheni Jun 3, 2025
dac71d5
add readme instructions
gsheni Jun 3, 2025
e195d6e
fix manual
gsheni Jun 3, 2025
b4aca6d
cleanup
gsheni Jun 3, 2025
e40da58
fix manual
gsheni Jun 3, 2025
70d339e
fix manual
gsheni Jun 3, 2025
e0f728f
fix max_days
gsheni Jun 3, 2025
b8b9741
fix docs
gsheni Jun 3, 2025
473e19b
wip
gsheni Jun 4, 2025
4f75889
Merge branch 'main' of https://github.com/datacebo/download-analytics…
gsheni Jun 4, 2025
193b842
wip
gsheni Jun 5, 2025
80628e8
wip
gsheni Jun 6, 2025
c9372f7
pull main
gsheni Jun 6, 2025
e912721
pull main
gsheni Jun 6, 2025
3aa5abf
fix workflow
gsheni Jun 6, 2025
9511379
fix workflow
gsheni Jun 6, 2025
3b86fee
add message to workflow
gsheni Jun 6, 2025
9e3fc48
cleanup
gsheni Jun 6, 2025
aaa0c29
fix repo
gsheni Jun 6, 2025
7c7d615
fix slack msg
gsheni Jun 6, 2025
430b5f6
fix slack msg
gsheni Jun 6, 2025
1fc2573
use extensions
gsheni Jun 6, 2025
d386d47
summarize fix"
gsheni Jun 6, 2025
bc163eb
use uv
gsheni Jun 6, 2025
c1d3370
fix uv
gsheni Jun 6, 2025
3cc2fe6
use cache
gsheni Jun 6, 2025
8064477
change token
gsheni Jun 6, 2025
b53c6bf
add unit tests
gsheni Jun 6, 2025
a4001b4
add unit workflow
gsheni Jun 6, 2025
e06bf05
add dry-run
gsheni Jun 6, 2025
b11083c
remove unused arg
gsheni Jun 6, 2025
e8bffc9
fix dry run
gsheni Jun 6, 2025
77658fe
use uv in lint
gsheni Jun 6, 2025
21ad83a
add date
gsheni Jun 6, 2025
53a197b
cleanup readme
gsheni Jun 6, 2025
7a2c11e
Rename daily_collect.yaml to daily_collection.yaml
gsheni Jun 6, 2025
b518f22
Update daily_collection.yaml
gsheni Jun 6, 2025
3808ffa
Update daily_summarize.yaml
gsheni Jun 6, 2025
e6d0177
Update dryrun.yaml
gsheni Jun 6, 2025
ce1ed90
Update lint.yaml
gsheni Jun 6, 2025
4106bfe
Update manual.yaml
gsheni Jun 6, 2025
39c0abe
Update unit.yaml
gsheni Jun 6, 2025
c8ffbb4
wip
gsheni Jun 9, 2025
f18f836
Address feedback 2
gsheni Jun 10, 2025
5de3e1f
add version parse
gsheni Jun 10, 2025
cddcdce
use object dtype
gsheni Jun 10, 2025
8730c66
fix local write
gsheni Jun 10, 2025
2e8586a
lint
gsheni Jun 10, 2025
8803f0e
Merge branch '13-add-a-daily-workflow-to-summarize-download-counts-fo…
gsheni Jun 10, 2025
f357ed6
Update daily_summarize.yaml
gsheni Jun 10, 2025
74a95f1
cleanup
gsheni Jun 10, 2025
60e4cf8
wip
gsheni Jun 11, 2025
7dc033d
exclude pre-releases
gsheni Jun 11, 2025
3debe90
wip
gsheni Jun 11, 2025
ec7740c
cleanup
gsheni Jun 11, 2025
6bf1852
cleanup
gsheni Jun 11, 2025
80153c7
cleanup
gsheni Jun 11, 2025
860b5d5
cleanup
gsheni Jun 11, 2025
ff200e3
fix conflicts
gsheni Jun 11, 2025
0ada91e
update workflow
gsheni Jun 11, 2025
46e126a
rename workflow
gsheni Jun 11, 2025
5c7ffe1
fix unit tests
gsheni Jun 11, 2025
8f00d16
define cache break
gsheni Jun 11, 2025
6b9dc1a
force reinstall
gsheni Jun 11, 2025
84ce344
remove force install
gsheni Jun 11, 2025
140825e
lint
gsheni Jun 11, 2025
a806920
fix summarize config
gsheni Jun 11, 2025
d459539
cleanup
gsheni Jun 11, 2025
430b0b3
fix conflicts
gsheni Jun 11, 2025
2369a4f
fix dry run
gsheni Jun 11, 2025
ee9b714
fix conflicts
gsheni Jun 12, 2025
bf745df
Merge branch 'main' into 15-add-daily-workflow-to-export-anaconda-dow…
gsheni Jun 13, 2025
314b182
Update dryrun.yaml
gsheni Jun 13, 2025
aea12c4
Update dryrun.yaml
gsheni Jun 13, 2025
e20cfea
fix dry run
gsheni Jun 13, 2025
5ba5f8d
fix dry run
gsheni Jun 13, 2025
a14f53c
fix dry run
gsheni Jun 13, 2025
b408543
fix write
gsheni Jun 13, 2025
d023546
remove breakpoint
gsheni Jun 13, 2025
0e0d548
Update download_analytics/time_utils.py
gsheni Jun 16, 2025
70f93ab
fix tz
gsheni Jun 16, 2025
d971653
update readme and add parameter
gsheni Jun 16, 2025
51a053f
cleanup
gsheni Jun 16, 2025
f3b4ebd
cleanup
gsheni Jun 16, 2025
d1020c6
cleanup
gsheni Jun 16, 2025
5b8dbfd
cleanup
gsheni Jun 16, 2025
17b2df0
img
gsheni Jun 16, 2025
262994e
fix based on feedback
gsheni Jun 17, 2025
db48947
fix workflow
gsheni Jun 17, 2025
719ce06
Merge branch '15-add-daily-workflow-to-export-anaconda-downloads-to-g…
gsheni Jun 17, 2025
266df1a
Update summarize.py
gsheni Jun 17, 2025
34184a5
Merge branch 'main' into 22-include-pre-releases-downloads-for-pypi-d…
gsheni Jun 17, 2025
005a0d4
Update README.md
gsheni Jun 17, 2025
f067174
Update summarize.py
gsheni Jun 17, 2025
129 changes: 81 additions & 48 deletions README.md
@@ -1,63 +1,96 @@
# Download Analytics

The Download Analytics project allows you to extract download metrics from a Python library published on [PyPI](https://pypi.org/).
The Download Analytics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).

## Overview
The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.

## Overview
The Download Analytics project is a collection of scripts and tools to extract information
about OSS project downloads from diffierent sources and to analyze them to produce user
about OSS project downloads from different sources and to analyze them to produce user
engagement metrics.

### Data sources
### Data Sources
Currently, the download data is collected from the following distributions:
* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
obtained from the public BigQuery dataset, equivalent to the information shown on
[pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/)
- More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/)
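
  For illustration only, here is a minimal sketch of querying that public dataset with the `google-cloud-bigquery` client. The table name `bigquery-public-data.pypi.file_downloads` comes from the public dataset documentation; the query this project actually runs lives in the collection scripts and may differ.

```
# Illustrative sketch only: count daily downloads for one project from the
# public PyPI downloads table. Requires `google-cloud-bigquery` and GCP
# credentials with BigQuery access; queries against this dataset are billed.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DATE(timestamp) AS day, COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = @project
      AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY day
    ORDER BY day
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter('project', 'STRING', 'sdv')]
)
for row in client.query(query, job_config=job_config):
    print(row.day, row.downloads)
```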

Currently the download data is collected from the following distributions:
* [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
- The conda package download data is provided by Anaconda, Inc. It includes package download counts
starting from January 2017. More information about this dataset can be found on the [official README.md](https://github.com/anaconda/anaconda-package-data/blob/master/README.md).
- Additional conda package downloads are retrieved using the public API provided by Anaconda. This allows for the retrieval of the current number of downloads for each file served.
- Anaconda API Endpoint: https://api.anaconda.org/package/{username}/{package_name}
- Replace `{username}` with the Anaconda channel (`conda-forge`)
- Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
- For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download record can be built.
- Both of these sources are used to track Anaconda downloads because the package data provided by Anaconda does not match the download counts shown on the website, due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
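
  As an illustration, the sketch below calls that API endpoint and totals the per-file download counts. It assumes the response JSON includes a `files` list with an `ndownloads` field per file; the project's actual collection code may differ.

```
# Illustrative sketch only: fetch the current per-file download counts for a
# package from the Anaconda API and total them. Requires the `requests` package.
import requests

def anaconda_download_count(channel, package):
    url = f'https://api.anaconda.org/package/{channel}/{package}'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Assumed response shape: each entry in 'files' reports its cumulative downloads.
    files = response.json().get('files', [])
    return sum(file_info.get('ndownloads', 0) for file_info in files)

print(anaconda_download_count('conda-forge', 'sdv'))
```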

* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
obtained from the public Big Query dataset, equivalent to the information shown on
[pepy.tech](https://pepy.tech).
* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
`conda-forge` channel on `conda`.
- The conda package download data provided by Anaconda. It includes package download counts
starting from January 2017. More information:
- https://github.com/anaconda/anaconda-package-data
- The conda package metadata data provided by Anaconda. There is a public API which allows for
the retrieval of package information, including current number of downloads.
- https://api.anaconda.org/package/{username}/{package_name}
- Replace {username} with the Anaconda username (`conda-forge`) and {package_name} with
the specific package name (`sdv`).

In the future, we may also expand the source distributions to include:

* [github](https://github.com/): Information about the project downloads from github releases.

For more information about how to configure and use the software, or about the data that is being
collected check the resources below.

### Add new libraries
In order to add new libraries, follow these steps to ensure that data is backfilled.
1. Update `config.yaml` with the new libraries (pypi project names only for now)
2. Run the [Manual collection workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml) on your branch.
- Use workflow from **your branch name**.
- List all project names from config.yaml
- Remove `7` from max days to indicate you want all data
- Pass any extra arguments (for example `--dry-run` to test your changes)
3. Let the workflow finish and check that pypi.csv contains the right data.
4. Get your pull request reviewed and merged into `main`. The daily collection workflow will fill the data for the last 30 days and future days.
- Note: The collection script looks at timestamps and avoids adding overlapping data.

### Metrics
This library collects the number of downloads for your chosen software. You can break these up along several dimensions:

- **By Month**: The number of downloads per month
- **By Version**: The number of downloads per version of the software, as determine by the software maintainers
- **By Python Version**: The number of downloads per minor Python version (eg. 3.8)
- **And more!** See the resources below for more information.
### Future Data Sources
In the future, we may expand the source distributions to include:
* [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.

## Workflows

### Daily Collection
On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).

#### Metrics
The PyPI download metrics are computed along several dimensions (see the sketch after this list):

- **By Month**: The number of downloads per month.
- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
- **By Python Version**: The number of downloads per minor Python version (e.g. 3.8).
- **By Full Python Version**: The number of downloads per full Python version (e.g. 3.9.1).
- **And more!**
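
The sketch below illustrates this kind of aggregation with pandas. The column names (`timestamp`, `version`, `python_version`) and the per-download row granularity are assumptions for the example; see [COLLECTED DATA](docs/COLLECTED_DATA.md) for the actual `pypi.csv` schema.

```
# Illustrative sketch only: aggregate raw download rows along the dimensions
# listed above. Column names are assumed for the example.
import pandas as pd

downloads = pd.read_csv('pypi.csv', parse_dates=['timestamp'])

by_month = downloads.groupby(downloads['timestamp'].dt.to_period('M')).size()
by_version = downloads.groupby('version').size()
# Minor Python version, e.g. '3.8' extracted from '3.8.10'.
minor = downloads['python_version'].str.extract(r'^(\d+\.\d+)', expand=False)
by_python_minor = downloads.groupby(minor).size()

print(by_month.head(), by_version.head(), by_python_minor.head(), sep='\n\n')
```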

### Daily Summarize

On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates download counts for each library.

The summarized data is uploaded to a GitHub repo:
- [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)

#### SDV Calculation
Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:

1. Get download counts for `sdgym` and `sdv`.
2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
4. Adjust downloads for each dependency by subtracting the `sdv` download count.
5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.

This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
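
A minimal sketch of the methodology is shown below. The raw counts are made-up numbers, and subtracting the unadjusted `sdv` count from each dependency in step 4 is one reading of the steps above; the authoritative computation lives in `summarize.py`.

```
# Illustrative sketch only: exclusive download methodology with made-up counts.
raw = {
    'sdv': 1_000_000,
    'sdgym': 50_000,
    'rdt': 1_200_000,
    'copulas': 1_100_000,
    'ctgan': 1_150_000,
    'deepecho': 1_020_000,
    'sdmetrics': 1_180_000,
}

exclusive = {'sdgym': raw['sdgym']}
# sdgym installs sdv, so sdv downloads are reduced by the sdgym count.
exclusive['sdv'] = max(0, raw['sdv'] - raw['sdgym'])
# Each direct dependency is reduced by the sdv count (it is installed with sdv).
for library in ['rdt', 'copulas', 'ctgan', 'deepecho', 'sdmetrics']:
    exclusive[library] = max(0, raw[library] - raw['sdv'])

print(exclusive)
```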

## Resources
For more information about the configuration, workflows and metrics, see the resources below.
For more information about the configuration, workflows, and metrics, see the resources below.
| | Document | Description |
| ------------- | ----------------------------------- | ----------- |
| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the Github actions. |
| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to Github Actions. |
| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the GitHub actions. |
| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to GitHub Actions. |
| :keyboard: | [DEVELOPMENT](docs/DEVELOPMENT.md) | How to install and run the scripts locally. Overview of the project implementation. |
| :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |


---

<div align="center">
<a href="https://datacebo.com"><picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo-dark-mode.png">
<img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo.png"></img>
</picture></a>
</div>
<br/>
<br/>

[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](
https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
Today, DataCebo is the proud developer of SDV, the largest ecosystem for
synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
data, including:

* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
multi table and time series data.
* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
2 changes: 1 addition & 1 deletion docs/COLLECTED_DATA.md
@@ -8,7 +8,7 @@ the aggregations metrics that are computed on them.
## PyPI Downloads

Download Analytics collects information about the downloads from PyPI by making queries to the
[public PyPI download statistics dataset on Big Query](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset)
[public PyPI download statistics dataset on BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset)
by running the following query:

```
45 changes: 23 additions & 22 deletions docs/WORKFLOWS.md
@@ -2,14 +2,14 @@

This document describes how to perform the two most common workflows:

- **Daily Collection**: Adding one or more libraries to the Github Actions configuration
- **Daily PyPI Collection**: Adding one or more libraries to the GitHub Actions configuration
to download data daily.
- **One Shot**: Collecting downloads data for one or more libraries over a specific period
using Github Actions.
- **One Shot**: Collecting download data for one or more libraries over a specific period
using GitHub Actions.

## Daily Collection - Adding libraries to the scheduled Github Actions workflow
## Daily Collection - Adding libraries to the scheduled GitHub Actions workflow

Every day at 00:00 UTC time new data is collected from all the configured sources.
Every day at 00:00 UTC time, new data is collected from all the configured sources.

The configuration about which libraries are collected is written in the [config.yaml](
../config.yaml) file, which has the following format:
@@ -34,43 +34,43 @@ projects:
- ydata-synthetic
```

In order to add a new library to the collection you need to follow these steps:
In order to add a new library to the collection, you need to follow these steps:

1. Open the [config.yaml](../config.yaml) file to edit it. If you do this directly on the Github
UI you can edit the file directly by clicking on the pencil icon at the top right corner.
1. Open the [config.yaml](../config.yaml) file to edit it. If you do this directly on the GitHub
UI, you can edit the file directly by clicking on the pencil icon at the top right corner.

| ![edit-config](imgs/edit-config.png "Edit the config.yaml file") |
| - |

2. Add the libraries that you want to track to the list under `projects`. Make sure to replicate
the same indentation as the existing libraries and use the exact same name as the corresponding
`https://pypi.org` and `https://pepy.tech` project pages, otherwise the project downloads
`https://pypi.org` and `https://pepy.tech` project pages; otherwise, the project downloads
will not be found.

For example, in the screenshot below we would be adding the library [DataSynthesizer](
For example, in the screenshot below, we would be adding the library [DataSynthesizer](
https://pypi.org/project/DataSynthesizer/):

| ![add-library](imgs/add-library.png "Add a new library to the list") |
| - |

3. Save the file, commmit it and create a Pull Request. If you are editing the file directly
on Github, all these steps can be done at the same time using the form at the bottom of
3. Save the file, commit it, and create a Pull Request. If you are editing the file directly
on GitHub, all these steps can be done at the same time using the form at the bottom of
the file.

| ![create-pr](imgs/create-pr.png "Create a Pull Request") |
| - |

After these steps are done, the Pull Request will be ready to be validated and merged, and
after it is merged the downloads of the new library will start to be added to the output CSV file.
after it is merged, the downloads of the new library will start to be added to the output CSV file.

## One Shot - Collecting data over a specific period.

Download Analytics is prepared to collect data for one or more libraries over a specific period
using a [Github Actions Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).
using a [GitHub Actions Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).

In order to do this, you will need to follow these steps:

1. Enter the [Github Actions Section of the repository](https://github.com/datacebo/download-analytics/actions)
1. Enter the [GitHub Actions Section of the repository](https://github.com/datacebo/download-analytics/actions)
and click on the [Manual Collection Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).

| ![manual-collection](imgs/manual-collection.png "Manual Collection Workflow") |
Expand All @@ -85,8 +85,8 @@ In order to do this, you will need to follow these steps:

| Argument | Example | Description |
| -------- | ------- | ----------- |
| **Project(s)** | `sdv copulas ctgan` | The project or projects that need to be collected, sparated by spaces. If left empty, all the projects included in the `config.yaml` file will be collectged. |
| **Google Folder ID** | `10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n` | The ID of the folder where results will be stored. In most cases the default does not need to be changed. |
| **Project(s)** | `sdv copulas ctgan` | The project or projects that need to be collected, separated by spaces. If left empty, all the projects included in the `config.yaml` file will be collected. |
| **Google Folder ID** | `10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n` | The ID of the folder where results will be stored. In most cases, the default does not need to be changed. |
| **Max Days** | 30 | Maximum number of historical days to collect. If a `start-date` is provided in the extra args, this is ignored. |

Additionally, the `Extra Arguments` box can be used to provide any of the following options,
@@ -96,11 +96,11 @@ as if they were command line arguments:
| -------- | ------- | ----------- |
| **Dry Run** | `--dry-run` | Simulate the execution to validate the arguments, but do not actually run any query on BigQuery |
| **Start Date** | `--start-date 2020-01-01` | Used to indicate a date from which data will be collected, in ISO format `YYYY-MM-DD`. This overrides the `max-days` setting. |
| **Force** | `--force` | Ignore the data that has been previously collected and re-collect it. This is necessary to cover gaps that may exists in the previous data. |
| **Force** | `--force` | Ignore the data that has been previously collected and re-collect it. This is necessary to cover gaps that may exist in the previous data. |
| **Add Metrics** | `--add-metrics` | Apart from collecting the raw downloads, compute the aggregation metrics. This is activated and already entered in the box by default. |

For example, to force the collection of data for `sdv` and `ctgan` since `2021-01-01` while also
computing the metrics for all the existing data we would run using this configuration:
computing the metrics for all the existing data, we would run using this configuration:

| ![run-workflow-arguments](imgs/run-workflow-arguments.png "Run Workflow Arguments") |
| - |
@@ -111,23 +111,24 @@ computing the metrics for all the existing data we would run using this configur
### Debugging workflow errors.

If a workflow execution succeeds, a green tick will show up next to it. Otherwise, a red cross
will show up. In thise case, you can try to see the logs to understand what went wrong by
will show up. In this case, you can try to see the logs to understand what went wrong by
following these steps:

1. Click on the failed workflow execution.

| ![failed-workflow](imgs/failed-workflow.png "Failed Workflow") |
| - |

2. Click on the `collect` box in the center of the screen
2. Click on the `collect` box in the center of the screen.

| ![collect-box](imgs/collect-box.png "Collect Box") |
| - |

3. Expand the `Collect Downloads Data` section and scroll to the end to see the error
3. Expand the `Collect Downloads Data` section and scroll to the end to see the error.

| ![error-log](imgs/error-log.png "Error Log") |
| - |

4. In this case, we can see that the error was that the PyDrive credentials had expired and had
to be regenerated.

3 changes: 0 additions & 3 deletions download_analytics/output.py
@@ -5,7 +5,6 @@
import pathlib

import pandas as pd
from packaging.version import parse

from download_analytics import drive

@@ -178,8 +177,6 @@ def load_csv(csv_path, read_csv_kwargs=None):
data = pd.read_csv(stream, **read_csv_kwargs)
else:
data = pd.read_csv(csv_path, **read_csv_kwargs)
if 'version' in data.columns:
data['version'] = data['version'].apply(parse)
except FileNotFoundError:
LOGGER.info('Failed to load CSV file %s: not found', csv_path)
return None