
Commit 32a3e43

gsheni and rwedge authored
Include pre-release download when summarizing download counts (#25)
* wip
* lint
* fix start date
* add project print
* fix print
* update message
* update to use pyarrow dtypes
* fix string
* update to ubuntu-latest-largeA
* update to ubuntu
* fix engine
* docstring
* use category dtype
* remove pyarrow
* fix ns
* lint
* use pyarrow everywhere
* remove pyarrow dtypes
* add readme instructions
* fix manual
* cleanup
* fix manual
* fix manual
* fix max_days
* fix docs
* wip
* wip
* wip
* fix workflow
* fix workflow
* add message to workflow
* cleanup
* fix repo
* fix slack msg
* fix slack msg
* use extensions
* summarize fix"
* use uv
* fix uv
* use cache
* change token
* add unit tests
* add unit workflow
* add dry-run
* remove unused arg
* fix dry run
* use uv in lint
* add date
* cleanup readme
* Rename daily_collect.yaml to daily_collection.yaml
* Update daily_collection.yaml
* Update daily_summarize.yaml
* Update dryrun.yaml
* Update lint.yaml
* Update manual.yaml
* Update unit.yaml
* wip
* Address feedback 2
* add version parse
* use object dtype
* fix local write
* lint
* Update daily_summarize.yaml
* cleanup
* wip
* exclude pre-releases
* wip
* cleanup
* cleanup
* cleanup
* cleanup
* update workflow
* rename workflow
* fix unit tests
* define cache break
* force reinstall
* remove force install
* lint
* fix summarize config
* cleanup
* fix dry run
* Update dryrun.yaml
* Update dryrun.yaml
* fix dry run
* fix dry run
* fix dry run
* fix write
* remove breakpoint
* Update download_analytics/time_utils.py

  Co-authored-by: Roy Wedge <[email protected]>

* fix tz
* update readme and add parameter
* cleanup
* cleanup
* cleanup
* cleanup
* img
* fix based on feedback
* fix workflow
* Update summarize.py
* Update README.md
* Update summarize.py

---------

Co-authored-by: Roy Wedge <[email protected]>
1 parent bf4ad02 commit 32a3e43

File tree

5 files changed: +153 -77 lines changed


README.md

Lines changed: 81 additions & 48 deletions
@@ -1,63 +1,96 @@
 # Download Analytics
 
-The Download Analytics project allows you to extract download metrics from a Python library published on [PyPI](https://pypi.org/).
+The Download Analytics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).
 
-## Overview
+The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.
 
+## Overview
 The Download Analytics project is a collection of scripts and tools to extract information
-about OSS project downloads from diffierent sources and to analyze them to produce user
+about OSS project downloads from different sources and to analyze them to produce user
 engagement metrics.
 
-### Data sources
+### Data Sources
+Currently, the download data is collected from the following distributions:
+* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
+  obtained from the public BigQuery dataset, equivalent to the information shown on
+  [pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/)
+  - More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/)
 
-Currently the download data is collected from the following distributions:
+* [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
+  - The conda package download data is provided by Anaconda, Inc. It includes package download counts
+    starting from January 2017. More information about this dataset can be found on the [official README.md](https://github.com/anaconda/anaconda-package-data/blob/master/README.md).
+  - Additional conda package downloads are retrieved using the public API provided by Anaconda. This allows for the retrieval of the current number of downloads for each file served.
+    - Anaconda API Endpoint: https://api.anaconda.org/package/{username}/{package_name}
+    - Replace `{username}` with the Anaconda channel (`conda-forge`)
+    - Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
+  - For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download record can be built.
+  - Both of these sources were used to track Anaconda downloads because the package data for Anaconda does not match the download count on the website. This is due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
 
-* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
-  obtained from the public Big Query dataset, equivalent to the information shown on
-  [pepy.tech](https://pepy.tech).
-* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
-  `conda-forge` channel on `conda`.
-  - The conda package download data provided by Anaconda. It includes package download counts
-    starting from January 2017. More information:
-    - https://github.com/anaconda/anaconda-package-data
-  - The conda package metadata data provided by Anaconda. There is a public API which allows for
-    the retrieval of package information, including current number of downloads.
-    - https://api.anaconda.org/package/{username}/{package_name}
-    - Replace {username} with the Anaconda username (`conda-forge`) and {package_name} with
-      the specific package name (`sdv`).
-
-In the future, we may also expand the source distributions to include:
-
-* [github](https://github.com/): Information about the project downloads from github releases.
-
-For more information about how to configure and use the software, or about the data that is being
-collected check the resources below.
-
-### Add new libraries
-In order add new libraries, it is important to follow these steps to ensure that data is backfilled.
-1. Update `config.yaml` with the new libraries (pypi project names only for now)
-2. Run the [Manual collection workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml) on your branch.
-   - Use workflow from **your branch name**.
-   - List all project names from config.yaml
-   - Remove `7` from max days to indicate you want all data
-   - Pass any extra arguments (for example `--dry-run` to test your changes)
-3. Let the workflow finish and check that pypi.csv contains the right data.
-4. Get your pull request reviewed and merged into `main`. The daily collection workflow will fill the data for the last 30 days and future days.
-   - Note: The collection script looks at timestamps and avoids adding overlapping data.
-
-### Metrics
-This library collects the number of downloads for your chosen software. You can break these up along several dimensions:
-
-- **By Month**: The number of downloads per month
-- **By Version**: The number of downloads per version of the software, as determine by the software maintainers
-- **By Python Version**: The number of downloads per minor Python version (eg. 3.8)
-- **And more!** See the resources below for more information.
+### Future Data Sources
+In the future, we may expand the source distributions to include:
+* [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.
+
+## Workflows
+
+### Daily Collection
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
+
+#### Metrics
+The PyPI download metrics are computed along several dimensions:
+
+- **By Month**: The number of downloads per month.
+- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
+- **By Python Version**: The number of downloads per minor Python version (e.g. 3.8).
+- **By Full Python Version**: The number of downloads per full Python version (e.g. 3.9.1).
+- **And more!**
+
+### Daily Summarize
+
+On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries.
+
+The summarized data is uploaded to a GitHub repo:
+- [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)
+
+#### SDV Calculation
+Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:
+
+1. Get download counts for `sdgym` and `sdv`.
+2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
+3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
+4. Adjust downloads for each dependency by subtracting the `sdv` download count.
+5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.
+
+This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
 
 ## Resources
-For more information about the configuration, workflows and metrics, see the resources below.
+For more information about the configuration, workflows, and metrics, see the resources below.
 | | Document | Description |
 | ------------- | ----------------------------------- | ----------- |
-| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the Github actions. |
-| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to Github Actions. |
+| :pilot: | [WORKFLOWS](docs/WORKFLOWS.md) | How to collect data and add new libraries to the GitHub actions. |
+| :gear: | [SETUP](docs/SETUP.md) | How to generate credentials to access BigQuery and Google Drive and add them to GitHub Actions. |
 | :keyboard: | [DEVELOPMENT](docs/DEVELOPMENT.md) | How to install and run the scripts locally. Overview of the project implementation. |
 | :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |
+
+
+---
+
+<div align="center">
+<a href="https://datacebo.com"><picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo-dark-mode.png">
+<img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo.png"></img>
+</picture></a>
+</div>
+<br/>
+<br/>
+
+[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](
+https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
+created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
+Today, DataCebo is the proud developer of SDV, the largest ecosystem for
+synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
+data, including:
+
+* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
+* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
+  multi table and time series data.
+* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
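
The per-file Anaconda API collection that the new Data Sources section describes can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual collection code; in particular, the `ndownloads` field name on each file entry is an assumption about the API response shape.

```python
from datetime import datetime, timezone

import requests

ANACONDA_API = 'https://api.anaconda.org/package/{username}/{package_name}'


def snapshot_anaconda_downloads(username, package_name):
    """Save the current per-file download counts for one conda package.

    Repeated daily, these snapshots build the historical download record
    that the README describes.
    """
    url = ANACONDA_API.format(username=username, package_name=package_name)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    payload = response.json()

    collected_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for file_info in payload.get('files', []):
        rows.append({
            'collected_at': collected_at,
            'channel': username,
            'package': package_name,
            'version': file_info.get('version'),
            'filename': file_info.get('basename'),
            # 'ndownloads' is assumed to be the per-file download counter
            # in the API response; adjust if the payload differs.
            'downloads': file_info.get('ndownloads', 0),
        })
    return rows


rows = snapshot_anaconda_downloads('conda-forge', 'sdv')
print(sum(row['downloads'] for row in rows), 'total downloads right now')
```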

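The SDV Calculation steps added above translate directly into arithmetic over raw download counts. A minimal sketch follows; the README does not say whether step 4 subtracts the raw or the sdgym-adjusted `sdv` count, so this sketch uses the raw count, and the function name is illustrative rather than the repository's actual summarize code.

```python
SDV_DEPENDENCIES = ['rdt', 'copulas', 'ctgan', 'deepecho', 'sdmetrics']


def exclusive_downloads(raw):
    """Apply the exclusive download methodology from the README.

    `raw` maps each PyPI project name to its raw download count.
    """
    adjusted = {}
    # sdgym installs sdv, so sdgym downloads are subtracted from sdv.
    adjusted['sdgym'] = raw['sdgym']
    adjusted['sdv'] = max(0, raw['sdv'] - raw['sdgym'])
    # sdv installs each direct dependency, so the sdv count is subtracted
    # from each of them; max(0, ...) keeps counts from going negative.
    for library in SDV_DEPENDENCIES:
        adjusted[library] = max(0, raw[library] - raw['sdv'])
    return adjusted


print(exclusive_downloads({
    'sdv': 1000, 'sdgym': 100, 'rdt': 1500, 'copulas': 1200,
    'ctgan': 1100, 'deepecho': 900, 'sdmetrics': 1300,
}))
```
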
docs/COLLECTED_DATA.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ the aggregations metrics that are computed on them.
 ## PyPI Downloads
 
 Download Analytics collects information about the downloads from PyPI by making queries to the
-[public PyPI download statistics dataset on Big Query](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset)
+[public PyPI download statistics dataset on BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset)
 by running the following query:
 
 ```
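
The query itself is cut off in this diff excerpt. For orientation, a representative query against the public `bigquery-public-data.pypi.file_downloads` table is sketched below using the `google-cloud-bigquery` client; the project's actual query may differ.

```python
import datetime

from google.cloud import bigquery

# A representative per-day download count query; the project's actual
# query (truncated in the diff above) may be different.
QUERY = """
    SELECT
        DATE(timestamp) AS day,
        COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = @project
      AND DATE(timestamp) BETWEEN @start_date AND @end_date
    GROUP BY day
    ORDER BY day
"""

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter('project', 'STRING', 'sdv'),
        bigquery.ScalarQueryParameter(
            'start_date', 'DATE', datetime.date(2024, 1, 1)),
        bigquery.ScalarQueryParameter(
            'end_date', 'DATE', datetime.date(2024, 1, 31)),
    ]
)
for row in client.query(QUERY, job_config=job_config).result():
    print(row.day, row.downloads)
```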

docs/WORKFLOWS.md

Lines changed: 23 additions & 22 deletions
@@ -2,14 +2,14 @@
 
 This document describes how to perform the two most common workflows:
 
-- **Daily Collection**: Adding one or more libraries to the Github Actions configuration
+- **Daily PyPI Collection**: Adding one or more libraries to the GitHub Actions configuration
   to download data daily.
-- **One Shot**: Collecting downloads data for one or more libraries over a specific period
-  using Github Actions.
+- **One Shot**: Collecting download data for one or more libraries over a specific period
+  using GitHub Actions.
 
-## Daily Collection - Adding libraries to the scheduled Github Actions workflow
+## Daily Collection - Adding libraries to the scheduled GitHub Actions workflow
 
-Every day at 00:00 UTC time new data is collected from all the configured sources.
+Every day at 00:00 UTC time, new data is collected from all the configured sources.
 
 The configuration about which libraries are collected is written in the [config.yaml](
 ../config.yaml) file, which has the following format:
@@ -34,43 +34,43 @@ projects:
 - ydata-synthetic
 ```
 
-In order to add a new library to the collection you need to follow these steps:
+In order to add a new library to the collection, you need to follow these steps:
 
-1. Open the [config.yaml](../config.yaml) file to edit it. If you do this directly on the Github
-   UI you can edit the file directly by clicking on the pencil icon at the top right corner.
+1. Open the [config.yaml](../config.yaml) file to edit it. If you do this directly on the GitHub
+   UI, you can edit the file directly by clicking on the pencil icon at the top right corner.
 
    | ![edit-config](imgs/edit-config.png "Edit the config.yaml file") |
   | - |
 
 2. Add the libraries that you want to track to the list under `projects`. Make sure to replicate
    the same indentation as the existing libraries and use the exact same name as the corresponding
-   `https://pypi.org` and `https://pepy.tech` project pages, otherwise the project downloads
+   `https://pypi.org` and `https://pepy.tech` project pages; otherwise, the project downloads
   will not be found.
 
-   For example, in the screenshot below we would be adding the library [DataSynthesizer](
+   For example, in the screenshot below, we would be adding the library [DataSynthesizer](
   https://pypi.org/project/DataSynthesizer/):
 
   | ![add-library](imgs/add-library.png "Add a new library to the list") |
  | - |
 
-3. Save the file, commmit it and create a Pull Request. If you are editing the file directly
-   on Github, all these steps can be done at the same time using the form at the bottom of
+3. Save the file, commit it, and create a Pull Request. If you are editing the file directly
+   on GitHub, all these steps can be done at the same time using the form at the bottom of
   the file.
 
   | ![create-pr](imgs/create-pr.png "Create a Pull Request") |
  | - |
 
 After these steps are done, the Pull Request will be ready to be validated and merged, and
-after it is merged the downloads of the new library will start to be added to the output CSV file.
+after it is merged, the downloads of the new library will start to be added to the output CSV file.
 
 ## One Shot - Collecting data over a specific period.
 
 Download Analytics is prepared to collect data for one or more libraries over a specific period
-using a [Github Actions Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).
+using a [GitHub Actions Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).
 
 In order to do this, you will need to follow these steps:
 
-1. Enter the [Github Actions Section of the repository](https://github.com/datacebo/download-analytics/actions)
+1. Enter the [GitHub Actions Section of the repository](https://github.com/datacebo/download-analytics/actions)
   and click on the [Manual Collection Workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml).
 
   | ![manual-collection](imgs/manual-collection.png "Manual Collection Workflow") |
@@ -85,8 +85,8 @@ In order to do this, you will need to follow these steps:
 
 | Argument | Example | Description |
 | -------- | ------- | ----------- |
-| **Project(s)** | `sdv copulas ctgan` | The project or projects that need to be collected, sparated by spaces. If left empty, all the projects included in the `config.yaml` file will be collectged. |
-| **Google Folder ID** | `10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n` | The ID of the folder where results will be stored. In most cases the default does not need to be changed. |
+| **Project(s)** | `sdv copulas ctgan` | The project or projects that need to be collected, separated by spaces. If left empty, all the projects included in the `config.yaml` file will be collected. |
+| **Google Folder ID** | `10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n` | The ID of the folder where results will be stored. In most cases, the default does not need to be changed. |
 | **Max Days** | 30 | Maximum number of historical days to collect. If a `start-date` is provided in the extra args, this is ignored. |
 
 Additionally, the `Extra Arguments` box can be used to provide any of the following options,
@@ -96,11 +96,11 @@ as if they were command line arguments:
 | -------- | ------- | ----------- |
 | **Dry Run** | `--dry-run` | Simulate the execution to validate the arguments, but do not actually run any query on BigQuery |
 | **Start Date** | `--start-date 2020-01-01` | Used to indicate a date from which data will be collected, in ISO format `YYYY-MM-DD`. This overrides the `max-days` setting. |
-| **Force** | `--force` | Ignore the data that has been previously collected and re-collect it. This is necessary to cover gaps that may exists in the previous data. |
+| **Force** | `--force` | Ignore the data that has been previously collected and re-collect it. This is necessary to cover gaps that may exist in the previous data. |
 | **Add Metrics** | `--add-metrics` | Apart from collecting the raw downloads, compute the aggregation metrics. This is activated and already entered in the box by default. |
 
 For example, to force the collection of data for `sdv` and `ctgan` since `2021-01-01` while also
-computing the metrics for all the existing data we would run using this configuration:
+computing the metrics for all the existing data, we would run using this configuration:
 
 | ![run-workflow-arguments](imgs/run-workflow-arguments.png "Run Workflow Arguments") |
 | - |
@@ -111,23 +111,24 @@ computing the metrics for all the existing data we would run using this configuration:
 ### Debugging workflow errors.
 
 If a workflow execution succeeds, a green tick will show up next to it. Otherwise, a red cross
-will show up. In thise case, you can try to see the logs to understand what went wrong by
+will show up. In this case, you can try to see the logs to understand what went wrong by
 following these steps:
 
 1. Click on the failed workflow execution.
 
   | ![failed-workflow](imgs/failed-workflow.png "Failed Workflow") |
  | - |
 
-2. Click on the `collect` box in the center of the screen
+2. Click on the `collect` box in the center of the screen.
 
   | ![collect-box](imgs/collect-box.png "Collect Box") |
  | - |
 
-3. Expand the `Collect Downloads Data` section and scroll to the end to see the error
+3. Expand the `Collect Downloads Data` section and scroll to the end to see the error.
 
   | ![error-log](imgs/error-log.png "Error Log") |
  | - |
 
 4. In this case, we can see that the error was that the PyDrive credentials had expired and had
    to be regenerated.
+
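
One piece of logic encoded in the argument tables above is worth spelling out: an explicit `--start-date` overrides `Max Days`. A minimal sketch of that resolution follows, with an illustrative function name rather than the repository's actual argument handling.

```python
from datetime import date, timedelta


def resolve_start_date(max_days, start_date=None):
    """Pick the collection start date.

    An explicit `start_date` (ISO `YYYY-MM-DD`) wins; otherwise the
    window is the last `max_days` days, mirroring the workflow arguments.
    """
    if start_date is not None:
        return date.fromisoformat(start_date)
    return date.today() - timedelta(days=max_days)


print(resolve_start_date(30))                # default: last 30 days
print(resolve_start_date(30, '2021-01-01'))  # --start-date wins
```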

download_analytics/output.py

Lines changed: 0 additions & 3 deletions
@@ -5,7 +5,6 @@
 import pathlib
 
 import pandas as pd
-from packaging.version import parse
 
 from download_analytics import drive
 
@@ -178,8 +177,6 @@ def load_csv(csv_path, read_csv_kwargs=None):
             data = pd.read_csv(stream, **read_csv_kwargs)
         else:
             data = pd.read_csv(csv_path, **read_csv_kwargs)
-        if 'version' in data.columns:
-            data['version'] = data['version'].apply(parse)
     except FileNotFoundError:
         LOGGER.info('Failed to load CSV file %s: not found', csv_path)
         return None
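
This change removes the eager `packaging.version.parse` call so the `version` column stays a plain string, matching the "use object dtype" commits in this PR. Versions can still be parsed lazily where needed; since the commit title says pre-release downloads are now included when summarizing, below is a hedged sketch of flagging pre-releases at that point. It is illustrative only, not the repository's actual summarize code.

```python
import pandas as pd
from packaging.version import InvalidVersion, Version


def is_prerelease(version_string):
    """Return True for pre-release versions such as '1.1.0rc1'."""
    try:
        return Version(str(version_string)).is_prerelease
    except InvalidVersion:
        return False


data = pd.DataFrame({
    'project': ['sdv', 'sdv', 'sdv'],
    'version': ['1.0.0', '1.1.0rc1', '1.1.0'],  # stays object dtype
    'downloads': [100, 5, 80],
})
data['is_prerelease'] = data['version'].map(is_prerelease)

# Pre-release rows are kept in the total rather than filtered out.
print(data.groupby('project')['downloads'].sum())
```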
