Merged
21 changes: 12 additions & 9 deletions .github/workflows/daily_collection.yaml
@@ -22,6 +22,7 @@ on:

jobs:
collect:
environment: prod
runs-on: ubuntu-latest-large
timeout-minutes: 25
steps:
@@ -34,28 +35,27 @@ jobs:
cache-dependency-glob: |
**/pyproject.toml
**/__main__.py
- name: Install pip and dependencies
- name: Install dependencies
run: |
uv pip install -U pip
uv pip install .
- name: Collect PyPI Downloads
run: |
uv run pymetrics collect-pypi \
--verbose \
--max-days ${{ inputs.max_days_pypi || 30 }} \
--add-metrics \
--output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
--output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }}
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
- name: Collect Anaconda Downloads
run: |
uv run pymetrics collect-anaconda \
--output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
--max-days ${{ inputs.max_days_anaconda || 90 }} \
--verbose
--output-folder ${{ secrets.ANACONDA_OUTPUT_FOLDER }} \
--max-days ${{ inputs.max_days_anaconda || 90 }}
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
ANACONDA_OUTPUT_FOLDER: ${{ secrets.ANACONDA_OUTPUT_FOLDER }}
alert:
needs: [collect]
runs-on: ubuntu-latest
@@ -69,9 +69,12 @@ jobs:
activate-environment: true
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install -e .[dev]
- name: Slack alert if failure
run: uv run python -m pymetrics.slack_utils -r ${{ github.run_id }} -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }}
run: |
uv run python -m pymetrics.slack_utils \
-r ${{ github.run_id }} \
-c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
-m 'Daily Collection PyMetrics failed :fire: :dumpster-fire: :fire:'
env:
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
11 changes: 5 additions & 6 deletions .github/workflows/daily_summarize.yaml
@@ -12,6 +12,7 @@ on:

jobs:
summarize:
environment: prod
runs-on: ubuntu-latest-large
timeout-minutes: 10
steps:
@@ -25,15 +26,14 @@ jobs:
**/pyproject.toml
**/__main__.py
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install .
run: uv pip install .
- name: Run Summarize
run: |
uv run pymetrics summarize \
--output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
--output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }}
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
- uses: actions/checkout@v4
with:
repository: sdv-dev/sdv-dev.github.io
@@ -63,13 +63,12 @@ jobs:
activate-environment: true
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install .[dev]
- name: Slack alert if failure
run: |
uv run python -m pymetrics.slack_utils \
-r ${{ github.run_id }} \
-c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
-m 'Summarize Analytics build failed :fire: :dumpster-fire: :fire:'
-m 'Daily Summarize PyMetrics failed :fire: :dumpster-fire: :fire:'
env:
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
14 changes: 7 additions & 7 deletions .github/workflows/dryrun.yaml
@@ -10,6 +10,7 @@ concurrency:
cancel-in-progress: true
jobs:
dry_run:
environment: stage
runs-on: ubuntu-latest-large
timeout-minutes: 25
steps:
@@ -24,33 +25,32 @@ jobs:
**/__main__.py
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install .
- name: Collect PyPI Downloads - Dry Run
run: |
uv run pymetrics collect-pypi \
--verbose \
--max-days 30 \
--add-metrics \
--output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n \
--output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }} \
--dry-run
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
- name: Collect Anaconda Downloads - Dry Run
run: |
uv run pymetrics collect-anaconda \
--output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
--max-days 90 \
--verbose \
--output-folder ${{ secrets.ANACONDA_OUTPUT_FOLDER }} \
--dry-run
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
ANACONDA_OUTPUT_FOLDER: ${{ secrets.ANACONDA_OUTPUT_FOLDER }}
- name: Summarize - Dry Run
run: |
uv run pymetrics summarize \
--verbose \
--output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n \
--output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }} \
--dry-run
env:
PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
1 change: 0 additions & 1 deletion .github/workflows/lint.yaml
@@ -20,7 +20,6 @@ jobs:
activate-environment: true
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install .[dev]
- name: Run lint checks
run: uv run invoke lint
1 change: 0 additions & 1 deletion .github/workflows/manual.yaml
@@ -32,7 +32,6 @@ jobs:
activate-environment: true
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install .
- name: Collect Downloads Data
run: |
1 change: 0 additions & 1 deletion .github/workflows/unit.yaml
@@ -26,7 +26,6 @@ jobs:
activate-environment: true
- name: Install pip and dependencies
run: |
uv pip install -U pip
uv pip install -e .[test,dev]
- name: Run summarize
run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ bigquery_creds.json
client_secrets.json
credentials.json
sdv-dev.github.io/*
uv.lock

notebooks
*.xlsx
6 changes: 0 additions & 6 deletions MANIFEST.in

This file was deleted.

27 changes: 16 additions & 11 deletions README.md
@@ -1,5 +1,11 @@
# PyMetrics
<div align="center">
<br/>
<p align="center">
<i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
</p>
<div align="left">

# PyMetrics
The PyMetrics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).

The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.
@@ -13,8 +19,8 @@ engagement metrics.
Currently, the download data is collected from the following distributions:
* [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
obtained from the public BigQuery dataset, equivalent to the information shown on
[pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/)
- More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/)
[pepy.tech](https://pepy.tech), [ClickPy](https://clickpy.clickhouse.com/) or [pypistats](https://pypistats.org/).
- More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/).

* [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
- The conda package download data is provided by Anaconda, Inc. It includes package download counts
@@ -24,7 +30,6 @@ Currently, the download data is collected from the following distributions:
- Replace `{username}` with the Anaconda channel (`conda-forge`)
- Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
- For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download recording can be built.
- Both of these sources were used to track Anaconda downloads because the package data for Anaconda does not match the download count on the website. This is due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
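The snapshot step above can be sketched roughly as follows. The endpoint template and the `files`/`basename`/`ndownloads` response fields are assumptions based on the public `api.anaconda.org` package API, and the helper names are illustrative rather than the project's actual code:

```python
import json
import urllib.request
from datetime import date

# Assumed endpoint shape for the public Anaconda package API.
ANACONDA_API = "https://api.anaconda.org/package/{username}/{package_name}"


def rows_from_package(package: dict, channel: str, name: str, day: str) -> list:
    """Turn one API response into per-file download rows for `day`."""
    return [
        {
            "date": day,
            "channel": channel,
            "package": name,
            "file": file_info["basename"],
            "downloads": file_info["ndownloads"],
        }
        for file_info in package.get("files", [])
    ]


def snapshot_anaconda_downloads(channel: str, name: str) -> list:
    """Save the current per-file download counts for one conda package."""
    url = ANACONDA_API.format(username=channel, package_name=name)
    with urllib.request.urlopen(url) as response:
        package = json.load(response)
    return rows_from_package(package, channel, name, date.today().isoformat())
```

Running something like `snapshot_anaconda_downloads('conda-forge', 'sdv')` once per day and appending the rows is what builds the historical download record over time.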

### Future Data Sources
In the future, we may expand the source distributions to include:
@@ -33,31 +38,28 @@ In the future, we may expand the source distributions to include:
## Workflows

### Daily Collection
On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).

#### Metrics
The PyPI download metrics are computed along several dimensions:

- **By Month**: The number of downloads per month.
- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
- **By Python Version**: The number of downloads per minor Python version (e.g. 3.8).
- **By Full Python Version**: The number of downloads per full Python version (e.g. 3.9.1).
- **And more!**
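As a sketch of what these aggregations involve (the `timestamp`, `version`, and `python_version` record fields are assumed names; the real `pypi.csv` schema may differ):

```python
from collections import Counter


def downloads_by_dimension(rows):
    """Aggregate download records along the dimensions listed above.

    Each row is a dict with 'timestamp' (ISO date string), 'version' and
    'python_version' keys -- assumed column names for illustration.
    """
    by_month = Counter(row["timestamp"][:7] for row in rows)          # '2021-01'
    by_version = Counter(row["version"] for row in rows)
    by_full_python = Counter(row["python_version"] for row in rows)   # '3.9.1'
    # Minor Python version, e.g. '3.8' from '3.8.10'.
    by_minor_python = Counter(
        ".".join(row["python_version"].split(".")[:2]) for row in rows
    )
    return {
        "by_month": by_month,
        "by_version": by_version,
        "by_python_version": by_minor_python,
        "by_full_python_version": by_full_python,
    }
```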

### Daily Summarize

On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries.

The summarized data is uploaded to a GitHub repo:
On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries. The summarized data is published to a GitHub repo:
- [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)

#### SDV Calculation
Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:

1. Get download counts for `sdgym` and `sdv`.
2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since `sdgym` depends on `sdv`).
3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
4. Adjust downloads for each dependency by subtracting the `sdv` download count.
4. Adjust downloads for each dependency by subtracting the `sdv` download count (since `sdv` depends directly on each of them).
5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.

This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
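The five steps can be sketched as follows. Whether step 4 subtracts the raw or the adjusted `sdv` count is not spelled out above; this sketch uses the raw count, since every `sdv` install also pulls in the dependencies:

```python
def exclusive_sdv_downloads(raw: dict) -> dict:
    """Apply the exclusive download methodology to raw per-library counts."""
    adjusted = dict(raw)
    # Step 2: sdgym installs already include sdv, so remove them.
    adjusted["sdv"] = raw["sdv"] - raw["sdgym"]
    # Steps 3-4: every sdv install also pulls in each direct dependency.
    for dependency in ("rdt", "copulas", "ctgan", "deepecho", "sdmetrics"):
        adjusted[dependency] = raw[dependency] - raw["sdv"]
    # Step 5: clamp so that no download count goes negative.
    return {name: max(0, count) for name, count in adjusted.items()}
```

For example, with 100 raw `sdv` downloads and 30 raw `sdgym` downloads, the adjusted `sdv` figure is 70, and a dependency with fewer raw downloads than `sdv` is clamped to 0.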
@@ -72,6 +74,9 @@ For more information about the configuration, workflows, and metrics, see the re
| :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |


## Known Issues
1. The conda package download data for Anaconda does not match the download count shown on the website, due to missing data in the conda package dataset. See: https://github.com/anaconda/anaconda-package-data/issues/45

---

<div align="center">
1 change: 0 additions & 1 deletion config.yaml
@@ -1,4 +1,3 @@
output-folder: gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
max-days: 7
projects:
- sdv
Expand Down
11 changes: 5 additions & 6 deletions docs/DEVELOPMENT.md
@@ -74,7 +74,7 @@ metric spreadsheets would look like this:

```bash
$ pymetrics collect-pypi --verbose --projects sdv ctgan --start-date 2021-01-01 \
--add-metrics --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
--add-metrics --output-folder 'gdrive://{folder_id}'
```

For more details about the data that this would collect and which files would be generated
Expand All @@ -83,12 +83,12 @@ have a look at the [COLLECTED_DATA.md](COLLECTED_DATA.md) document.
## Python Interface

The Python entry point that is equivalent to the CLI explained above is the function
`pymetrics.main.collect_downloads`.
`pymetrics.main.collect_pypi_downloads`.

This function has the following interface:

```
collect_downloads(projects, output_folder, start_date=None, max_days=1, credentials_file=None,
collect_pypi_downloads(projects, output_folder, start_date=None, max_days=1, credentials_file=None,
dry_run=False, force=False, add_metrics=True)
Pull data about the downloads of a list of projects.

@@ -97,8 +97,7 @@ collect_downloads(projects, output_folder, start_date=None, max_days=1, credenti
List of projects to analyze.
output_folder (str):
Folder in which project downloads will be stored.
It can be passed as a local folder or as a Google Drive path in the format
`gdrive://{folder_id}`.
It can be passed as a local folder or as a Google Drive path in the format `gdrive://{folder_id}`.
start_date (datetime or None):
Date from which to start collecting data. If `None`,
start_date will be current date - `max_days`.
@@ -138,7 +137,7 @@ following modules:
* `bq.py`: Implements the code to run queries on Big Query.
* `drive.py`: Implements the functions to upload files to and download files from Google Drive.
* `__main__.py`: Implements the Command Line Interface of the project.
* `main.py`: Implements the `collect_downloads` function.
* `main.py`: Implements the `collect_pypi_downloads` function.
* `metrics.py`: Implements the functions to compute the aggregation metrics and trigger the
creation of the corresponding spreadsheets.
* `output.py`: Implements the functions to read and write CSV files and spreadsheets, both
2 changes: 1 addition & 1 deletion docs/WORKFLOWS.md
@@ -16,7 +16,7 @@ The configuration about which libraries are collected is written in the [config.

```yaml
# Name or Google Drive ID of the output folder
output-path: gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
output-path: gdrive://{folder_id}

# Maximum number of days to include in the query
max-days: 7