
Commit a340f18

Update README and use secrets for folder outputs (#29)
* renamed repo
* renamed repo
* renamed repo
* rename folder
* wip
* fix dry run
* fix output folder
* fix output folder
* fix dry run
* force input/output folders
* lint
* speed up dry run
* lint
* fix secret
* fix secret
* fix anaconda
* fix input
* fix readme
* fix input file
* update docstring
* update docstring
1 parent 4aae618 commit a340f18

18 files changed, +100 -86 lines changed

.github/workflows/daily_collection.yaml

Lines changed: 12 additions & 9 deletions
@@ -22,6 +22,7 @@ on:
 
 jobs:
   collect:
+    environment: prod
     runs-on: ubuntu-latest-large
     timeout-minutes: 25
     steps:
@@ -34,28 +35,27 @@ jobs:
           cache-dependency-glob: |
             **/pyproject.toml
             **/__main__.py
-      - name: Install pip and dependencies
+      - name: Install dependencies
         run: |
-          uv pip install -U pip
           uv pip install .
       - name: Collect PyPI Downloads
         run: |
           uv run pymetrics collect-pypi \
-            --verbose \
             --max-days ${{ inputs.max_days_pypi || 30 }} \
             --add-metrics \
-            --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
+            --output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }}
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
+          PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
       - name: Collect Anaconda Downloads
         run: |
           uv run pymetrics collect-anaconda \
-            --output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
-            --max-days ${{ inputs.max_days_anaconda || 90 }} \
-            --verbose
+            --output-folder ${{ secrets.ANACONDA_OUTPUT_FOLDER }} \
+            --max-days ${{ inputs.max_days_anaconda || 90 }}
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+          ANACONDA_OUTPUT_FOLDER: ${{ secrets.ANACONDA_OUTPUT_FOLDER }}
   alert:
     needs: [collect]
     runs-on: ubuntu-latest
@@ -69,9 +69,12 @@ jobs:
           activate-environment: true
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install -e .[dev]
       - name: Slack alert if failure
-        run: uv run python -m pymetrics.slack_utils -r ${{ github.run_id }} -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }}
+        run: |
+          uv run python -m pymetrics.slack_utils \
+            -r ${{ github.run_id }} \
+            -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
+            -m 'Daily Collection PyMetrics failed :fire: :dumpster-fire: :fire:'
         env:
           SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
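
Reviewer note: the workflow now passes the output folder both as a `--output-folder` argument and as an environment variable, and the removed lines show that the value has the form `gdrive://<folder-id>`. The following is a minimal sketch of how such a value could be resolved on the Python side. The helper name `resolve_output_folder`, its fallback to the environment variable, and the `"local"` label are hypothetical and are not taken from the pymetrics code base.

```python
import os
from urllib.parse import urlparse


def resolve_output_folder(cli_value=None, env_var="PYPI_OUTPUT_FOLDER", environ=None):
    """Return a (kind, location) pair for an output folder.

    Hypothetical helper: prefers the explicit CLI value and falls back to
    the environment variable that the workflow exports from the secret.
    """
    environ = os.environ if environ is None else environ
    value = cli_value or environ.get(env_var)
    if not value:
        raise ValueError(f"No output folder given and {env_var} is not set")

    parsed = urlparse(value)
    if parsed.scheme == "gdrive":
        # For gdrive://<folder-id>, urlparse puts the folder id in netloc.
        if not parsed.netloc:
            raise ValueError("gdrive:// URI is missing a folder id")
        return "gdrive", parsed.netloc

    # Assumption: anything without the gdrive scheme is a local path.
    return "local", value
```

Called as `resolve_output_folder("gdrive://abc123")`, the sketch returns `("gdrive", "abc123")`; with no argument it reads the variable named by `env_var` instead, so the folder id never has to appear in the repository.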

.github/workflows/daily_summarize.yaml

Lines changed: 5 additions & 6 deletions
@@ -12,6 +12,7 @@ on:
 
 jobs:
   summarize:
+    environment: prod
     runs-on: ubuntu-latest-large
     timeout-minutes: 10
     steps:
@@ -25,15 +26,14 @@ jobs:
             **/pyproject.toml
             **/__main__.py
       - name: Install pip and dependencies
-        run: |
-          uv pip install -U pip
-          uv pip install .
+        run: uv pip install .
       - name: Run Summarize
         run: |
           uv run pymetrics summarize \
-            --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
+            --output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }}
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+          PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
       - uses: actions/checkout@v4
         with:
           repository: sdv-dev/sdv-dev.github.io
@@ -63,13 +63,12 @@ jobs:
           activate-environment: true
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install .[dev]
       - name: Slack alert if failure
         run: |
           uv run python -m pymetrics.slack_utils \
             -r ${{ github.run_id }} \
             -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
-            -m 'Summarize Analytics build failed :fire: :dumpster-fire: :fire:'
+            -m 'Daily Summarize PyMetrics failed :fire: :dumpster-fire: :fire:'
         env:
           SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}

.github/workflows/dryrun.yaml

Lines changed: 7 additions & 7 deletions
@@ -10,6 +10,7 @@ concurrency:
   cancel-in-progress: true
 jobs:
   dry_run:
+    environment: stage
     runs-on: ubuntu-latest-large
     timeout-minutes: 25
     steps:
@@ -24,33 +25,32 @@ jobs:
             **/__main__.py
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install .
       - name: Collect PyPI Downloads - Dry Run
         run: |
           uv run pymetrics collect-pypi \
-            --verbose \
             --max-days 30 \
             --add-metrics \
-            --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n \
+            --output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }} \
             --dry-run
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
+          PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}
       - name: Collect Anaconda Downloads - Dry Run
         run: |
           uv run pymetrics collect-anaconda \
-            --output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
             --max-days 90 \
-            --verbose \
+            --output-folder ${{ secrets.ANACONDA_OUTPUT_FOLDER }} \
             --dry-run
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+          ANACONDA_OUTPUT_FOLDER: ${{ secrets.ANACONDA_OUTPUT_FOLDER }}
       - name: Summarize - Dry Run
         run: |
           uv run pymetrics summarize \
-            --verbose \
-            --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n \
+            --output-folder ${{ secrets.PYPI_OUTPUT_FOLDER }} \
             --dry-run
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+          PYPI_OUTPUT_FOLDER: ${{ secrets.PYPI_OUTPUT_FOLDER }}

.github/workflows/lint.yaml

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ jobs:
           activate-environment: true
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install .[dev]
       - name: Run lint checks
         run: uv run invoke lint

.github/workflows/manual.yaml

Lines changed: 0 additions & 1 deletion
@@ -32,7 +32,6 @@ jobs:
           activate-environment: true
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install .
       - name: Collect Downloads Data
         run: |

.github/workflows/unit.yaml

Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ jobs:
           activate-environment: true
       - name: Install pip and dependencies
         run: |
-          uv pip install -U pip
           uv pip install -e .[test,dev]
       - name: Run summarize
         run: |

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ bigquery_creds.json
 client_secrets.json
 credentials.json
 sdv-dev.github.io/*
+uv.lock
 
 notebooks
 *.xlsx

MANIFEST.in

Lines changed: 0 additions & 6 deletions
This file was deleted.

README.md

Lines changed: 16 additions & 11 deletions
@@ -1,5 +1,11 @@
-# PyMetrics
+<div align="center">
+<br/>
+<p align="center">
+<i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
+</p>
+<div align="left">
 
+# PyMetrics
 The PyMetrics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).
 
 The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.
@@ -13,8 +19,8 @@ engagement metrics.
 Currently, the download data is collected from the following distributions:
 * [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
 obtained from the public BigQuery dataset, equivalent to the information shown on
-[pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/)
-- More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/)
+[pepy.tech](https://pepy.tech), [ClickPy](https://clickpy.clickhouse.com/) or [pypistats](https://pypistats.org/).
+- More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/).
 
 * [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
 - The conda package download data is provided by Anaconda, Inc. It includes package download counts
@@ -24,7 +30,6 @@ Currently, the download data is collected from the following distributions:
 - Replace `{username}` with the Anaconda channel (`conda-forge`)
 - Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
 - For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download recording can be built.
-- Both of these sources were used to track Anaconda downloads because the package data for Anaconda does not match the download count on the website. This is due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
 
 ### Future Data Sources
 In the future, we may expand the source distributions to include:
@@ -33,31 +38,28 @@ In the future, we may expand the source distributions to include:
 ## Workflows
 
 ### Daily Collection
-On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
 
 #### Metrics
 This PyPI download metrics are computed along several dimensions:
 
 - **By Month**: The number of downloads per month.
 - **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
 - **By Python Version**: The number of downloads per minor Python version (eg. 3.8).
-- **By Full Python Version**: The number of downloads per full Python version (eg. 3.9.1).
 - **And more!**
 
 ### Daily Summarize
 
-On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries.
-
-The summarized data is uploaded to a GitHub repo:
+On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries. The summarized data is published to a GitHub repo:
 - [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)
 
 #### SDV Calculation
 Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:
 
 1. Get download counts for `sdgym` and `sdv`.
-2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
+2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since `sdgym` depends on `sdv`).
 3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
-4. Adjust downloads for each dependency by subtracting the `sdv` download count.
+4. Adjust downloads for each dependency by subtracting the `sdv` download count (since `sdv` has a direct dependency).
 5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.
 
 This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
@@ -72,6 +74,9 @@ For more information about the configuration, workflows, and metrics, see the re
 | :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |
 
 
+## Known Issues
+1. The conda package download data for Anaconda does not match the download count shown on the website. This is due to missing download data in the conda package download data. See this: https://github.com/anaconda/anaconda-package-data/issues/45
+
 ---
 
 <div align="center">
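
Reviewer note: the five "SDV Calculation" steps in the README can be sketched in a few lines of Python. This is only an illustration of the methodology as the README words it, not the pymetrics implementation. The function name `exclusive_downloads` and the sample counts are made up, and the sketch assumes that step 4 subtracts the raw (unadjusted) `sdv` count, which the README does not specify.

```python
# Direct SDV dependencies, as listed in step 3 of the README.
SDV_DEPENDENCIES = ("rdt", "copulas", "ctgan", "deepecho", "sdmetrics")


def exclusive_downloads(raw_counts):
    """Apply the exclusive-download adjustment to a dict of raw counts."""
    # Step 1: raw counts for sdgym and sdv (missing libraries count as 0).
    sdgym = raw_counts.get("sdgym", 0)
    sdv = raw_counts.get("sdv", 0)

    adjusted = {
        "sdgym": sdgym,
        # Steps 2 and 5: remove sdgym-driven installs, clamp at zero.
        "sdv": max(0, sdv - sdgym),
    }
    for name in SDV_DEPENDENCIES:
        # Steps 3 to 5. Assumption: the raw sdv count is subtracted,
        # because every sdv install also pulls in each dependency.
        adjusted[name] = max(0, raw_counts.get(name, 0) - sdv)
    return adjusted


# Made-up counts, for illustration only.
raw = {
    "sdgym": 100,
    "sdv": 1000,
    "rdt": 1500,
    "copulas": 1200,
    "ctgan": 900,
    "deepecho": 1000,
    "sdmetrics": 2500,
}
print(exclusive_downloads(raw))
```

With these sample numbers, `sdv` drops from 1000 to 900 and `ctgan` is clamped to 0, since its raw count is below the `sdv` count.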

config.yaml

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-output-folder: gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
 max-days: 7
 projects:
   - sdv
