
Commit 538f270

Add ci column to determine if package download came from CI (#33)
* updates
* updates
* updates
* fix
* fix
* fix parse version
* fix readme
* update filename
1 parent 0917f4f commit 538f270

File tree

10 files changed: +180 −59 lines changed


.github/workflows/daily_summarize.yaml renamed to .github/workflows/daily_summarization.yaml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-name: Daily Summarize
+name: Daily Summarization
 
 on:
   workflow_dispatch:

README.md

Lines changed: 77 additions & 10 deletions

@@ -48,20 +48,37 @@ Currently, the download data is collected from the following distributions:
 In the future, we may expand the source distributions to include:
 
 * [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.
+
+# Install
+Install pymetrics using pip (or uv):
+```shell
+pip install git+ssh://[email protected]/sdv-dev/pymetrics
+```
+
+## Local Usage
+Collect metrics from PyPI by running `pymetrics` on your computer. You need to provide the following:
+
+1. BigQuery credentials. In order to get PyPI download data, you need to execute queries on Google BigQuery.
+   Therefore, you will need an authentication JSON file, which must be provided to you by a privileged admin.
+   Once you have this JSON file, export the contents of the credentials file into a
+   `BIGQUERY_CREDENTIALS` environment variable.
+2. A list of PyPI projects for which to collect the download metrics, defined in a YAML file.
+   See [config.yaml](./config.yaml) for an example.
+3. Optional: a set of Google Drive credentials, provided in the format required by `PyDrive`. The
+   credentials can be passed via the `PYDRIVE_CREDENTIALS` environment variable.
+   See the [instructions from PyDrive](https://pythonhosted.org/PyDrive/quickstart.html).
+
+You can run pymetrics with the following CLI command:
+
+```shell
+pymetrics collect-pypi --max-days 30 --add-metrics --output-folder {OUTPUT_FOLDER}
+```
+
 ## Workflows
 
 ### Daily Collection
-On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
-
-#### Metrics
-This PyPI download metrics are computed along several dimensions:
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see [Aggregation Metrics](#aggregation-metrics)).
 
-- **By Month**: The number of downloads per month.
-- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
-- **By Python Version**: The number of downloads per minor Python version (eg. 3.8).
-- **And more!**
-
-### Daily Summarize
+### Daily Summarization
 
 On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries. The summarized data is published to a GitHub repo:
 - [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)
@@ -77,5 +94,55 @@ Installing the main SDV library also installs all the other libraries as dependencies
 
 This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
 
+## PyPI Data
+PyMetrics collects download information from PyPI by querying the [public PyPI download statistics dataset on BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset). The following data fields are captured for each download event:
+
+**Temporal & Geographic Data:**
+* `timestamp`: The timestamp at which the download happened
+* `country_code`: The 2-letter country code
+
+**Package Information:**
+* `project`: The name of the PyPI project (library) that is being downloaded
+* `version`: The downloaded version
+* `type`: The type of file that was downloaded (source or wheel)
+
+**Installation Environment:**
+* `installer_name`: The installer used for the download, like `pip`, `bandersnatch`, or `uv`
+* `implementation_name`: The name of the Python implementation, such as `cpython`
+* `implementation_version`: The Python version
+* `ci`: A boolean flag indicating whether the download originated from a CI system (True, False, or null). This is determined by checking for specific environment variables set by CI platforms such as Azure Pipelines (`BUILD_BUILDID`), Jenkins (`BUILD_ID`), or general CI indicators (`CI`, `PIP_IS_CI`)
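Per the field description above, the `ci` flag is derived from environment variables that CI platforms set. A minimal sketch of equivalent detection logic (illustrative only; the exact variable set and precedence used by the installer may differ):

```python
import os

# CI indicator variables listed in this README; the upstream check
# may inspect more (or fewer) variables than this sketch does.
CI_VARIABLES = ('BUILD_BUILDID', 'BUILD_ID', 'CI', 'PIP_IS_CI')


def looks_like_ci():
    """Return True if any known CI environment variable is set."""
    return any(name in os.environ for name in CI_VARIABLES)
```

On a developer machine this returns False; inside GitHub Actions or Jenkins, where `CI` or `BUILD_ID` is exported, it returns True.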
+
+**System Information:**
+* `distro_name`: Name of the Linux or Mac distribution (empty if Windows)
+* `distro_version`: Distribution version (empty for Windows)
+* `system_name`: Type of OS, like Linux, Darwin (for Mac), or Windows
+* `system_release`: OS version in the case of Windows, kernel version in the case of Unix
+* `cpu`: CPU architecture used
+
+## Aggregation Metrics
+
+If the `--add-metrics` option is passed to `pymetrics`, a spreadsheet with aggregation
+metrics will be created alongside the raw PyPI downloads CSV file for each individual project.
+
+The aggregation metrics spreadsheets contain the following tabs:
+
+* **By Month:** Number of downloads per month and increase in the number of downloads from month to month.
+* **By Version:** Absolute and relative number of downloads per version.
+* **By Country Code:** Absolute and relative number of downloads per country.
+* **By Python Version:** Absolute and relative number of downloads per minor Python version (X.Y, like 3.8).
+* **By Full Python Version:** Absolute and relative number of downloads per Python version, including
+  the patch number (X.Y.Z, like 3.8.1).
+* **By Installer Name:** Absolute and relative number of downloads per installer (e.g. pip).
+* **By Distro Name:** Absolute and relative number of downloads per distribution name (e.g. Ubuntu).
+* **By Distro Version:** Absolute and relative number of downloads per distribution name AND version (e.g. Ubuntu 20.04).
+* **By Distro Kernel:** Absolute and relative number of downloads per distribution name, version AND kernel (e.g. Ubuntu 18.04 - 5.4.104+).
+* **By OS Type:** Absolute and relative number of downloads per OS type (e.g. Linux).
+* **By Cpu:** Absolute and relative number of downloads per CPU architecture (e.g. AMD64).
+* **By CI:** Absolute and relative number of downloads by CI status (automated vs. manual installations).
+* **By Month and Version:** Absolute number of downloads per month and version.
+* **By Month and Python Version:** Absolute number of downloads per month and Python version.
+* **By Month and Country Code:** Absolute number of downloads per month and country.
+* **By Month and Installer Name:** Absolute number of downloads per month and installer.
+
 ## Known Issues
 1. The conda package download data for Anaconda does not match the download count shown on the website. This is due to missing entries in the conda package download data; see https://github.com/anaconda/anaconda-package-data/issues/45.

config.yaml

Lines changed: 0 additions & 1 deletion

@@ -1,4 +1,3 @@
-max-days: 7
 projects:
   - sdv
   - ctgan

pymetrics/__main__.py

Lines changed: 3 additions & 3 deletions

@@ -49,7 +49,7 @@ def _collect_pypi(args):
     config = _load_config(args.config_file)
     projects = args.projects or config['projects']
     output_folder = args.output_folder
-    max_days = args.max_days or config.get('max-days')
+    max_days = args.max_days
 
     collect_pypi_downloads(
         projects=projects,
@@ -175,7 +175,7 @@ def _get_parser():
         '--max-days',
         type=int,
         required=False,
-        help='Max days of data to pull if start-date is not given.',
+        help='Max days of data to pull if start-date is not given',
    )
    collect_pypi.add_argument(
        '-f',
@@ -241,7 +241,7 @@ def _get_parser():
         type=int,
         required=False,
         default=90,
-        help='Max days of data to pull.',
+        help='Max days of data to pull. Defaults to the last 90 days.',
    )
    return parser

pymetrics/drive.py

Lines changed: 1 addition & 1 deletion

@@ -97,7 +97,7 @@ def upload(content, filename, folder, convert=False):
 
     drive_file.content = content
     drive_file.Upload({'convert': convert})
-    LOGGER.info('Uploaded file %s', drive_file.metadata['alternateLink'])
+    LOGGER.info(f"Uploaded file {drive_file.metadata['alternateLink']}")
 
 
 def download(folder, filename, xlsx=False):

pymetrics/metrics.py

Lines changed: 32 additions & 33 deletions

@@ -1,17 +1,18 @@
 """Functions to compute aggregation metrics over raw downloads."""
 
 import logging
-import re
 
+import numpy as np
 import pandas as pd
+from packaging.version import InvalidVersion, Version
 
 from pymetrics.output import create_spreadsheet
 
 LOGGER = logging.getLogger(__name__)
 
 
 def _groupby(downloads, groupby, index_name=None, percent=True):
-    grouped = downloads.groupby(groupby).size().reset_index()
+    grouped = downloads.groupby(groupby, dropna=False).size().reset_index()
     grouped.columns = [index_name or groupby, 'downloads']
     if percent:
         grouped['percent'] = (grouped.downloads * 100 / grouped.downloads.sum()).round(3)
@@ -78,6 +79,7 @@ def _get_sheet_name(column):
     'distro_kernel',
     'OS_type',
     'cpu',
+    'ci',
 ]
 SORT_BY_DOWNLOADS = [
     'country_code',
@@ -104,34 +106,6 @@ def _get_sheet_name(column):
 ]
 
 
-RE_NUMERIC = re.compile(r'^\d+')
-
-
-def _version_element_order_key(version):
-    components = []
-    last_component = None
-    last_numeric = None
-    for component in version.split('.', 2):
-        if RE_NUMERIC.match(component):
-            try:
-                numeric = RE_NUMERIC.match(component).group(0)
-                components.append(int(numeric))
-                last_component = component
-                last_numeric = numeric
-            except AttributeError:
-                # From time to time this errors out in github actions
-                # while it shouldn't enter the `if`.
-                pass
-
-    components.append(last_component[len(last_numeric):])
-
-    return components
-
-
-def _version_order_key(version_column):
-    return version_column.apply(_version_element_order_key)
-
-
 def _mangle_columns(downloads):
     downloads = downloads.rename(columns=RENAME_COLUMNS)
     for col in [
@@ -153,6 +127,32 @@ def _mangle_columns(downloads):
     return downloads
 
 
+def _safe_version_parse(version_str):
+    if pd.isna(version_str):
+        return np.nan
+
+    try:
+        version = Version(str(version_str))
+    except InvalidVersion:
+        cleaned = str(version_str).rstrip('+~')
+        try:
+            version = Version(cleaned)
+        except (InvalidVersion, TypeError):
+            LOGGER.info(f'Unable to parse version: {version_str}')
+            version = np.nan
+
+    return version
+
+
+def _version_order_key(version_column):
+    return version_column.apply(_safe_version_parse)
+
+
+def _sort_by_version(data, column, ascending=False):
+    data = data.sort_values(by=column, key=_version_order_key, ascending=ascending)
+    return data
+
+
 def compute_metrics(downloads, output_path=None):
     """Compute aggregation metrics over the given downloads.
 
@@ -171,8 +171,7 @@ def compute_metrics(downloads, output_path=None):
         if column in SORT_BY_DOWNLOADS:
             sheet = sheet.sort_values('downloads', ascending=False)
         elif column in SORT_BY_VERSION:
-            sheet = sheet.sort_values(column, ascending=False, key=_version_order_key)
-
+            sheet = _sort_by_version(sheet, column=column, ascending=False)
         sheets[name] = sheet
 
     for column in HISTORICAL_COLUMNS:
@@ -181,7 +180,7 @@ def compute_metrics(downloads, output_path=None):
         sheets[name] = _historical_groupby(downloads, [column])
 
     if output_path:
-        create_spreadsheet(output_path, sheets)
+        create_spreadsheet(output_path, sheets, na_rep='<NaN>')
         return None
 
     return sheets
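The two behavior changes in this file — grouping with `dropna=False` and ordering versions with parsed `packaging.version.Version` objects instead of the old regex key — can be sketched on invented sample data:

```python
import pandas as pd
from packaging.version import Version  # third-party 'packaging' library

downloads = pd.DataFrame({'version': ['0.9.1', '0.10.0', '0.10.0', None]})

# dropna=False keeps the null group, so downloads with a missing
# version still appear as a row instead of silently disappearing.
grouped = downloads.groupby('version', dropna=False).size().reset_index()
grouped.columns = ['version', 'downloads']

# Sorting by parsed Version objects places 0.10.0 above 0.9.1,
# which a plain string sort would get wrong.
ordered = grouped.dropna().sort_values(
    'version', key=lambda col: col.apply(Version), ascending=False
)
```

Here `Version('0.10.0') > Version('0.9.1')` holds, whereas lexicographic comparison of the raw strings would invert the order.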

pymetrics/output.py

Lines changed: 5 additions & 5 deletions

@@ -34,8 +34,8 @@ def get_path(folder, filename):
     return str(pathlib.Path(folder) / filename)
 
 
-def _add_sheet(writer, data, sheet_name):
-    data.to_excel(writer, sheet_name=sheet_name, index=False, engine='xlsxwriter')
+def _add_sheet(writer, data, sheet_name, na_rep=''):
+    data.to_excel(writer, sheet_name=sheet_name, index=False, engine='xlsxwriter', na_rep=na_rep)
 
     for column in data:
         column_length = None
@@ -51,7 +51,7 @@ def _add_sheet(writer, data, sheet_name):
     )
 
 
-def create_spreadsheet(output_path, sheets):
+def create_spreadsheet(output_path, sheets, na_rep=''):
     """Create a spreadsheet with the indicated name and data.
 
     If the ``output_path`` variable starts with ``gdrive://`` it is interpreted
@@ -74,11 +74,11 @@ def create_spreadsheet(output_path, sheets):
 
     with pd.ExcelWriter(output, engine='xlsxwriter') as writer:  # pylint: disable=E0110
         for title, data in sheets.items():
-            _add_sheet(writer, data, title)
+            _add_sheet(writer, data, title, na_rep=na_rep)
 
     if drive.is_drive_path(output_path):
-        LOGGER.info('Creating file %s', output_path)
         folder, filename = drive.split_drive_path(output_path)
+        LOGGER.info(f'Creating file {filename}')
         drive.upload(output, filename, folder, convert=True)
     else:
         if not output_path.endswith('.xlsx'):
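The `na_rep` parameter threaded through here is pandas' standard missing-value marker for its writers; the commit uses it so null `ci` values render as `<NaN>` in the spreadsheet. The same option on `to_csv` shows the effect without needing an Excel writer (sample frame invented for illustration):

```python
import pandas as pd

data = pd.DataFrame({'ci': [True, None, False], 'downloads': [10, 5, 2]})

# na_rep writes missing values as an explicit marker instead of an
# empty cell, mirroring the na_rep='<NaN>' passed in this commit.
csv_text = data.to_csv(index=False, na_rep='<NaN>')
print(csv_text)
```

Without `na_rep`, the missing `ci` value would simply be an empty field, indistinguishable from an intentionally blank cell.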

pymetrics/pypi.py

Lines changed: 5 additions & 3 deletions

@@ -25,6 +25,7 @@
     details.system.name as system_name,
     details.system.release as system_release,
     details.cpu as cpu,
+    details.ci as ci,
 FROM `bigquery-public-data.pypi.file_downloads`
 WHERE file.project in {projects}
     AND timestamp > '{start_date}'
@@ -44,6 +45,7 @@
     'system_name',
     'system_release',
     'cpu',
+    'ci',
 ]
 
 
@@ -129,9 +131,9 @@ def get_pypi_downloads(
     if previous is not None:
         if isinstance(projects, str):
             projects = (projects,)
-        previous_projects = previous[previous.project.isin(projects)]
-        min_date = previous_projects.timestamp.min().date()
-        max_date = previous_projects.timestamp.max().date()
+        previous_projects = previous[previous['project'].isin(projects)]
+        min_date = previous_projects['timestamp'].min().date()
+        max_date = previous_projects['timestamp'].max().date()
     else:
         previous = pd.DataFrame(columns=OUTPUT_COLUMNS)
         min_date = None
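The switch to bracket-style access (`previous['project']`) behaves the same as the attribute form but also works when a column name shadows a DataFrame attribute or method. The date-window logic above, run on invented sample data:

```python
import pandas as pd

# Invented stand-in for previously collected download rows.
previous = pd.DataFrame({
    'project': ['sdv', 'ctgan', 'sdv'],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
})

projects = ('sdv',)
# Restrict to the requested projects, then take the covered date range.
previous_projects = previous[previous['project'].isin(projects)]
min_date = previous_projects['timestamp'].min().date()
max_date = previous_projects['timestamp'].max().date()
```

The resulting `min_date`/`max_date` pair bounds the window of data already collected, so only newer downloads need to be queried.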

pymetrics/summarize.py

Lines changed: 2 additions & 2 deletions

@@ -140,14 +140,14 @@ def get_previous_pypi_downloads(output_folder, dry_run=False):
             'system_name': pd.CategoricalDtype(),
             'system_release': pd.CategoricalDtype(),
             'cpu': pd.CategoricalDtype(),
+            'ci': pd.BooleanDtype(),
         },
     }
     if dry_run:
         read_csv_kwargs['nrows'] = 10_000
     data = load_csv(csv_path, read_csv_kwargs=read_csv_kwargs)
     LOGGER.info('Parsing version column to Version class objects')
-    if 'version' in data.columns:
-        data['version'] = data['version'].apply(parse)
+    data['version'] = data['version'].apply(parse)
     return data
