
Commit 90f4be5

gshenir and rwedge authored

Add daily workflow to collect anaconda downloads (#21)

* wip
* lint
* fix start date
* add project print
* fix print
* update message
* update to use pyarrow dtypes
* fix string
* update to ubuntu-latest-largeA
* update to ubuntu
* fix engine
* docstring
* use category dtype
* remove pyarrow
* fix ns
* lint
* use pyarrow everywhere
* remove pyarrow dtypes
* add readme instructions
* fix manual
* cleanup
* fix manual
* fix manual
* fix max_days
* fix docs
* wip
* wip
* wip
* fix workflow
* fix workflow
* add message to workflow
* cleanup
* fix repo
* fix slack msg
* fix slack msg
* use extensions
* summarize fix
* use uv
* fix uv
* use cache
* change token
* add unit tests
* add unit workflow
* add dry-run
* remove unused arg
* fix dry run
* use uv in lint
* add date
* cleanup readme
* Rename daily_collect.yaml to daily_collection.yaml
* Update daily_collection.yaml
* Update daily_summarize.yaml
* Update dryrun.yaml
* Update lint.yaml
* Update manual.yaml
* Update unit.yaml
* wip
* Address feedback 2
* add version parse
* use object dtype
* fix local write
* lint
* Update daily_summarize.yaml
* cleanup
* wip
* exclude pre-releases
* wip
* cleanup
* cleanup
* cleanup
* cleanup
* update workflow
* rename workflow
* fix unit tests
* define cache break
* force reinstall
* remove force install
* lint
* fix summarize config
* cleanup
* fix dry run
* Update dryrun.yaml
* Update dryrun.yaml
* fix dry run
* fix dry run
* fix dry run
* fix write
* remove breakpoint
* Update download_analytics/time_utils.py (Co-authored-by: Roy Wedge <[email protected]>)
* fix tz
* fix based on feedback
* fix workflow

---------

Co-authored-by: Roy Wedge <[email protected]>

1 parent 6f6dd33 · commit 90f4be5

File tree: 12 files changed, +468 −72 lines


.github/workflows/daily_collection.yaml

Lines changed: 26 additions & 6 deletions
@@ -7,35 +7,55 @@ on:
       description: Slack channel to post the error message to if the builds fail.
       required: false
       default: "sdv-alerts-debug"
+    max_days_pypi:
+      description: 'Maximum number of days to collect, starting from today for PyPI.'
+      required: false
+      type: number
+      default: 30
+    max_days_anaconda:
+      description: 'Maximum number of days to collect, starting from today for Anaconda'
+      required: false
+      type: number
+      default: 90
   schedule:
     - cron: '0 0 * * *'
 
 jobs:
   collect:
     runs-on: ubuntu-latest-large
-    timeout-minutes: 20
+    timeout-minutes: 25
     steps:
       - uses: actions/checkout@v4
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip
-          uv pip install -e .
-      - name: Collect Downloads Data
+          uv pip install .
+      - name: Collect PyPI Downloads
         run: |
-          uv run download-analytics collect \
+          uv run download-analytics collect-pypi \
             --verbose \
-            --max-days 30 \
+            --max-days ${{ inputs.max_days_pypi || 30 }} \
             --add-metrics \
             --output-folder gdrive://***REMOVED***
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
-
+      - name: Collect Anaconda Downloads
+        run: |
+          uv run download-analytics collect-anaconda \
+            --output-folder gdrive://***REMOVED***-Z \
+            --max-days ${{ inputs.max_days_anaconda || 90 }} \
+            --verbose
+        env:
+          PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
   alert:
     needs: [collect]
     runs-on: ubuntu-latest
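The `${{ inputs.max_days_pypi || 30 }}` expression is a common fallback pattern: on manual `workflow_dispatch` runs the input carries a value, but on `schedule`-triggered runs the `inputs` context is empty, so the `||` operator substitutes the hard-coded default. A minimal, hypothetical workflow illustrating just this pattern (names are invented here):

```yaml
on:
  workflow_dispatch:
    inputs:
      max_days:
        description: 'Maximum number of days to collect.'
        required: false
        type: number
        default: 30
  schedule:
    - cron: '0 0 * * *'

jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      # On scheduled runs inputs.max_days is empty, so this echoes 30.
      - run: echo "max-days=${{ inputs.max_days || 30 }}"
```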

.github/workflows/daily_summarize.yaml

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,9 @@ jobs:
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip

.github/workflows/dryrun.yaml

Lines changed: 17 additions & 4 deletions
@@ -10,21 +10,25 @@ concurrency:
   cancel-in-progress: true
 jobs:
   dry_run:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-large
+    timeout-minutes: 25
     steps:
       - uses: actions/checkout@v4
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip
           uv pip install .
-      - name: Collect Downloads Data - Dry Run
+      - name: Collect PyPI Downloads - Dry Run
         run: |
-          uv run download-analytics collect \
+          uv run download-analytics collect-pypi \
             --verbose \
             --max-days 30 \
             --add-metrics \
@@ -33,7 +37,16 @@ jobs:
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
-      - name: Run Summarize - Dry Run
+      - name: Collect Anaconda Downloads - Dry Run
+        run: |
+          uv run download-analytics collect-anaconda \
+            --output-folder gdrive://***REMOVED***-Z \
+            --max-days 90 \
+            --verbose \
+            --dry-run
+        env:
+          PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+      - name: Summarize - Dry Run
         run: |
           uv run download-analytics summarize \
             --verbose \

README.md

Lines changed: 11 additions & 3 deletions
@@ -10,16 +10,24 @@ engagement metrics.
 
 ### Data sources
 
-Currently the download data is coming from the following distributions:
+Currently the download data is collected from the following distributions:
 
 * [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
   obtained from the public Big Query dataset, equivalent to the information shown on
   [pepy.tech](https://pepy.tech).
+* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
+  `conda-forge` channel on `conda`.
+  - The conda package download data provided by Anaconda. It includes package download counts
+    starting from January 2017. More information:
+    - https://github.com/anaconda/anaconda-package-data
+  - The conda package metadata provided by Anaconda. There is a public API which allows for
+    the retrieval of package information, including current number of downloads.
+    - https://api.anaconda.org/package/{username}/{package_name}
+    - Replace {username} with the Anaconda username (`conda-forge`) and {package_name} with
+      the specific package name (`sdv`).
 
 In the future, we may also expand the source distributions to include:
 
-* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
-  `conda-forge` channel on `conda`.
 * [github](https://github.com/): Information about the project downloads from github releases.
 
 For more information about how to configure and use the software, or about the data that is being
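The Anaconda.org endpoint described in the README can be queried directly. The sketch below is illustrative only: the helper names are invented here, and the response fields it reads (`files`, `ndownloads`) are assumptions about the payload shape, not something this project documents.

```python
# Hypothetical helpers for the public Anaconda.org package API described above.
import json
from urllib.request import urlopen

API_TEMPLATE = 'https://api.anaconda.org/package/{username}/{package_name}'


def package_url(username, package_name):
    """Build the API URL, e.g. for the `sdv` package on the `conda-forge` channel."""
    return API_TEMPLATE.format(username=username, package_name=package_name)


def total_downloads(package_info):
    """Sum per-file download counts (assumes a 'files' list with 'ndownloads' entries)."""
    return sum(entry.get('ndownloads', 0) for entry in package_info.get('files', []))


def fetch_total_downloads(username, package_name):
    """Fetch one package's payload and return its current download total."""
    with urlopen(package_url(username, package_name)) as response:
        return total_downloads(json.load(response))
```

For example, `fetch_total_downloads('conda-forge', 'sdv')` would return a single current total, unlike the anaconda-package-data dataset, which provides per-day counts.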

download_analytics/__main__.py

Lines changed: 63 additions & 23 deletions
@@ -9,6 +9,7 @@
 
 import yaml
 
+from download_analytics.anaconda import collect_anaconda_downloads
 from download_analytics.main import collect_downloads
 from download_analytics.summarize import summarize_downloads
 
@@ -44,7 +45,7 @@ def _load_config(config_path):
     return config
 
 
-def _collect(args):
+def _collect_pypi(args):
     config = _load_config(args.config_file)
     projects = args.projects or config['projects']
     output_folder = args.output_folder or config.get('output-folder', '.')
@@ -62,6 +63,19 @@ def _collect(args):
     )
 
 
+def _collect_anaconda(args):
+    config = _load_config(args.config_file)
+    projects = config['projects']
+    output_folder = args.output_folder or config.get('output-folder', '.')
+    collect_anaconda_downloads(
+        projects=projects,
+        output_folder=output_folder,
+        max_days=args.max_days,
+        dry_run=args.dry_run,
+        verbose=args.verbose,
+    )
+
+
 def _summarize(args):
     config = _load_config(args.config_file)
     projects = config['projects']
@@ -98,7 +112,12 @@ def _get_parser():
     logging_args.add_argument(
         '-l', '--logfile', help='If given, file where the logs will be written.'
     )
-
+    logging_args.add_argument(
+        '-d',
+        '--dry-run',
+        action='store_true',
+        help='Do not upload the results. Just calculate them.',
+    )
     parser = argparse.ArgumentParser(
         prog='download-analytics',
         description='Download Analytics Command Line Interface',
@@ -109,10 +128,12 @@ def _get_parser():
         action.required = True
 
     # collect
-    collect = action.add_parser('collect', help='Collect downloads data.', parents=[logging_args])
-    collect.set_defaults(action=_collect)
+    collect_pypi = action.add_parser(
+        'collect-pypi', help='Collect download data from PyPi.', parents=[logging_args]
+    )
+    collect_pypi.set_defaults(action=_collect_pypi)
 
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-o',
         '--output-folder',
         type=str,
@@ -122,54 +143,48 @@ def _get_parser():
             ' Google Drive folder path in the format gdrive://<folder-id>'
         ),
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-a',
         '--authentication-credentials',
         type=str,
         required=False,
         help='Path to the GCP (BigQuery) credentials file to use.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-c',
         '--config-file',
         type=str,
         default='config.yaml',
         help='Path to the configuration file.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-p',
         '--projects',
         nargs='*',
         help='List of projects to collect. If not given use the configured ones.',
         default=None,
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-s',
         '--start-date',
         type=_valid_date,
         required=False,
         help='Date from which to start pulling data.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-m',
         '--max-days',
         type=int,
         required=False,
         help='Max days of data to pull if start-date is not given.',
     )
-    collect.add_argument(
-        '-d',
-        '--dry-run',
-        action='store_true',
-        help='Do not run the actual query, only simulate it.',
-    )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-f',
         '--force',
         action='store_true',
         help='Force the download even if the data already exists or there is a gap',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-M',
         '--add-metrics',
         action='store_true',
@@ -205,11 +220,36 @@ def _get_parser():
             ' Google Drive folder path in the format gdrive://<folder-id>'
         ),
     )
-    summarize.add_argument(
-        '-d',
-        '--dry-run',
-        action='store_true',
-        help='Do not upload the summary results. Just calculate them.',
+
+    # collect
+    collect_anaconda = action.add_parser(
+        'collect-anaconda', help='Collect download data from Anaconda.', parents=[logging_args]
+    )
+    collect_anaconda.set_defaults(action=_collect_anaconda)
+    collect_anaconda.add_argument(
+        '-c',
+        '--config-file',
+        type=str,
+        default='config.yaml',
+        help='Path to the configuration file.',
+    )
+    collect_anaconda.add_argument(
+        '-o',
+        '--output-folder',
+        type=str,
+        required=False,
+        help=(
+            'Path to the folder where data will be outputted. It can be a local path or a'
+            ' Google Drive folder path in the format gdrive://<folder-id>'
+        ),
+    )
+    collect_anaconda.add_argument(
+        '-m',
+        '--max-days',
+        type=int,
+        required=False,
+        default=90,
+        help='Max days of data to pull.',
     )
     return parser
