
Commit bf4ad02

gsheni and rwedge authored
Add daily workflow to collect anaconda downloads (#21)
* wip
* lint
* fix start date
* add project print
* fix print
* update message
* update to use pyarrow dtypes
* fix string
* update to ubuntu-latest-largeA
* update to ubuntu
* fix engine
* docstring
* use category dtype
* remove pyarrow
* fix ns
* lint
* use pyarrow everywhere
* remove pyarrow dtypes
* add readme instructions
* fix manual
* cleanup
* fix manual
* fix manual
* fix max_days
* fix docs
* wip
* wip
* wip
* fix workflow
* fix workflow
* add message to workflow
* cleanup
* fix repo
* fix slack msg
* fix slack msg
* use extensions
* summarize fix
* use uv
* fix uv
* use cache
* change token
* add unit tests
* add unit workflow
* add dry-run
* remove unused arg
* fix dry run
* use uv in lint
* add date
* cleanup readme
* Rename daily_collect.yaml to daily_collection.yaml
* Update daily_collection.yaml
* Update daily_summarize.yaml
* Update dryrun.yaml
* Update lint.yaml
* Update manual.yaml
* Update unit.yaml
* wip
* Address feedback 2
* add version parse
* use object dtype
* fix local write
* lint
* Update daily_summarize.yaml
* cleanup
* wip
* exclude pre-releases
* wip
* cleanup
* cleanup
* cleanup
* cleanup
* update workflow
* rename workflow
* fix unit tests
* define cache break
* force reinstall
* remove force install
* lint
* fix summarize config
* cleanup
* fix dry run
* Update dryrun.yaml
* Update dryrun.yaml
* fix dry run
* fix dry run
* fix dry run
* fix write
* remove breakpoint
* Update download_analytics/time_utils.py (Co-authored-by: Roy Wedge <[email protected]>)
* fix tz
* fix based on feedback
* fix workflow

---------

Co-authored-by: Roy Wedge <[email protected]>
1 parent 06366fe commit bf4ad02

File tree

15 files changed: +476 −80 lines changed


.github/workflows/daily_collection.yaml

Lines changed: 26 additions & 6 deletions
```diff
@@ -7,35 +7,55 @@ on:
         description: Slack channel to post the error message to if the builds fail.
         required: false
         default: "sdv-alerts-debug"
+      max_days_pypi:
+        description: 'Maximum number of days to collect, starting from today for PyPI.'
+        required: false
+        type: number
+        default: 30
+      max_days_anaconda:
+        description: 'Maximum number of days to collect, starting from today for Anaconda'
+        required: false
+        type: number
+        default: 90
   schedule:
     - cron: '0 0 * * *'
 
 jobs:
   collect:
     runs-on: ubuntu-latest-large
-    timeout-minutes: 20
+    timeout-minutes: 25
     steps:
       - uses: actions/checkout@v4
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip
-          uv pip install -e .
-      - name: Collect Downloads Data
+          uv pip install .
+      - name: Collect PyPI Downloads
         run: |
-          uv run download-analytics collect \
+          uv run download-analytics collect-pypi \
             --verbose \
-            --max-days 30 \
+            --max-days ${{ inputs.max_days_pypi || 30 }} \
             --add-metrics \
             --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
-
+      - name: Collect Anaconda Downloads
+        run: |
+          uv run download-analytics collect-anaconda \
+            --output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
+            --max-days ${{ inputs.max_days_anaconda || 90 }} \
+            --verbose
+        env:
+          PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
   alert:
     needs: [collect]
     runs-on: ubuntu-latest
```

.github/workflows/daily_summarize.yaml

Lines changed: 3 additions & 0 deletions
```diff
@@ -21,6 +21,9 @@ jobs:
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip
```

.github/workflows/dryrun.yaml

Lines changed: 17 additions & 4 deletions
```diff
@@ -10,21 +10,25 @@ concurrency:
   cancel-in-progress: true
 jobs:
   dry_run:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-large
+    timeout-minutes: 25
     steps:
       - uses: actions/checkout@v4
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: true
           activate-environment: true
+          cache-dependency-glob: |
+            **/pyproject.toml
+            **/__main__.py
       - name: Install pip and dependencies
         run: |
           uv pip install -U pip
           uv pip install .
-      - name: Collect Downloads Data - Dry Run
+      - name: Collect PyPI Downloads - Dry Run
         run: |
-          uv run download-analytics collect \
+          uv run download-analytics collect-pypi \
             --verbose \
             --max-days 30 \
             --add-metrics \
@@ -33,7 +37,16 @@ jobs:
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           BIGQUERY_CREDENTIALS: ${{ secrets.BIGQUERY_CREDENTIALS }}
-      - name: Run Summarize - Dry Run
+      - name: Collect Anaconda Downloads - Dry Run
+        run: |
+          uv run download-analytics collect-anaconda \
+            --output-folder gdrive://1UnDYovLkL4gletOF5328BG1X59mSHF-Z \
+            --max-days 90 \
+            --verbose \
+            --dry-run
+        env:
+          PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+      - name: Summarize - Dry Run
         run: |
           uv run download-analytics summarize \
             --verbose \
```

.github/workflows/manual.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -36,7 +36,7 @@ jobs:
           uv pip install .
       - name: Collect Downloads Data
         run: |
-          uv run download-analytics collect \
+          uv run download-analytics collect-pypi \
             --verbose \
             --projects ${{ github.event.inputs.projects }} \
             ${{ github.event.inputs.max_days && '--max-days ' || '' }} \
```

README.md

Lines changed: 11 additions & 3 deletions
```diff
@@ -10,16 +10,24 @@ engagement metrics.
 
 ### Data sources
 
-Currently the download data is coming from the following distributions:
+Currently the download data is collected from the following distributions:
 
 * [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
   obtained from the public Big Query dataset, equivalent to the information shown on
   [pepy.tech](https://pepy.tech).
+* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
+  `conda-forge` channel on `conda`.
+  - The conda package download data provided by Anaconda. It includes package download counts
+    starting from January 2017. More information:
+    - https://github.com/anaconda/anaconda-package-data
+  - The conda package metadata data provided by Anaconda. There is a public API which allows for
+    the retrieval of package information, including current number of downloads.
+    - https://api.anaconda.org/package/{username}/{package_name}
+    - Replace {username} with the Anaconda username (`conda-forge`) and {package_name} with
+      the specific package name (`sdv`).
 
 In the future, we may also expand the source distributions to include:
 
-* [conda-forge](https://conda-forge.org/): Information about the project downloads from the
-  `conda-forge` channel on `conda`.
 * [github](https://github.com/): Information about the project downloads from github releases.
 
 For more information about how to configure and use the software, or about the data that is being
```
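The metadata endpoint described in the README change above can be queried directly. A minimal sketch, assuming the JSON payload lists per-file `ndownloads` counters under a `files` key (the field names and the `total_downloads` helper are illustrative and should be verified against the live API; they are not part of this project):

```python
import json
from urllib.request import urlopen

API_URL = 'https://api.anaconda.org/package/{username}/{package_name}'


def package_url(package_name, username='conda-forge'):
    """Build the metadata URL for a package on the given Anaconda channel."""
    return API_URL.format(username=username, package_name=package_name)


def total_downloads(package_info):
    """Sum per-file download counts from a package-info payload.

    Assumes each entry under 'files' carries an 'ndownloads' counter;
    missing keys are treated as zero.
    """
    return sum(f.get('ndownloads', 0) for f in package_info.get('files', []))


if __name__ == '__main__':
    # Network call: fetch current metadata for the `sdv` package.
    with urlopen(package_url('sdv')) as response:
        info = json.load(response)
    print(package_url('sdv'), '->', total_downloads(info), 'downloads')
```

This mirrors the README note: substitute the channel name for `{username}` and the package name for `{package_name}` to retrieve current download counts.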

docs/DEVELOPMENT.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -33,14 +33,14 @@ For development, run `make install-develop` instead.
 ## Command Line Interface
 
 After the installation, a new `download-analytics` command will have been registered inside your
-`virtualenv`. This command can be used in conjunction with the `collect` action to collect
+`virtualenv`. This command can be used in conjunction with the `collect-pypi` action to collect
 downloads data from BigQuery and store the output locally or in Google Drive.
 
 Here is the entire list of arguments that the command line has:
 
 ```bash
-$ download-analytics collect --help
-usage: download-analytics collect [-h] [-v] [-l LOGFILE] [-o OUTPUT_FOLDER] [-a AUTHENTICATION_CREDENTIALS]
+$ download-analytics collect-pypi --help
+usage: download-analytics collect-pypi [-h] [-v] [-l LOGFILE] [-o OUTPUT_FOLDER] [-a AUTHENTICATION_CREDENTIALS]
                                   [-c CONFIG_FILE] [-p [PROJECTS [PROJECTS ...]]] [-s START_DATE]
                                   [-m MAX_DAYS] [-d] [-f] [-M]
 
@@ -73,7 +73,7 @@ and store the downloads data into a Google Drive folder alongside the correspond
 metric spreadsheets would look like this:
 
 ```bash
-$ download-analytics collect --verbose --projects sdv ctgan --start-date 2021-01-01 \
+$ download-analytics collect-pypi --verbose --projects sdv ctgan --start-date 2021-01-01 \
     --add-metrics --output-folder gdrive://10QHbqyvptmZX4yhu2Y38YJbVHqINRr0n
 ```
````

docs/SETUP.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -31,10 +31,10 @@ if contains the application KEY which should never be made public.
 
 Once the file is created, you can follow these steps:
 
-1. Run the `download-analytics collect` command. If the `settings.yaml` file has been properly
+1. Run the `download-analytics collect-pypi` command. If the `settings.yaml` file has been properly
    created, this will **open a new tab on your web browser**, where you need to authenticate.
 
-| ![pydrive-collect](imgs/pydrive-collect.png "Run the `download-analytics collect` Command") |
+| ![pydrive-collect](imgs/pydrive-collect.png "Run the `download-analytics collect-pypi` Command") |
 | - |
 
 2. Click on the Google account which you which to authenticate with. Notice that the account that
@@ -67,7 +67,7 @@ be provided to you by a privileged admin.
 Once you have this JSON file, you have two options:
 
 1. Pass the path to the authentication file with the `-a` or `--authentication-credentials`
-   argument to the `download-analytics collect` command.
+   argument to the `download-analytics collect-pypi` command.
 
 | ![bigquery-a](imgs/bigquery-a.png "Pass the credentials on command line") |
 | - |
```

download_analytics/__main__.py

Lines changed: 63 additions & 23 deletions
```diff
@@ -9,6 +9,7 @@
 
 import yaml
 
+from download_analytics.anaconda import collect_anaconda_downloads
 from download_analytics.main import collect_downloads
 from download_analytics.summarize import summarize_downloads
 
@@ -44,7 +45,7 @@ def _load_config(config_path):
     return config
 
 
-def _collect(args):
+def _collect_pypi(args):
     config = _load_config(args.config_file)
     projects = args.projects or config['projects']
     output_folder = args.output_folder or config.get('output-folder', '.')
@@ -62,6 +63,19 @@ def _collect(args):
     )
 
 
+def _collect_anaconda(args):
+    config = _load_config(args.config_file)
+    projects = config['projects']
+    output_folder = args.output_folder or config.get('output-folder', '.')
+    collect_anaconda_downloads(
+        projects=projects,
+        output_folder=output_folder,
+        max_days=args.max_days,
+        dry_run=args.dry_run,
+        verbose=args.verbose,
+    )
+
+
 def _summarize(args):
     config = _load_config(args.config_file)
     projects = config['projects']
@@ -98,7 +112,12 @@ def _get_parser():
     logging_args.add_argument(
         '-l', '--logfile', help='If given, file where the logs will be written.'
     )
-
+    logging_args.add_argument(
+        '-d',
+        '--dry-run',
+        action='store_true',
+        help='Do not upload the results. Just calculate them.',
+    )
     parser = argparse.ArgumentParser(
         prog='download-analytics',
         description='Download Analytics Command Line Interface',
@@ -109,10 +128,12 @@ def _get_parser():
     action.required = True
 
     # collect
-    collect = action.add_parser('collect', help='Collect downloads data.', parents=[logging_args])
-    collect.set_defaults(action=_collect)
+    collect_pypi = action.add_parser(
+        'collect-pypi', help='Collect download data from PyPi.', parents=[logging_args]
+    )
+    collect_pypi.set_defaults(action=_collect_pypi)
 
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-o',
         '--output-folder',
         type=str,
@@ -122,54 +143,48 @@ def _get_parser():
             ' Google Drive folder path in the format gdrive://<folder-id>'
         ),
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-a',
         '--authentication-credentials',
         type=str,
         required=False,
         help='Path to the GCP (BigQuery) credentials file to use.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-c',
         '--config-file',
         type=str,
         default='config.yaml',
         help='Path to the configuration file.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-p',
         '--projects',
         nargs='*',
         help='List of projects to collect. If not given use the configured ones.',
         default=None,
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-s',
         '--start-date',
         type=_valid_date,
         required=False,
         help='Date from which to start pulling data.',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-m',
         '--max-days',
         type=int,
         required=False,
         help='Max days of data to pull if start-date is not given.',
     )
-    collect.add_argument(
-        '-d',
-        '--dry-run',
-        action='store_true',
-        help='Do not run the actual query, only simulate it.',
-    )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-f',
         '--force',
         action='store_true',
         help='Force the download even if the data already exists or there is a gap',
     )
-    collect.add_argument(
+    collect_pypi.add_argument(
         '-M',
         '--add-metrics',
         action='store_true',
@@ -205,11 +220,36 @@ def _get_parser():
             ' Google Drive folder path in the format gdrive://<folder-id>'
         ),
     )
-    summarize.add_argument(
-        '-d',
-        '--dry-run',
-        action='store_true',
-        help='Do not upload the summary results. Just calculate them.',
+
+    # collect
+    collect_anaconda = action.add_parser(
+        'collect-anaconda', help='Collect download data from Anaconda.', parents=[logging_args]
+    )
+    collect_anaconda.set_defaults(action=_collect_anaconda)
+    collect_anaconda.add_argument(
+        '-c',
+        '--config-file',
+        type=str,
+        default='config.yaml',
+        help='Path to the configuration file.',
+    )
+    collect_anaconda.add_argument(
+        '-o',
+        '--output-folder',
+        type=str,
+        required=False,
+        help=(
+            'Path to the folder where data will be outputted. It can be a local path or a'
+            ' Google Drive folder path in the format gdrive://<folder-id>'
+        ),
+    )
+    collect_anaconda.add_argument(
+        '-m',
+        '--max-days',
+        type=int,
+        required=False,
+        default=90,
+        help='Max days of data to pull.',
     )
     return parser
```
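The parser changes in this file follow a standard argparse pattern: shared flags (including the relocated `--dry-run`) live on a parent parser, and each subcommand inherits them via `parents=[...]`. A minimal standalone sketch of that pattern, not the project's actual parser:

```python
import argparse


def build_parser():
    # Shared flags live on a parent parser; add_help=False avoids a
    # clash with each subparser's own -h/--help.
    logging_args = argparse.ArgumentParser(add_help=False)
    logging_args.add_argument('-v', '--verbose', action='store_true')
    logging_args.add_argument(
        '-d', '--dry-run', action='store_true',
        help='Do not upload the results. Just calculate them.',
    )

    parser = argparse.ArgumentParser(prog='download-analytics')
    action = parser.add_subparsers(dest='command', required=True)

    # Every subcommand inherits the shared flags via parents=[...],
    # which is why moving --dry-run onto the shared parent exposes it
    # to collect-pypi, collect-anaconda and summarize alike.
    for name in ('collect-pypi', 'collect-anaconda', 'summarize'):
        sub = action.add_parser(name, parents=[logging_args])
        sub.add_argument('-m', '--max-days', type=int, default=90)

    return parser


args = build_parser().parse_args(['collect-anaconda', '--dry-run', '-m', '30'])
print(args.command, args.dry_run, args.max_days)  # collect-anaconda True 30
```

Moving `--dry-run` from the individual `collect`/`summarize` subparsers onto the shared parent is what lets the new `collect-anaconda` subcommand pick it up without redeclaring it.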

0 commit comments
