
Commit 0177847

Add gretel-client and mostlyai-mock (#19)
* wip
* lint
* fix start date
* add project print
* fix print
* update message
* update to use pyarrow dtypes
* fix string
* update to ubuntu-latest-largeA
* update to ubuntu
* fix engine
* docstring
* use category dtype
* remove pyarrow
* fix ns
* lint
* use pyarrow everywhere
* remove pyarrow dtypes
* add readme instructions
* fix manual
* cleanup
* fix manual
* fix manual
* fix max_days
* fix docs
1 parent 2188976 commit 0177847

File tree

13 files changed: +125 -35 lines

.github/workflows/daily.yaml

Lines changed: 2 additions & 1 deletion
@@ -12,7 +12,8 @@ on:

 jobs:
   collect:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-large
+    timeout-minutes: 30
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}

.github/workflows/dryrun.yaml

Lines changed: 7 additions & 4 deletions
@@ -1,16 +1,19 @@
 name: Health-check Dry Run
-
 on:
   workflow_dispatch:
     inputs:
       slack_channel:
         description: Slack channel to post the error message to if the builds fail.
         required: false
         default: "sdv-alerts-debug"
-
-  push:
   pull_request:
-
+    types:
+      - opened
+      - synchronize
+      - ready_for_review
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
 jobs:
   dry_run:
     runs-on: ubuntu-latest

.github/workflows/lint.yaml

Lines changed: 7 additions & 4 deletions
@@ -1,10 +1,13 @@
 name: Style Checks
-
 on:
-  push:
   pull_request:
-    types: [opened, reopened]
-
+    types:
+      - opened
+      - synchronize
+      - ready_for_review
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
 jobs:
   lint:
     runs-on: ubuntu-latest

.github/workflows/manual.yaml

Lines changed: 3 additions & 2 deletions
@@ -22,7 +22,7 @@ on:

 jobs:
   collect:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest-large
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
@@ -38,7 +38,8 @@ jobs:
           download-analytics collect \
             --verbose \
             --projects ${{ github.event.inputs.projects }} \
-            --max-days ${{ github.event.inputs.max_days }} \
+            ${{ github.event.inputs.max_days && '--max-days ' || '' }} \
+            ${{ github.event.inputs.max_days && github.event.inputs.max_days || '' }} \
             --output-folder gdrive://${{ github.event.inputs.output_folder }} \
             ${{ github.event.inputs.extras }}
         env:
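The paired `${{ ... && ... || ... }}` lines exist because GitHub Actions expressions have no ternary operator; the `&&`/`||` short-circuit idiom stands in for one, so the `--max-days` flag is emitted only when the `max_days` input is non-empty. A minimal Python sketch of the intended behavior (the helper name is hypothetical):

```python
# Hypothetical helper mimicking the two workflow expressions above.
# GitHub Actions has no ternary operator, so `cond && a || b` is used
# the way `cond and a or b` works in Python.
def render_max_days_args(max_days: str) -> list[str]:
    if max_days:  # input provided: emit the flag and its value
        return ['--max-days', max_days]
    return []  # input left empty: omit the flag, so all data is collected


assert render_max_days_args('7') == ['--max-days', '7']
assert render_max_days_args('') == []
```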

README.md

Lines changed: 12 additions & 0 deletions
@@ -25,6 +25,18 @@ In the future, these sources may also be added:
 For more information about how to configure and use the software, or about the data that is being
 collected check the resources below.

+### Add new libraries
+In order to add new libraries, it is important to follow these steps to ensure that data is backfilled.
+1. Update `config.yaml` with the new libraries (PyPI project names only for now).
+2. Run the [Manual collection workflow](https://github.com/datacebo/download-analytics/actions/workflows/manual.yaml) on your branch.
+    - Use the workflow from **your branch name**.
+    - List all project names from `config.yaml`.
+    - Remove `7` from max days to indicate that you want all available data.
+    - Pass any extra arguments (for example `--dry-run` to test your changes).
+3. Let the workflow finish and check that `pypi.csv` contains the right data.
+4. Get your pull request reviewed and merged into `main`. The daily collection workflow will fill in the data for the last 30 days and all future days.
+    - Note: The collection script looks at timestamps and avoids adding overlapping data.
+
 ## Resources

 | | Document | Description |
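The note in step 4 is the key to safe backfills: because the collection script compares timestamps, re-running collection over an overlapping window does not duplicate rows. A minimal pandas sketch of that idea (the `timestamp` column name is assumed for illustration; the real logic lives in the collection code):

```python
import pandas as pd

# Keep only rows newer than the latest timestamp already stored, so daily
# runs and manual backfills never insert overlapping data.
previous = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02'])})
incoming = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-02', '2024-01-03'])})

cutoff = previous['timestamp'].max()
new_rows = incoming[incoming['timestamp'] > cutoff]
combined = pd.concat([previous, new_rows], ignore_index=True)
print(combined['timestamp'].tolist())  # no duplicated 2024-01-02
```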

config.yaml

Lines changed: 7 additions & 5 deletions
@@ -8,14 +8,16 @@ projects:
   - deepecho
   - sdmetrics
   - sdgym
-  - gretel-synthetics
-  - ydata-synthetic
   - synthesized
   - datomize
-  - gretel-trainer
-  - ydata-sdk
-  - mostlyai
   - synthcity
   - smartnoise-synth
   - realtabformer
   - be-great
+  - ydata-synthetic
+  - ydata-sdk
+  - gretel-synthetics
+  - gretel-trainer
+  - gretel-client
+  - mostlyai
+  - mostlyai-mock
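For reference, the projects listed above sit under a top-level `projects:` key (visible in the hunk header), so reading them back is a one-liner. A minimal sketch, assuming PyYAML is available:

```python
import yaml  # PyYAML, assumed to be installed

# Read the `projects` list from the file shown above.
with open('config.yaml') as f:
    config = yaml.safe_load(f)

print(config['projects'])  # ends with 'gretel-client', 'mostlyai', 'mostlyai-mock'
```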

download_analytics/__main__.py

Lines changed: 1 addition & 0 deletions
@@ -124,6 +124,7 @@ def _get_parser():
         '--projects',
         nargs='*',
         help='List of projects to collect. If not given use the configured ones.',
+        default=None,
     )
     collect.add_argument(
         '-s',
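The explicit `default=None` makes the documented fallback unambiguous: with `nargs='*'`, an omitted `--projects` flag yields `None` (fall back to the configured projects), while a flag given with no values yields an empty list. A small standalone illustration:

```python
import argparse

# With nargs='*', three cases stay distinguishable once default=None is set.
parser = argparse.ArgumentParser()
parser.add_argument('--projects', nargs='*', default=None)

print(parser.parse_args([]).projects)                     # None -> use configured projects
print(parser.parse_args(['--projects']).projects)         # []   -> explicitly empty
print(parser.parse_args(['--projects', 'sdv']).projects)  # ['sdv']
```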

download_analytics/bq.py

Lines changed: 38 additions & 7 deletions
@@ -24,11 +24,19 @@ def _get_bq_client(credentials_file):

     LOGGER.info('Loading BigQuery credentials from BIGQUERY_CREDENTIALS envvar')

-    service_account_info = json.loads(credentials_contents)
-    credentials = service_account.Credentials.from_service_account_info(
-        service_account_info,
-        scopes=['https://www.googleapis.com/auth/cloud-platform'],
-    )
+    if os.path.exists(credentials_contents):
+        LOGGER.info('Loading BigQuery credentials from service account file')
+        credentials = service_account.Credentials.from_service_account_file(
+            credentials_contents,
+            scopes=['https://www.googleapis.com/auth/cloud-platform'],
+        )
+    else:
+        LOGGER.info('Loading BigQuery credentials from service account info')
+        service_account_info = json.loads(credentials_contents)
+        credentials = service_account.Credentials.from_service_account_info(
+            service_account_info,
+            scopes=['https://www.googleapis.com/auth/cloud-platform'],
+        )

     return bigquery.Client(
         credentials=credentials,
@@ -44,7 +52,14 @@ def run_query(query, dry_run=False, credentials_file=None):

     job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
     dry_run_job = client.query(query, job_config=job_config)
-    LOGGER.info('Estimated processed GBs: %.2f', dry_run_job.total_bytes_processed / 1024**3)
+    data_processed_gbs = dry_run_job.total_bytes_processed / 1024**3
+    LOGGER.info('Estimated data processed in query (GBs): %.2f', data_processed_gbs)
+    # https://cloud.google.com/bigquery/pricing#on_demand_pricing
+    # assuming the 1 TB of free monthly processing has already been used
+    cost_per_terabyte = 6.15
+    bytes = dry_run_job.total_bytes_processed
+    cost = cost_per_terabyte * bytes_to_terabytes(bytes)
+    LOGGER.info('Estimated cost for query: $%.2f', cost)

     if dry_run:
         return None
@@ -53,5 +68,21 @@
     data = query_job.to_dataframe()
     LOGGER.info('Total processed GBs: %.2f', query_job.total_bytes_processed / 1024**3)
     LOGGER.info('Total billed GBs: %.2f', query_job.total_bytes_billed / 1024**3)
-
+    cost = cost_per_terabyte * bytes_to_terabytes(query_job.total_bytes_billed)
+    LOGGER.info('Total cost for query: $%.2f', cost)
     return data
+
+
+def bytes_to_megabytes(bytes):
+    """Convert bytes to megabytes."""
+    return bytes / 1024 / 1024
+
+
+def bytes_to_gigabytes(bytes):
+    """Convert bytes to gigabytes."""
+    return bytes_to_megabytes(bytes) / 1024
+
+
+def bytes_to_terabytes(bytes):
+    """Convert bytes to terabytes."""
+    return bytes_to_gigabytes(bytes) / 1024
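The added cost estimate is straight unit conversion against the $6.15 per terabyte on-demand rate cited in the linked pricing page. A self-contained sketch of the same arithmetic, with a made-up 250 GB query:

```python
# Same arithmetic as the logging added above, using the new helpers.
def bytes_to_megabytes(bytes):
    return bytes / 1024 / 1024


def bytes_to_gigabytes(bytes):
    return bytes_to_megabytes(bytes) / 1024


def bytes_to_terabytes(bytes):
    return bytes_to_gigabytes(bytes) / 1024


cost_per_terabyte = 6.15  # on-demand rate from the pricing page linked above
total_bytes_processed = 250 * 1024**3  # made-up dry-run figure: 250 GB
cost = cost_per_terabyte * bytes_to_terabytes(total_bytes_processed)
print(f'Estimated cost for query: ${cost:.2f}')  # 250/1024 TB * $6.15 ~= $1.50
```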

download_analytics/main.py

Lines changed: 8 additions & 2 deletions
@@ -47,8 +47,10 @@ def collect_downloads(
     if not projects:
         raise ValueError('No projects have been passed')

+    LOGGER.info(f'Collecting downloads for projects={projects}')
+
     csv_path = get_path(output_folder, 'pypi.csv')
-    previous = load_csv(csv_path)
+    previous = load_csv(csv_path, dry_run=dry_run)

     pypi_downloads = get_pypi_downloads(
         projects=projects,
@@ -63,7 +65,11 @@
     if pypi_downloads.empty:
         LOGGER.info('Not creating empty CSV file %s', csv_path)
     elif pypi_downloads.equals(previous):
-        LOGGER.info('Skipping update of unmodified CSV file %s', csv_path)
+        msg = f'Skipping update of unmodified CSV file {csv_path}'
+        if dry_run:
+            msg += f' because dry_run={dry_run}, meaning no downloads were returned from BigQuery'
+        LOGGER.info(msg)
+
     else:
         create_csv(csv_path, pypi_downloads)
download_analytics/metrics.py

Lines changed: 10 additions & 0 deletions
@@ -134,6 +134,16 @@ def _version_order_key(version_column):

 def _mangle_columns(downloads):
     downloads = downloads.rename(columns=RENAME_COLUMNS)
+    for col in [
+        'python_version',
+        'project',
+        'version',
+        'distro_name',
+        'distro_version',
+        'distro_kernel',
+    ]:
+        downloads[col] = downloads[col].astype('string')
+
     downloads['full_python_version'] = downloads['python_version']
     downloads['python_version'] = downloads['python_version'].str.rsplit('.', n=1).str[0]
     downloads['project_version'] = downloads['project'] + '-' + downloads['version']
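Casting to the nullable pandas `string` dtype, rather than leaving plain `object` columns, keeps missing values as `<NA>` and makes the `.str` operations used just below behave predictably. A small illustration:

```python
import pandas as pd

# The nullable 'string' dtype keeps missing values as <NA> and works
# cleanly with the .str accessor used on python_version.
versions = pd.Series(['3.10.4', '3.9.7', None], dtype='string')
print(versions.dtype)                        # string
print(versions.str.rsplit('.', n=1).str[0])  # 3.10, 3.9, <NA>
```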
