
Commit 2ca7e70

nikhilwoodruff and juaristi22 authored and committed
Update data files in Google Cloud Buckets on publish (#249)
* Update data files in Google Cloud Buckets on publish

  Fixes #248

* Remove testing code
* Add Google auth
* Adjust job permissions
* Update permissions
* Remove auth step from PR action
* Update data files in Google Cloud Buckets on publish

  Fixes #248
1 parent 03c33ae commit 2ca7e70

File tree

4 files changed (+29, −3 lines):

- .github/workflows/code_changes.yaml
- changelog_entry.yaml
- policyengine_us_data/storage/upload_completed_datasets.py
- pyproject.toml

.github/workflows/code_changes.yaml

Lines changed: 8 additions & 0 deletions
@@ -21,6 +21,10 @@ jobs:
         with:
           args: ". -l 79 --check"
   Test:
+    permissions:
+      contents: "read"
+      # Required to auth against gcp
+      id-token: "write"
     runs-on: larger-runner
     steps:
       - name: Checkout repo
@@ -32,6 +36,10 @@ jobs:
         uses: actions/setup-python@v2
         with:
          python-version: '3.11'
+      - uses: "google-github-actions/auth@v2"
+        with:
+          workload_identity_provider: "projects/322898545428/locations/global/workloadIdentityPools/policyengine-research-id-pool/providers/prod-github-provider"
+          service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"

       - name: Install package
         run: uv pip install -e .[dev] --system
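The permissions block and the auth step work together: id-token: "write" lets the job request a GitHub OIDC token, and google-github-actions/auth@v2 exchanges it for short-lived Google credentials via Workload Identity Federation, writing Application Default Credentials into the job environment. Later steps can then create a Cloud Storage client without any service-account key file. A minimal sketch of that pattern, assuming the policyengine-us-data bucket introduced elsewhere in this commit (the object listing is a hypothetical smoke test, not part of the change):

from google.cloud import storage

# Picks up the Application Default Credentials configured by
# google-github-actions/auth@v2 earlier in the job; no key file needed.
client = storage.Client()
bucket = client.bucket("policyengine-us-data")

# Hypothetical smoke test: list a few objects to confirm the federated
# credentials can reach the bucket.
for blob in client.list_blobs(bucket, max_results=5):
    print(blob.name)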

changelog_entry.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
 - bump: patch
   changes:
-    changed:
-      - Methodology to directly impute auto loan interest instead of assuming a 2% interest rate on auto loan balance.
+    fixed:
+      - Upload to GCP on dataset build.

policyengine_us_data/storage/upload_completed_datasets.py

Lines changed: 18 additions & 1 deletion
@@ -5,10 +5,20 @@
 )
 from policyengine_us_data.storage import STORAGE_FOLDER
 from policyengine_us_data.utils.huggingface import upload
+from google.cloud import storage


 def upload_datasets():
-    for dataset in [EnhancedCPS_2024, Pooled_3_Year_CPS_2023, CPS_2023]:
+    storage_client = storage.Client()
+    bucket = storage_client.bucket("policyengine-us-data")
+
+    datasets_to_upload = [
+        EnhancedCPS_2024,
+        Pooled_3_Year_CPS_2023,
+        CPS_2023,
+    ]
+
+    for dataset in datasets_to_upload:
         dataset = dataset()
         if not dataset.exists:
             raise ValueError(
@@ -21,6 +31,13 @@ def upload_datasets():
             dataset.file_path.name,
         )

+        blob = dataset.file_path.name
+        blob = bucket.blob(blob)
+        blob.upload_from_filename(dataset.file_path)
+        print(
+            f"Uploaded {dataset.file_path.name} to GCS bucket policyengine-us-data."
+        )
+

 if __name__ == "__main__":
     upload_datasets()
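The two blob assignments added inside the loop could be collapsed into one. A minimal sketch of an equivalent upload step under the same assumptions as the diff above (the upload_file_to_gcs helper name is hypothetical, not part of this commit):

from google.cloud import storage


def upload_file_to_gcs(bucket: storage.Bucket, file_path) -> None:
    # Store each dataset file under its own filename at the bucket root.
    blob = bucket.blob(file_path.name)
    blob.upload_from_filename(str(file_path))
    print(f"Uploaded {file_path.name} to GCS bucket {bucket.name}.")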

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ dependencies = [
     "microdf_python>=0.4.3",
     "microimpute",
     "pip-system-certs",
+    "google-cloud-storage",
 ]

 [project.optional-dependencies]
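Because google-cloud-storage is added as a runtime dependency rather than an extra, the new from google.cloud import storage line resolves in any environment where the package itself is installed. A hypothetical sanity check, not part of this commit:

import importlib.metadata

from google.cloud import storage  # noqa: F401  # import check

# Confirm the new dependency is installed and report its version.
print(importlib.metadata.version("google-cloud-storage"))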
