Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions policyengine_api/jobs/calculate_economy_simulation_job.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,23 @@ def run(
comment = lambda x: set_comment_on_job(x, *identifiers)
comment("Computing baseline")

# Get the current dataset version

version_file = download_huggingface_dataset(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, blocking: I thought we weren't using Hugging Face within the API. I'd expect this information to come from GCP.

Feel free to correct me if I'm wrong. My understanding was that we're keeping the HF code in the -data packages, -core, and any non-API packages.

repo_name=f"policyengine/policyengine-{country_id}-data",
repo_filename="version.json",
)
with open(version_file, "r") as f:
version = json.load(f).get("version")

data_versions = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment, blocking: I think there are some concerns with this approach

First, data_versions would imply that there are multiple keys and values stored, but I'm only seeing one here. Second, I didn't realize that dataset args always had versions embedded; I'd be concerned that this code might break down if and when we ever pass non-versioned values; if the filepath also contains / characters, I imagine we'd start lopping off filename metadata if the version weren't present.

I'd recommend a different approach, perhaps using the code being added as part of https://github.com/PolicyEngine/issues/issues/378 to pull the version number from GCP metadata. Since we're committing to versioning all dataset files at once in a given dataset repo, I think we could just pass a version value as a value associated with a data_version key in the logging output.

dataset.split("/")[-1]: version
}

country_package_versions = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This duplicates some of what we already have with v1_country_package_version and v2_country_package_version. I'd recommend moving to one model_version key and implementing after merging the entire chain proposed in https://github.com/PolicyEngine/issues/issues/378.

f"policyengine-{country_id}": COUNTRY_PACKAGE_VERSIONS[country_id]
}

# If comparing against API v2, start job
if check_against_api_v2:

Expand All @@ -172,6 +189,8 @@ def run(
"baseline_policy_id": baseline_policy_id,
"time_period": time_period,
"dataset": dataset,
"package_versions": country_package_versions,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue, blocking: I believe this will fail a validation check for the V2V1Comparison schema, which isn't modified as part of this PR

"data_versions": data_versions,
"v1_country_package_version": COUNTRY_PACKAGE_VERSIONS[
country_id
],
Expand Down
Loading