Skip to content

Conversation

@nikhilwoodruff
Copy link
Collaborator

Copy link
Collaborator

@anth-volk anth-volk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @nikhilwoodruff. TL;DR: there are a couple of breaking changes here, and I wonder if there isn't a better way of doing some of this, given expected changes as part of https://github.com/PolicyEngine/issues/issues/378.


# Get the current dataset version

version_file = download_huggingface_dataset(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, blocking: I thought we weren't using Hugging Face within the API. I'd expect this information to come from GCP.

Feel free to correct me if I'm wrong. My understanding was that we're keeping the HF code in the -data packages, -core, and any non-API packages.

with open(version_file, "r") as f:
version = json.load(f).get("version")

data_versions = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment, blocking: I think there are some concerns with this approach

First, data_versions would imply that there are multiple keys and values stored, but I'm only seeing one here. Second, I didn't realize that dataset args always had versions embedded; I'd be concerned that this code might break down if and when we ever pass non-versioned values; if the filepath also contains / characters, I imagine we'd start lopping off filename metadata if the version weren't present.

I'd recommend a different approach, perhaps using the code being added as part of https://github.com/PolicyEngine/issues/issues/378 to pull the version number from GCP metadata. Since we're committing to versioning all dataset files at once in a given dataset repo, I think we could just pass a version value as a value associated with a data_version key in the logging output.

dataset.split("/")[-1]: version
}

country_package_versions = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This duplicates some of what we already have with v1_country_package_version and v2_country_package_version. I'd recommend moving to one model_version key and implementing after merging the entire chain proposed in https://github.com/PolicyEngine/issues/issues/378.

"baseline_policy_id": baseline_policy_id,
"time_period": time_period,
"dataset": dataset,
"package_versions": country_package_versions,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue, blocking: I believe this will fail a validation check for the V2V1Comparison schema, which isn't modified as part of this PR

@nikhilwoodruff
Copy link
Collaborator Author

Yes sorry this is out of date as per recent changes in the PRs to uk-data and .py and so will change.

@anth-volk
Copy link
Collaborator

Closing, as we fixed via #2528.

@anth-volk anth-volk closed this Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pass data and country package versions to APIv2

3 participants