Skip to content

Conversation

@nikhilwoodruff
Copy link
Collaborator

No description provided.

@nikhilwoodruff nikhilwoodruff self-assigned this May 21, 2025
@nikhilwoodruff nikhilwoodruff added the enhancement New feature or request label May 22, 2025
huggingface_org=huggingface_org,
huggingface_repo=huggingface_repo,
gcs_bucket=gcs_bucket,
)
Copy link
Contributor

@anth-volk anth-volk May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This comment should be attached to the DataFile initialization above, GH review limitations disallowed)
question: Is this method structure too limiting and envision ourselves as the only customer?

We've had quite a few issues around prioritization HF/Google Cloud data storage. In order to make it easier for ourselves and to make .py more open to any user, I'd suggest implementing download() with an OOP strategy pattern, passing in two args: a string literal corresponding to one permitted strategy, perhaps providing a default (e.g., "google_cloud_bucket"), then another arg called "options" that was one of a variety of options schemas, e.g., gcs_bucket for Google downloads and the two HF options for Hugging Face.

This would do the following:

  • Enable us to easily override the default behavior wherever we want without being tied to a particular prioritization
  • Enable us to implement certain strategy-agnostic file management structures, like versioning and caching, that we would not need to tie to one vendor
  • Enable us to build new download strategies as needed, e.g., if we ever hosted something on GitHub again or on BitBucket
  • Enable external users to build their own download strategies and plug them in as needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.

Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.

)

logging.info = print
# NOTE: tests will break on build if you don't default to huggingface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, blocking: Is the UK non-public data hosted in a bucket? I thought I remembered it being GCP-only

)
return filepath

# NOTE: tests will break on build if you don't default to huggingface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This comment would no longer be true

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

Copy link
Contributor

@anth-volk anth-volk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikhilwoodruff had some thoughts for you around shifting how we do downloads to prevent future bugginess. Curious to hear your thoughts.

Copy link
Collaborator

@mikesmit mikesmit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss why hugginface at all. otherwise let's go.

huggingface_org=huggingface_org,
huggingface_repo=huggingface_repo,
gcs_bucket=gcs_bucket,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.

Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.

)
return filepath

# NOTE: tests will break on build if you don't default to huggingface.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

gcs_bucket=gcs_bucket,
)

logging.info = print
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion While we're in here, could this use the Python logger please? I'd make this blocking, but the switch to gcp is probably more important.

@nikhilwoodruff
Copy link
Collaborator Author

To do: just don't do huggingface in policyengine.py

@anth-volk
Copy link
Contributor

Closing, as my understanding is that this is superseded by #148. Feel free to reopen if I'm wrong @nikhilwoodruff.

@anth-volk anth-volk closed this May 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants