-
Notifications
You must be signed in to change notification settings - Fork 8
Use Google Cloud Storage by default #145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| huggingface_org=huggingface_org, | ||
| huggingface_repo=huggingface_repo, | ||
| gcs_bucket=gcs_bucket, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This comment should be attached to the DataFile initialization above, GH review limitations disallowed)
question: Is this method structure too limiting and envision ourselves as the only customer?
We've had quite a few issues around prioritization HF/Google Cloud data storage. In order to make it easier for ourselves and to make .py more open to any user, I'd suggest implementing download() with an OOP strategy pattern, passing in two args: a string literal corresponding to one permitted strategy, perhaps providing a default (e.g., "google_cloud_bucket"), then another arg called "options" that was one of a variety of options schemas, e.g., gcs_bucket for Google downloads and the two HF options for Hugging Face.
This would do the following:
- Enable us to easily override the default behavior wherever we want without being tied to a particular prioritization
- Enable us to implement certain strategy-agnostic file management structures, like versioning and caching, that we would not need to tie to one vendor
- Enable us to build new download strategies as needed, e.g., if we ever hosted something on GitHub again or on BitBucket
- Enable external users to build their own download strategies and plug them in as needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.
Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.
| ) | ||
|
|
||
| logging.info = print | ||
| # NOTE: tests will break on build if you don't default to huggingface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question, blocking: Is the UK non-public data hosted in a bucket? I thought I remembered it being GCP-only
| ) | ||
| return filepath | ||
|
|
||
| # NOTE: tests will break on build if you don't default to huggingface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This comment would no longer be true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
anth-volk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikhilwoodruff had some thoughts for you around shifting how we do downloads to prevent future bugginess. Curious to hear your thoughts.
mikesmit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss why hugginface at all. otherwise let's go.
| huggingface_org=huggingface_org, | ||
| huggingface_repo=huggingface_repo, | ||
| gcs_bucket=gcs_bucket, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.
Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.
| ) | ||
| return filepath | ||
|
|
||
| # NOTE: tests will break on build if you don't default to huggingface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
| gcs_bucket=gcs_bucket, | ||
| ) | ||
|
|
||
| logging.info = print |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion While we're in here, could this use the Python logger please? I'd make this blocking, but the switch to gcp is probably more important.
|
To do: just don't do huggingface in policyengine.py |
|
Closing, as my understanding is that this is superseded by #148. Feel free to reopen if I'm wrong @nikhilwoodruff. |
No description provided.