Environment variable related changes #374
Conversation
Overall LGTM - huge improvement over our scattered env variables! Just would like to get @felipemello1 's quick thoughts on the perf_tracker stuff.
```python
@dataclass
class EnvVar:
```
Surprising to me that an abstraction like this doesn't exist in the Python world.
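For context, a minimal sketch of what such a typed env-var abstraction might look like. Only the `EnvVar` name comes from the diff; the fields, the `get_value()` helper, and the coercion rules are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Any


# Hypothetical sketch; only the EnvVar class name appears in the PR diff.
# Field names and get_value() are assumptions, not the PR's actual API.
@dataclass
class EnvVar:
    name: str              # environment variable name, e.g. "TORCHSTORE_USE_RDMA"
    default: Any           # value used when the variable is unset
    description: str = ""  # human-readable docs (the PR adds descriptions)

    def get_value(self) -> Any:
        raw = os.environ.get(self.name)
        if raw is None:
            return self.default
        if self.default is None:
            return raw
        # Coerce booleans the same way the diff does: lowercase compare to "true"
        if isinstance(self.default, bool):
            return raw.lower() == "true"
        return type(self.default)(raw)
```

The payoff of grouping name, default, and description in one object is that all env vars can be enumerated and documented in one place instead of scattered `os.getenv` calls.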
src/forge/env.py (Outdated)
```python
@functools.cache
```
What's the reasoning for caching this?
What happens if we do:
1. `all_env_vars()`
2. some code changes an env var
3. `all_env_vars()`

Would we get the first cached result or the updated one?
Hmm, so I assumed that we wouldn't change env vars during the run itself. To avoid recreating this list every time we create a proc mesh, we cache it.
So in your example, we would get the first cached result. To avoid confusion I'll remove the cache for now...
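The staleness concern above is easy to demonstrate: `functools.cache` memoizes the first return value and never re-reads the environment. A small standalone example (`DEMO_FLAG` and `cached_flag` are made-up names for illustration):

```python
import functools
import os


@functools.cache
def cached_flag() -> str:
    # Reads the environment on the first call; every later call
    # returns the memoized result without re-reading.
    return os.environ.get("DEMO_FLAG", "unset")


os.environ["DEMO_FLAG"] = "first"
a = cached_flag()          # -> "first"

os.environ["DEMO_FLAG"] = "second"
b = cached_flag()          # -> still "first": the cache ignores the change

cached_flag.cache_clear()  # escape hatch: drop the memoized value
c = cached_flag()          # -> "second"
```

So if env vars could change mid-run, either the cache must go or callers need `cache_clear()`, which matches the decision above to drop the cache.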
```python
    os.getenv(METRIC_TIMER_USES_GPU, str(self.time_with_gpu)).lower() == "true"
) and torch.cuda.is_available()

# TODO - follow up on if this env var behavior makes sense.
```
I don't exactly follow this - maybe @felipemello1 can weigh in?
I wanted a way to shut down the CUDA timing, in case we were worried that it was causing OOMs or blocking the GPU. Currently it lets you make everything CPU, make everything GPU, or keep things as they are (unset).
We could reduce it to just 'make everything CPU' and 'keep as it is (unset)'.
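The three-state override described here can be sketched as a small helper. The variable name `METRIC_TIMER_USES_GPU` comes from the diff; the function name and signature are assumptions, not the PR's actual code:

```python
import os

METRIC_TIMER_USES_GPU = "METRIC_TIMER_USES_GPU"


def resolve_time_with_gpu(per_timer_default: bool, cuda_available: bool) -> bool:
    """Hypothetical sketch of the three-state env-var override.

    Unset      -> keep each timer's own setting ("keep as it is").
    "true"     -> force GPU timing everywhere.
    other/"false" -> force CPU timing everywhere.
    """
    raw = os.environ.get(METRIC_TIMER_USES_GPU)
    if raw is None:
        wants_gpu = per_timer_default
    else:
        wants_gpu = raw.lower() == "true"
    # GPU timing only makes sense when CUDA is actually available.
    return wants_gpu and cuda_available
```

Collapsing to two states would just remove the `"true"` branch, leaving "force CPU" and "unset".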
yeah I understand why it's implemented this way -- I don't think we need to change it now
looks good! I like how we can add descriptions to them :)
Just a small question on the cache portion
This PR does a few things:
- Provisioner changes: sets `VLLM_HOST_IP`, `world_size`, and `rank` as environment variables in `proc_mesh` creation.
- (`TORCHSTORE_USE_RDMA`) in the provisioner, and so I added in: ...
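To make the provisioner change concrete, one way it could assemble the per-process environment is sketched below. Only the `VLLM_HOST_IP` name comes from the PR description; the helper name and the exact spelling of the `world_size`/`rank` variables are assumptions:

```python
# Hypothetical sketch; the PR description only says VLLM_HOST_IP, world_size,
# and rank are set as env vars at proc_mesh creation. Names of the helper and
# of the WORLD_SIZE/RANK keys are assumptions, not the PR's actual code.
def build_proc_env(host_ip: str, world_size: int, rank: int) -> dict[str, str]:
    return {
        "VLLM_HOST_IP": host_ip,          # from the PR description
        "WORLD_SIZE": str(world_size),    # env values must be strings
        "RANK": str(rank),
    }
```

Each spawned process would then receive this dict as its environment overrides, so vLLM and torch distributed can discover their host and placement without extra wiring.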