Conversation

allenwang28 (Contributor) commented Oct 10, 2025

This PR does a few things:

Provisioner changes

  • This PR originally started by adding VLLM_HOST_IP, world_size, and rank as environment variables during proc_mesh creation.
  • But then there was a clear need to inherit a few relevant environment variables (like TORCHSTORE_USE_RDMA) in the provisioner, so I also added:

Environment variable related changes

  • Renames env_constants.py to env.py
  • Introduces an EnvVar pattern where you can declare the name, default value, and description of a variable, and easily resolve its logical value in code. This reduces the boilerplate we had for string checks, etc.
  • Applies the changes to the relevant spots in the codebase
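The EnvVar pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual src/forge/env.py API: the class and method names (EnvVar, get_value) and the exact fields are assumptions based on the PR description.

```python
# Hypothetical sketch of the EnvVar pattern; names are illustrative,
# not the actual forge/env.py implementation.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvVar:
    """Declares an environment variable with a default and a description."""

    name: str
    default: str
    description: str

    def get_value(self) -> str:
        # The process environment wins; otherwise fall back to the
        # declared default.
        return os.environ.get(self.name, self.default)


# Example declaration, using an env var name mentioned in the PR.
TORCHSTORE_USE_RDMA = EnvVar(
    name="TORCHSTORE_USE_RDMA",
    default="false",
    description="Whether torchstore should use RDMA transfers.",
)
```

A call site then uses `TORCHSTORE_USE_RDMA.get_value()` instead of a scattered `os.getenv("TORCHSTORE_USE_RDMA", "false")`, and the description lives next to the declaration.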

meta-cla bot added the label "CLA Signed" (this label is managed by the Meta Open Source bot) on Oct 10, 2025.
joecummings (Member) left a comment:

Overall LGTM - huge improvement over our scattered env variables! Just would like to get @felipemello1 's quick thoughts on the perf_tracker stuff.

joecummings (Member) commented on src/forge/env.py:

    @dataclass
    class EnvVar:

Surprising to me that an abstraction like this doesn't exist in the Python world.

joecummings (Member) commented on src/forge/env.py (outdated):

    @functools.cache

What's the reasoning for caching this?

felipemello1 (Contributor) replied:

What happens if we do:

    all_env_vars()
    # some code changes an env var
    all_env_vars()

Would we get the first cached value or the updated one?

allenwang28 (Contributor, author) replied:

Hmm, I assumed that we wouldn't change env vars during the run itself. So, to avoid rebuilding this list every time we create a proc mesh, we cache it.

So in your example, we would get the first cached value. To avoid confusion I'll remove the cache for now.
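The staleness behavior being discussed is just how functools.cache works: a zero-argument cached function computes its result once and returns the same object forever. A minimal repro, where `all_env_vars` is a stand-in for the cached helper in the PR (not the actual forge implementation):

```python
# Minimal repro of the caching question: a cached snapshot of the
# environment does not see later mutations.
import functools
import os


@functools.cache
def all_env_vars() -> dict[str, str]:
    # Snapshot the environment once; every later call returns this
    # exact same dict object.
    return dict(os.environ)


first = all_env_vars()
os.environ["SOME_FLAG"] = "1"  # "some code changes an env var"
second = all_env_vars()
assert second is first  # the first cached snapshot is returned, not an update
```

Dropping the decorator (as the author decided to do) makes each call re-read `os.environ`, at the cost of rebuilding the list on every proc_mesh creation.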

joecummings (Member) commented on:

        os.getenv(METRIC_TIMER_USES_GPU, str(self.time_with_gpu)).lower() == "true"
    ) and torch.cuda.is_available()

    # TODO - follow up on if this env var behavior makes sense.

I don't exactly follow this - maybe @felipemello1 can weigh in?

felipemello1 (Contributor) replied:

I wanted a way to shut down the CUDA timing, in case we were worried that it was causing OOMs or blocking the GPU. Currently it lets you force everything to CPU, force everything to GPU, or keep things as they are (when unset).

We could reduce it to just "force everything to CPU" and "keep as is (unset)".
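The tri-state override described above can be sketched as a small helper. This is an assumption-laden illustration: the helper name `resolve_time_with_gpu` is invented, and the real code additionally ANDs the result with `torch.cuda.is_available()` as shown in the snippet under review.

```python
# Sketch of the tri-state env var override: force CPU timing, force GPU
# timing, or (when unset) keep the per-call-site default. The helper
# name is hypothetical; METRIC_TIMER_USES_GPU matches the PR snippet.
import os


def resolve_time_with_gpu(default: bool) -> bool:
    """Return whether to time with CUDA events, honoring the env override."""
    override = os.getenv("METRIC_TIMER_USES_GPU")
    if override is None:
        return default  # unset: keep whatever the call site configured
    return override.lower() == "true"
```

Collapsing this to the two-state version felipemello1 mentions would mean treating any set value as "force CPU" and unset as "keep as is".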

allenwang28 (Contributor, author) replied:

Yeah, I understand why it's implemented this way -- I don't think we need to change it now.

felipemello1 (Contributor) left a comment:

Looks good! I like how we can add descriptions to them :)

Just a small question on the cache portion.

allenwang28 merged commit 3303af5 into meta-pytorch:main on Oct 10, 2025 (8 checks passed), and deleted the vllm_multinode branch.