Commit bb256dc
Fix broken component integration test due to compute_world_size app not respecting env vars set by torchrun (#1029)
Summary:
`compute_world_size` is run as an integration test in a `-j 2x2` configuration using `torchrun`, which sets `MASTER_ADDR` and `MASTER_PORT`. However, the app was ignoring those env vars and overriding them with the values from the hydra config (added so that `compute_world_size` can run as a single process without `torchrun`).
Integration tests are failing in CI because `localhost:0` (pick a random free port) is used as `MASTER_ADDR:MASTER_PORT` on all 4 workers, so all 4 workers deadlock waiting for each other to join the job.
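To illustrate why port `0` cannot serve as a shared rendezvous port, here is a minimal sketch using only the standard `socket` module. It assumes each worker would resolve port `0` independently; the helper names `bind_ephemeral` and `ephemeral_ports` are hypothetical and are not part of `compute_world_size` or `torchrun`.

```python
import socket
from typing import List


def bind_ephemeral(host: str = "127.0.0.1") -> socket.socket:
    """Bind a TCP socket to port 0 so the OS picks a free port (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock


def ephemeral_ports(n: int = 4, host: str = "127.0.0.1") -> List[int]:
    """Return the ports the OS assigned to n sockets that each asked for port 0."""
    socks = [bind_ephemeral(host) for _ in range(n)]
    try:
        # All sockets are still open here, so the OS must have given each a distinct port.
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


if __name__ == "__main__":
    # Stand-in for 4 workers that each independently resolve "localhost:0".
    print(ephemeral_ports(4))
```

Because every request for port `0` resolves to a different concrete port, workers configured this way have no common endpoint to meet on, which is consistent with the hang described above.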
This diff fixes this by only setting the env vars if one is not already set.
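The changed lines are not shown here, so the following is only a sketch of the general pattern, not the actual diff. It assumes a per-variable default (each variable is filled in only when it is missing), and the helper name `apply_rendezvous_defaults` and the example values are hypothetical.

```python
import os
from typing import Mapping, MutableMapping, Optional


def apply_rendezvous_defaults(
    defaults: Mapping[str, str],
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Fill in rendezvous env vars only where the launcher has not set them.

    `defaults` stands in for values read from a config file (assumption);
    `env` defaults to the real process environment.
    """
    target = os.environ if env is None else env
    for key, value in defaults.items():
        # setdefault leaves an existing value (e.g. one exported by torchrun) untouched.
        target.setdefault(key, str(value))
    return target


if __name__ == "__main__":
    # Simulate a worker launched by a tool that already exported both variables.
    launched = {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500"}
    apply_rendezvous_defaults({"MASTER_ADDR": "localhost", "MASTER_PORT": "0"}, launched)
    print(launched)

    # Simulate a single process started without a launcher.
    standalone = {}
    apply_rendezvous_defaults({"MASTER_ADDR": "localhost", "MASTER_PORT": "0"}, standalone)
    print(standalone)
```

With this pattern the config values act purely as a fallback for the single-process case, while launcher-provided values always win.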
Differential Revision: D719199031
parent 0647560, commit bb256dc
1 file changed: +6 -4 lines changed (original lines 20-23 removed, new lines 20-25 added; the line contents are not preserved here).