Commit ac5552a

[Fix] warmup communication right after init process group. (#1380)
1 parent c42f327 commit ac5552a

File tree

1 file changed (+7 −0 lines)


xtuner/v1/train/trainer.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -1323,6 +1323,13 @@ def _init_dist(self, backend: str | None = None):
         if not dist.is_initialized():
             init_process_group(backend=backend)
             torch.accelerator.set_device_index(int(os.environ["LOCAL_RANK"]))
+            # In some cases, the datasets perform massive numpy loading before the first communication.
+            # Once the dataset is built, this loading creates many anonymous mmap allocations.
+            # The THP (transparent huge page) kernel thread continuously scans these anonymous mmaps and merges them into huge pages.
+            # If we then perform communication for the first time, the backend (e.g., NCCL) may register
+            # and lock addresses that may have been changed by THP, which causes a crash. So we should warm up first.
+            warmup_tensor = torch.ones(4, 4, device=torch.accelerator.current_accelerator())
+            dist.all_reduce(warmup_tensor)
 
     def _init_xtuner_meta(self, work_dir: Path, auto_resume: bool) -> XTunerMeta:
         if not work_dir.exists():
```
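The pattern in this fix is to issue one throwaway collective immediately after `init_process_group`, so the backend registers and pins its communication buffers before any large anonymous allocations can be rearranged by THP. A minimal standalone sketch of the same pattern is below; it uses the `gloo` backend with a single-process group so it can run on any machine without a GPU, and the loopback `MASTER_ADDR`/`MASTER_PORT` values are assumptions for illustration, not part of the original trainer.

```python
import os
import torch
import torch.distributed as dist

# Assumed rendezvous settings for a single-process demo group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Warm up collective communication right away, before dataset building
# (and its large anonymous-mmap numpy loads) has a chance to run.
warmup_tensor = torch.ones(4, 4)
dist.all_reduce(warmup_tensor)  # sum across ranks; with world_size=1 the values are unchanged

dist.destroy_process_group()
```

With `world_size=1` the all-reduce sums over a single rank, so `warmup_tensor` still holds all ones afterward; in the real trainer the same call runs across all ranks on the accelerator device, and only its side effect (initializing the backend's buffers) matters.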
