torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_stage2.py FAILED
Failures:
[1]:
  time       : 2025-10-13_16:22:09
  host       : tyut-PowerEdge-R750
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 163472)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2025-10-13_16:22:09
  host       : tyut-PowerEdge-R750
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 163471)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
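
The log above only reports `exitcode : 1` and `error_file: <N/A>` for both ranks, so the underlying Python exception is not visible. A minimal sketch of how the traceback could be surfaced, assuming the entry point of `train_stage2.py` is a function that can be decorated: the `record` decorator from `torch.distributed.elastic.multiprocessing.errors` writes uncaught exceptions to the error file that torchrun reports. The `main` function and its body here are hypothetical placeholders, not the script's actual code, and the `ImportError` fallback exists only so the sketch runs without PyTorch installed.

```python
try:
    # Real PyTorch API: records uncaught exceptions so torchrun can report them.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback (assumption for illustration): a no-op decorator so this
    # sketch still runs on a machine without PyTorch.
    def record(fn):
        return fn


@record
def main():
    # Hypothetical stand-in for the training logic in train_stage2.py.
    # Any exception raised in here is captured by @record and written
    # to the per-rank error file instead of being lost.
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, the `error_file` and `traceback` fields of the failure summary should carry the actual exception for the failing rank. The stderr output printed above the summary by rank 0 (the reported root cause) is also worth including in the issue.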