multi_node_job.err
+ cat /var/spool/slurmd/job2607740/slurm_script
+ export MASTER_PORT=25678
+ MASTER_PORT=25678
++ hostname
+ export MASTER_ADDR=i44
+ MASTER_ADDR=i44
+ export WANDB_API_KEY=c80687eb51acc4024f6907e16bcf29fd0f9862c1
+ WANDB_API_KEY=c80687eb51acc4024f6907e16bcf29fd0f9862c1
+ export NCCL_DEBUG=INFO
+ NCCL_DEBUG=INFO
+ srun bash -c '
TORCHRUN_ARGS="--node-rank=${SLURM_PROCID} --master-addr=${MASTER_ADDR} --master-port=${MASTER_PORT} --nnodes=${SLURM_NNODES} --nproc-per-node=2"
echo ${SLURM_PROCID}
echo ${TORCHRUN_ARGS}
echo ${SLURMD_NODENAME}
torchrun ${TORCHRUN_ARGS} run_training.py --config cfgs/nano4M/multiclevr_caption_d6-6w512.yaml
'
W0424 00:38:46.240000 1630799 site-packages/torch/distributed/run.py:792]
W0424 00:38:46.240000 1630799 site-packages/torch/distributed/run.py:792] *****************************************
W0424 00:38:46.240000 1630799 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0424 00:38:46.240000 1630799 site-packages/torch/distributed/run.py:792] *****************************************
W0424 00:38:47.017000 1589273 site-packages/torch/distributed/run.py:792]
W0424 00:38:47.017000 1589273 site-packages/torch/distributed/run.py:792] *****************************************
W0424 00:38:47.017000 1589273 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0424 00:38:47.017000 1589273 site-packages/torch/distributed/run.py:792] *****************************************
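torchrun pins OMP_NUM_THREADS to 1 whenever --nproc-per-node launches more than one worker, which protects the node from thread oversubscription but can starve CPU-bound preprocessing. If tuning is needed, one option is to resize torch's intra-op thread pool at runtime from the per-rank CPU share; a sketch, assuming the torchrun-provided LOCAL_WORLD_SIZE variable and an even split of the task's CPUs:

    import os
    import torch

    # torchrun has already exported OMP_NUM_THREADS=1 for this worker.
    # Resize the intra-op thread pool at runtime instead of editing the env.
    cpus = len(os.sched_getaffinity(0))                         # CPUs visible to this Slurm task
    local_world = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))  # workers on this node (torchrun)
    torch.set_num_threads(max(1, cpus // local_world))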
[rank0]:[W424 00:38:49.217552890 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W424 00:38:49.232861491 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank3]:[W424 00:38:51.257061249 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W424 00:38:51.257151815 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
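All four ranks emit this ProcessGroupNCCL warning because run_training.py calls barrier() before the rank-to-GPU mapping is known. It is harmless when the mapping torchrun implies is correct, but the warning's own suggestion removes the ambiguity entirely. A minimal sketch of that fix (the actual init code in run_training.py is not shown in this log; LOCAL_RANK is set by torchrun):

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # exported by torchrun
    torch.cuda.set_device(local_rank)

    # Binding the process group to one device up front means barrier()
    # never has to guess which GPU this rank owns.
    dist.init_process_group(backend="nccl",
                            device_id=torch.device("cuda", local_rank))
    dist.barrier()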
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: rayane-charifchefchaouni (rayane-charifchefchaouni-epfl) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in /home/rcharif/nano4M-model/wandb/run-20250424_003852-hi1udhvf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run multiclevr_caption_d6-6w512
wandb: ⭐️ View project at https://wandb.ai/rayane-charifchefchaouni-epfl/COM304_nano4M
wandb: 🚀 View run at https://wandb.ai/rayane-charifchefchaouni-epfl/COM304_nano4M/runs/hi1udhvf
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(
/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(
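The DataLoader warning flags a real mismatch: the config requests 10 workers per process while Slurm exposes only 4 CPUs to each task, which risks slow or frozen loading. A hedged sketch of capping the worker count at the CPUs actually allocated (the DataLoader construction in run_training.py is not visible here):

    import os

    # sched_getaffinity reports the CPUs Slurm's cgroup actually grants this
    # task (4 here); os.cpu_count() would report the whole node instead.
    max_workers = len(os.sched_getaffinity(0))
    num_workers = min(10, max_workers)  # 10 is the configured value in this run

    # loader = DataLoader(dataset, ..., num_workers=num_workers)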
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/rcharif/nano4M-model/run_training.py", line 398, in <module>
[rank2]: main(args)
[rank2]: File "/home/rcharif/nano4M-model/run_training.py", line 220, in main
[rank2]: train_stats = train_loop(
[rank2]: File "/home/rcharif/nano4M-model/run_training.py", line 267, in train_loop
[rank2]: for step, data_dict in enumerate(metric_logger.log_every(data_loader_train, print_freq, iter_len=args.total_iters, header=header, start_iter=args.start_iteration)):
[rank2]: File "/home/rcharif/nano4M-model/nanofm/utils/logger.py", line 155, in log_every
[rank2]: for obj in iterable:
[rank2]: File "/home/rcharif/nano4M-model/nanofm/data/utils.py", line 23, in infinite_iterator
[rank2]: for batch in loader:
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
[rank2]: data = self._next_data()
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
[rank2]: return self._process_data(data)
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
[rank2]: data.reraise()
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
[rank2]: raise exception
[rank2]: ValueError: Caught ValueError in DataLoader worker process 0.
[rank2]: Original Traceback (most recent call last):
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank2]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank2]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank2]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank2]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank2]: File "/home/rcharif/nano4M-model/nanofm/data/multimodal/simple_multimodal_dataset.py", line 127, in __getitem__
[rank2]: raise ValueError(f"Unknown modality: {modality}")
[rank2]: ValueError: Unknown modality: caption
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/rcharif/nano4M-model/run_training.py", line 398, in <module>
[rank3]: main(args)
[rank3]: File "/home/rcharif/nano4M-model/run_training.py", line 220, in main
[rank3]: train_stats = train_loop(
[rank3]: File "/home/rcharif/nano4M-model/run_training.py", line 267, in train_loop
[rank3]: for step, data_dict in enumerate(metric_logger.log_every(data_loader_train, print_freq, iter_len=args.total_iters, header=header, start_iter=args.start_iteration)):
[rank3]: File "/home/rcharif/nano4M-model/nanofm/utils/logger.py", line 155, in log_every
[rank3]: for obj in iterable:
[rank3]: File "/home/rcharif/nano4M-model/nanofm/data/utils.py", line 23, in infinite_iterator
[rank3]: for batch in loader:
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
[rank3]: data = self._next_data()
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
[rank3]: return self._process_data(data)
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
[rank3]: data.reraise()
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
[rank3]: raise exception
[rank3]: ValueError: Caught ValueError in DataLoader worker process 0.
[rank3]: Original Traceback (most recent call last):
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank3]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank3]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank3]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]: File "/home/rcharif/nano4M-model/nanofm/data/multimodal/simple_multimodal_dataset.py", line 127, in __getitem__
[rank3]: raise ValueError(f"Unknown modality: {modality}")
[rank3]: ValueError: Unknown modality: caption
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/rcharif/nano4M-model/run_training.py", line 398, in <module>
[rank1]: main(args)
[rank1]: File "/home/rcharif/nano4M-model/run_training.py", line 220, in main
[rank1]: train_stats = train_loop(
[rank1]: File "/home/rcharif/nano4M-model/run_training.py", line 267, in train_loop
[rank1]: for step, data_dict in enumerate(metric_logger.log_every(data_loader_train, print_freq, iter_len=args.total_iters, header=header, start_iter=args.start_iteration)):
[rank1]: File "/home/rcharif/nano4M-model/nanofm/utils/logger.py", line 155, in log_every
[rank1]: for obj in iterable:
[rank1]: File "/home/rcharif/nano4M-model/nanofm/data/utils.py", line 23, in infinite_iterator
[rank1]: for batch in loader:
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
[rank1]: data = self._next_data()
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
[rank1]: return self._process_data(data)
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
[rank1]: data.reraise()
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
[rank1]: raise exception
[rank1]: ValueError: Caught ValueError in DataLoader worker process 0.
[rank1]: Original Traceback (most recent call last):
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank1]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: File "/home/rcharif/nano4M-model/nanofm/data/multimodal/simple_multimodal_dataset.py", line 127, in __getitem__
[rank1]: raise ValueError(f"Unknown modality: {modality}")
[rank1]: ValueError: Unknown modality: caption
Traceback (most recent call last):
File "/home/rcharif/nano4M-model/run_training.py", line 398, in <module>
main(args)
File "/home/rcharif/nano4M-model/run_training.py", line 220, in main
train_stats = train_loop(
File "/home/rcharif/nano4M-model/run_training.py", line 267, in train_loop
for step, data_dict in enumerate(metric_logger.log_every(data_loader_train, print_freq, iter_len=args.total_iters, header=header, start_iter=args.start_iteration)):
File "/home/rcharif/nano4M-model/nanofm/utils/logger.py", line 155, in log_every
for obj in iterable:
File "/home/rcharif/nano4M-model/nanofm/data/utils.py", line 23, in infinite_iterator
for batch in loader:
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
data = self._next_data()
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
return self._process_data(data)
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
data.reraise()
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/rcharif/nano4M-model/nanofm/data/multimodal/simple_multimodal_dataset.py", line 127, in __getitem__
raise ValueError(f"Unknown modality: {modality}")
ValueError: Unknown modality: caption
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/rcharif/nano4M-model/run_training.py", line 398, in <module>
[rank0]: main(args)
[rank0]: File "/home/rcharif/nano4M-model/run_training.py", line 220, in main
[rank0]: train_stats = train_loop(
[rank0]: File "/home/rcharif/nano4M-model/run_training.py", line 267, in train_loop
[rank0]: for step, data_dict in enumerate(metric_logger.log_every(data_loader_train, print_freq, iter_len=args.total_iters, header=header, start_iter=args.start_iteration)):
[rank0]: File "/home/rcharif/nano4M-model/nanofm/utils/logger.py", line 155, in log_every
[rank0]: for obj in iterable:
[rank0]: File "/home/rcharif/nano4M-model/nanofm/data/utils.py", line 23, in infinite_iterator
[rank0]: for batch in loader:
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
[rank0]: data = self._next_data()
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
[rank0]: return self._process_data(data)
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
[rank0]: data.reraise()
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
[rank0]: raise exception
[rank0]: ValueError: Caught ValueError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/rcharif/nano4M-model/nanofm/data/multimodal/simple_multimodal_dataset.py", line 127, in __getitem__
[rank0]: raise ValueError(f"Unknown modality: {modality}")
[rank0]: ValueError: Unknown modality: caption
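Every rank, on both nodes, dies on the same root cause: simple_multimodal_dataset.py line 127 rejects the caption modality that cfgs/nano4M/multiclevr_caption_d6-6w512.yaml requests, so each DataLoader worker fails on its first fetch and torchrun tears the job down. The dataset source is not part of this log; the sketch below only assumes a branch-per-modality dispatch ending in the confirmed raise, and every name except "caption" and the ValueError is hypothetical:

    from torch.utils.data import Dataset

    class SimpleMultimodalDataset(Dataset):
        # Hypothetical reconstruction of the failing dispatch; only the
        # ValueError raised on line 127 is confirmed by the tracebacks above.
        def __init__(self, modalities):
            self.modalities = modalities

        def __getitem__(self, idx):
            sample = {}
            for modality in self.modalities:
                if modality == "tok_rgb":        # assumed existing branch
                    sample[modality] = self._load_tokens(idx, modality)
                elif modality == "caption":      # the branch this config needs
                    sample[modality] = self._load_caption(idx)
                else:
                    raise ValueError(f"Unknown modality: {modality}")
            return sample

Equivalently, if the dataset already supports text under another key, renaming the modality in the YAML config to that key would avoid touching the dataset code at all.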
[rank2]:[W424 00:39:21.473397938 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
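This shutdown warning is a side effect of the crash: run_training.py never reached a destroy_process_group() call. Wrapping teardown in a finally block avoids the leak however training exits; a minimal sketch (the function name is illustrative):

    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        try:
            ...  # training loop
        finally:
            # Runs even when training raises, as it does in this job,
            # releasing NCCL resources and silencing the warning.
            dist.destroy_process_group()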
W0424 00:39:22.510000 1630799 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1630820 closing signal SIGTERM
E0424 00:39:22.627000 1589273 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1589323) of binary: /work/com-304/new_environment/anaconda3/envs/nanofm/bin/python
Traceback (most recent call last):
File "/work/com-304/new_environment/anaconda3/envs/nanofm/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E0424 00:39:22.630000 1630799 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1630821) of binary: /work/com-304/new_environment/anaconda3/envs/nanofm/bin/python
return f(*args, **kwargs)
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-04-24_00:39:22
host : ixl01.izar.cluster
rank : 3 (local_rank: 1)
exitcode : 1 (pid: 1589324)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-24_00:39:22
host : ixl01.izar.cluster
rank : 2 (local_rank: 0)
exitcode : 1 (pid: 1589323)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/work/com-304/new_environment/anaconda3/envs/nanofm/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/work/com-304/new_environment/anaconda3/envs/nanofm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_training.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-24_00:39:22
host : i44.izar.cluster
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1630821)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
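Both failure summaries report error_file: <N/A>, so the elastic agent could only print pids and exit codes; the real traceback survives only in the raw stderr above. Decorating the entrypoint as the linked elastic-errors docs describe makes torchrun capture and report it; a minimal sketch:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # on failure, writes this rank's traceback to an error file torchrun reports
    def main() -> None:
        ...  # training entrypoint

    if __name__ == "__main__":
        main()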
srun: error: ixl01: task 1: Exited with exit code 1
srun: Terminating StepId=2607740.0
slurmstepd: error: *** STEP 2607740.0 ON i44 CANCELLED AT 2025-04-24T00:39:22 ***
srun: error: i44: task 0: Terminated
srun: Force Terminated StepId=2607740.0