Closed
With PyTorch.
I see this exception in the log:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/config.py", line 72, in __setstate__
    self.typed_dict = unpickler.load()
                      ^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/pickle.py", line 1347, in load_binstring
    self.append(self._decode_string(data))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/pickle.py", line 1329, in _decode_string
    return value.decode(self.encoding, self.errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
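For reference, the same kind of error can be reproduced in isolation. This is only a minimal sketch: the hand-built pickle below is hypothetical and not the data from this report. It assumes the stream contains a BINSTRING opcode (the one handled by load_binstring, typically written for Python 2 byte strings) whose payload is UTF-8 rather than ASCII. 0xe2 is the lead byte of many three-byte UTF-8 sequences, such as typographic quotes.

```python
import pickle

# Hypothetical, hand-built pickle stream (not the data from this report):
#   b"T"                -> BINSTRING opcode, handled by load_binstring
#   b"\x03\x00\x00\x00" -> payload length 3 (little-endian)
#   b"\xe2\x80\x9c"     -> UTF-8 bytes of U+201C (left double quotation mark)
#   b"."                -> STOP
PY2_STYLE_PICKLE = b"T\x03\x00\x00\x00\xe2\x80\x9c."


def try_load(data, **kwargs):
    """Unpickle `data`; return the UnicodeDecodeError instead of raising it."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return exc


if __name__ == "__main__":
    # Default encoding is "ASCII", so the non-ASCII payload fails to decode.
    print(repr(try_load(PY2_STYLE_PICKLE)))
    # An explicit encoding avoids the error.
    print(repr(try_load(PY2_STYLE_PICKLE, encoding="utf-8")))
    print(repr(try_load(PY2_STYLE_PICKLE, encoding="bytes")))
```

Whether passing an encoding is the right fix here depends on where the byte string in the pickled config comes from, which the traceback alone does not show.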
But then the trainer hangs.
Ps tree:
90521 \_ python3.11
90657 | \_ python3.11
90658 | \_ watch memory
90777 | \_ MPD worker 0
90779 | \_ MPD worker 1
90781 | \_ MPD worker 2
90782 | \_ MPD worker 3
91244 | \_ python3.11
91362 | \_ MPD worker 0
91364 | \_ MPD worker 1
91365 | \_ MPD worker 2
91369 | \_ MPD worker 3
91801 | \_ MPD worker 0
91802 | \_ MPD worker 1
91803 | \_ MPD worker 2
91804 | \_ MPD worker 3
92265 | \_ python3.11 <defunct>
92265 should be the sub proc which got the exception.
Py-spy on the parent:
% py-spy dump -p 90521
Process 90521: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wlbfTg0IM0nc/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)
Thread 90521 (idle): "MainThread"
_launch (multiprocessing/popen_spawn_posix.py:62)
__init__ (multiprocessing/popen_fork.py:19)
__init__ (multiprocessing/popen_spawn_posix.py:32)
_Popen (multiprocessing/context.py:288)
start (multiprocessing/process.py:121)
start (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:55)
__init__ (torch/utils/data/dataloader.py:1039)
_get_iterator (torch/utils/data/dataloader.py:386)
__iter__ (torch/utils/data/dataloader.py:433)
train_epoch (returnn/returnn/torch/engine.py:342)
train (returnn/returnn/torch/engine.py:236)
execute_main_task (returnn/returnn/__main__.py:543)
main (returnn/returnn/__main__.py:737)
<module> (returnn/rnn.py:11)
Or pystack:
...
(Python) File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/util/multi_proc_non_daemonic_spawn.py", line 55, in start
super().start()
Arguments:
self: <NonDaemonicSpawnProcess at 0x7fdda657d890>
Locals:
__class__: <cell at 0x7fddda571300>
(Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
Arguments:
self: <NonDaemonicSpawnProcess at 0x7fdda657d890>
(Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
Arguments:
process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
Locals:
Popen: <type at 0x113085f0>
(Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
Arguments:
process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
self: <Popen at 0x7fddad9ecbd0>
Locals:
__class__: <cell at 0x7fde1b92e620>
(Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
Arguments:
process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
self: <Popen at 0x7fddad9ecbd0>
(Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 62, in _launch
f.write(fp.getbuffer())
Arguments:
process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
self: <Popen at 0x7fddad9ecbd0>
Locals:
cmd: [b"/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11", "-c", ...]
f: <_io.BufferedWriter at 0x7fdda65605c0>
parent_w: 126
child_r: 125
parent_r: 123
fp: <_io.BytesIO at 0x7fdda657ba60>
child_w: 124
prep_data: {"log_to_stderr": False, "authkey": <BINARY>, ...}
tracker_fd: 39
resource_tracker: <module at 0x7fde1b96f830>
Looking at the code in popen_spawn_posix, in _launch:
...
    parent_r, child_w = os.pipe()
    child_r, parent_w = os.pipe()
    cmd = spawn.get_command_line(tracker_fd=tracker_fd,
                                 pipe_handle=child_r)
    self._fds.extend([child_r, child_w])
    self.pid = util.spawnv_passfds(spawn.get_executable(),
                                   cmd, self._fds)
    self.sentinel = parent_r
    with open(parent_w, 'wb', closefd=False) as f:
        f.write(fp.getbuffer())  # <---- it hangs here
...
So the sub proc dies because of an unpickling error. But then the parent hangs in the write, maybe because the fd is not closed yet, since the sub proc is now in zombie state and has not been cleaned up yet (no os.wait yet)?
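The blocking behaviour of the write can be checked in isolation. A write to a pipe only fails with BrokenPipeError once every read end is closed. While any read end is still open and the kernel buffer is full, a blocking write simply waits. A zombie no longer holds open file descriptors, so another candidate (an assumption, not verified against this hang) is the parent's own copy of child_r, which in the excerpt above has not been closed before the write. The sketch below uses hypothetical helper names and a non-blocking write end so that it reports the state instead of hanging.

```python
import os


def fill_pipe(write_fd, chunk=b"x" * 4096):
    """Write to a non-blocking pipe until its buffer is full; return bytes written."""
    total = 0
    while True:
        try:
            total += os.write(write_fd, chunk)
        except BlockingIOError:
            # Buffer full and a read end still open: a blocking write would hang here.
            return total


def demo():
    """Return (bytes buffered, outcome of a write after all read ends are closed)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # report "would block" instead of hanging
    try:
        written = fill_pipe(write_fd)
        # Close the only read end; from now on writes fail instead of waiting.
        os.close(read_fd)
        read_fd = -1
        try:
            os.write(write_fd, b"x")
            outcome = "write succeeded"
        except BrokenPipeError:
            outcome = "BrokenPipeError"
    finally:
        os.close(write_fd)
        if read_fd != -1:
            os.close(read_fd)
    return written, outcome


if __name__ == "__main__":
    print(demo())
```

If this is what happens here, the hang would only show up when the pickled process object is larger than the pipe buffer, since smaller payloads fit into the buffer even if nobody reads them.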