Hang in dataset iterator #1514

@albertz

With PyTorch.

I see this exception in the log:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/config.py", line 72, in __setstate__
    self.typed_dict = unpickler.load()
                      ^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/pickle.py", line 1347, in load_binstring
    self.append(self._decode_string(data))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/pickle.py", line 1329, in _decode_string
    return value.decode(self.encoding, self.errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
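
Side note: this exception is exactly what pickle raises when load_binstring hits an old-style binary string (BINSTRING opcode) containing non-ASCII bytes while the default encoding="ASCII" is in effect. A minimal sketch of that failure mode (hand-crafted data, not the actual config contents):

import pickle

# BINSTRING opcode 'T', 4-byte little-endian length 3, then the UTF-8 bytes of
# "€" (which start with 0xe2), then STOP '.'.
data = b"T\x03\x00\x00\x00\xe2\x82\xac."
try:
    pickle.loads(data)  # default encoding="ASCII"
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xe2 in position 0 ...
print(pickle.loads(data, encoding="utf-8"))  # decodes fine as "€"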

But then the trainer hangs.

ps tree:

  90521          \_ python3.11                                                  
  90657          |   \_ python3.11
  90658          |   \_ watch memory
  90777          |   \_ MPD worker 0
  90779          |   \_ MPD worker 1
  90781          |   \_ MPD worker 2
  90782          |   \_ MPD worker 3
  91244          |   \_ python3.11  
  91362          |   \_ MPD worker 0
  91364          |   \_ MPD worker 1
  91365          |   \_ MPD worker 2
  91369          |   \_ MPD worker 3
  91801          |   \_ MPD worker 0
  91802          |   \_ MPD worker 1
  91803          |   \_ MPD worker 2
  91804          |   \_ MPD worker 3
  92265          |   \_ python3.11 <defunct>

92265 should be the sub proc which got the exception.

Py-spy on the parent:

% py-spy dump -p 90521     
Process 90521: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wlbfTg0IM0nc/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)

Thread 90521 (idle): "MainThread"
    _launch (multiprocessing/popen_spawn_posix.py:62)
    __init__ (multiprocessing/popen_fork.py:19)
    __init__ (multiprocessing/popen_spawn_posix.py:32)
    _Popen (multiprocessing/context.py:288)
    start (multiprocessing/process.py:121)
    start (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:55)
    __init__ (torch/utils/data/dataloader.py:1039)
    _get_iterator (torch/utils/data/dataloader.py:386)
    __iter__ (torch/utils/data/dataloader.py:433)
    train_epoch (returnn/returnn/torch/engine.py:342)
    train (returnn/returnn/torch/engine.py:236)
    execute_main_task (returnn/returnn/__main__.py:543)
    main (returnn/returnn/__main__.py:737)
    <module> (returnn/rnn.py:11)

Or pystack:

...
    (Python) File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/util/multi_proc_non_daemonic_spawn.py", line 55, in start
        super().start()
      Arguments:
        self: <NonDaemonicSpawnProcess at 0x7fdda657d890>
      Locals:
        __class__: <cell at 0x7fddda571300>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/process.py", line 121, in start
        self._popen = self._Popen(self)
      Arguments:
        self: <NonDaemonicSpawnProcess at 0x7fdda657d890>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
        return Popen(process_obj)
      Arguments:
        process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
      Locals:
        Popen: <type at 0x113085f0>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      Arguments:
        process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
        self: <Popen at 0x7fddad9ecbd0>
      Locals:
        __class__: <cell at 0x7fde1b92e620>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      Arguments:
        process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
        self: <Popen at 0x7fddad9ecbd0>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 62, in _launch
        f.write(fp.getbuffer())
      Arguments:
        process_obj: <NonDaemonicSpawnProcess at 0x7fdda657d890>
        self: <Popen at 0x7fddad9ecbd0>
      Locals:
        cmd: [b"/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11", "-c", ...]
        f: <_io.BufferedWriter at 0x7fdda65605c0>
        parent_w: 126
        child_r: 125
        parent_r: 123
        fp: <_io.BytesIO at 0x7fdda657ba60>
        child_w: 124
        prep_data: {"log_to_stderr": False, "authkey": <BINARY>, ...}
        tracker_fd: 39
        resource_tracker: <module at 0x7fde1b96f830>

Looking at the code in popen_spawn_posix._launch:

...
            parent_r, child_w = os.pipe()
            child_r, parent_w = os.pipe()
            cmd = spawn.get_command_line(tracker_fd=tracker_fd,
                                         pipe_handle=child_r)
            self._fds.extend([child_r, child_w])
            self.pid = util.spawnv_passfds(spawn.get_executable(),
                                           cmd, self._fds)
            self.sentinel = parent_r
            with open(parent_w, 'wb', closefd=False) as f:
                f.write(fp.getbuffer())   # <---- it hangs here
...

So the sub proc dies because of some pickling error. But then the parent hangs on the pipe write, presumably because the pipe fd is not closed yet: the sub proc is in zombie state, not cleaned up yet (no os.wait yet)?
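
For reference, a minimal sketch of how such a write can block (my construction, assuming the relevant detail is that the parent itself still holds the read end child_r open, since it is added to self._fds above and only closed after the write): nobody drains the pipe, a read end is still open, so f.write cannot fail with BrokenPipeError and simply blocks once the kernel pipe buffer (typically 64 KiB on Linux) is full.

import os

# Keep both ends of the pipe open in this process, like _launch does with
# child_r/parent_w while it writes the spawn payload.
child_r, parent_w = os.pipe()

payload = b"x" * (1 << 20)  # 1 MiB, far more than the default pipe buffer
with open(parent_w, "wb", closefd=False) as f:
    # Nobody reads the pipe and a read end is still open, so this blocks
    # forever once the buffer is full instead of raising BrokenPipeError.
    f.write(payload)

If that is what happens, the pickling error is only the trigger: any child that dies before reading the full (large) spawn payload would leave the parent stuck in that f.write.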
