-
I encountered an error. I simplified the code below to reproduce it, and commented in the code how to bypass the error.
import torch
from torch_geometric.loader import NeighborLoader
import torch.multiprocessing as mp
from torch_geometric.data import Data
def run(rank, data) -> None:
    """Worker entry point for ``mp.spawn``: builds a NeighborLoader over
    ``data`` and pulls a single batch to trigger (or reproduce) the error.

    Args:
        rank: Worker index passed by ``mp.spawn`` (unused here).
        data: A ``torch_geometric.data.Data`` object carrying ``train_idx``.
    """
    # This line will later produce error
    # NOTE(review): `split` with the full length returns a single chunk that
    # is a *view* of `data.train_idx` — same data_ptr, same values, but not
    # an independent tensor. Presumably that view is what breaks when the
    # persistent worker process accesses it — TODO confirm.
    train_idx = data.train_idx.split(data.train_idx.size(0))[0]
    # Uncomment to fix error. Though `data_ptr()` and all tensor values are the same for tensor above.
    # train_idx = data.train_idx
    train_loader = NeighborLoader(
        data=data,
        input_nodes=train_idx,
        num_neighbors=[5, 5],
        shuffle=True,
        drop_last=True,
        batch_size=1,
        num_workers=1,               # worker subprocess is where the failure surfaces
        persistent_workers=True,
    )
    # here error will occur
    print(next(iter(train_loader)))
    return None
# Basic data initialization and process spawning
if __name__ == '__main__':
    # Deterministic setup so the reproduction behaves the same on every run.
    torch.manual_seed(0)

    # Synthetic graph dimensions: 1M nodes, 2M random edges, 10-dim features,
    # with one fifth of the nodes used as training indices.
    n_nodes = 1000000
    feat_dim = 10
    n_edges = n_nodes * 2
    n_train = n_nodes // 5

    graph = Data(
        x=torch.rand(n_nodes, feat_dim),
        edge_index=torch.randint(0, n_nodes, (2, n_edges)),
        train_idx=torch.randperm(n_train),
    )

    # Launch a single worker process running `run`; join before exiting.
    mp.spawn(run, args=(graph, ), nprocs=1, join=True)
Environment
Thank you |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 2 replies
-
The following code works.
I remember Matthias mentioned that num_workers can not > 0 because cuda library limit. But I cannot find the source. |
Beta Was this translation helpful? Give feedback.
-
I think your example crashes because `train_idx = data.train_idx.split(data.train_idx.size(0))[0]` creates a view.
`train_idx = train_idx.clone()` fixes this for me. |
Beta Was this translation helpful? Give feedback.
I think your example crashes because `split` creates a view, and this view is corrupted since multiple processes are trying to access it. `train_idx = train_idx.clone()` fixes this for me.