
Debug segmentation fault #344

@denadai2

Description


๐Ÿ› Describe the bug

Dear pyg-lib team,

I encounter a segmentation fault when I call:

out = torch.ops.pyg.merge_sampler_outputs(
            sampled_nodes_with_dupl,
            edge_ids,
            cumm_sampled_nbrs_per_node,
            partition_ids,
            partition_orders,
            partitions_num,
            one_hop_num,
            src_batch,
            self.disjoint,
        )

The error is:

(TrainerActor pid=15637) *** SIGSEGV received at time=1724074372 ***
(TrainerActor pid=15637) PC: @        0x110b277b0  (unknown)  pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637)     @        0x1080ecde0  (unknown)  absl::lts_20230125::WriteFailureInfo()
(TrainerActor pid=15637)     @        0x1080ecb2c  (unknown)  absl::lts_20230125::AbslFailureSignalHandler()
(TrainerActor pid=15637)     @        0x190087584  (unknown)  _sigtramp
(TrainerActor pid=15637)     @        0x110b27778  (unknown)  pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637)     @        0x110b29a80  (unknown)  c10::impl::call_functor_with_args_from_stack_<>()
(TrainerActor pid=15637)     @        0x110b29958  (unknown)  c10::impl::make_boxed_from_unboxed_functor<>::call()
(TrainerActor pid=15637)     @        0x383c8b4fc  (unknown)  torch::autograd::basicAutogradNotImplementedFallbackImpl()
(TrainerActor pid=15637)     @        0x38002e664  (unknown)  c10::Dispatcher::callBoxed()
(TrainerActor pid=15637)     @        0x10bb8af14  (unknown)  torch::jit::invokeOperatorFromPython()
(TrainerActor pid=15637)     @        0x10bb8b668  (unknown)  torch::jit::_get_operation_for_overload_or_packet()
(TrainerActor pid=15637)     @        0x10bad43a8  (unknown)  pybind11::detail::argument_loader<>::call<>()
(TrainerActor pid=15637)     @        0x10bad41f0  (unknown)  pybind11::cpp_function::initialize<>()::{lambda()#1}::__invoke()
(TrainerActor pid=15637)     @        0x10b4a9b7c  (unknown)  pybind11::cpp_function::dispatcher()
(TrainerActor pid=15637)     @        0x104f82d88  (unknown)  cfunction_call
(TrainerActor pid=15637)     @        0x104f32060  (unknown)  _PyObject_Call
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f3116c  (unknown)  _PyObject_FastCallDictTstate
(TrainerActor pid=15637)     @        0x104f326a0  (unknown)  _PyObject_Call_Prepend
(TrainerActor pid=15637)     @        0x104fa63c0  (unknown)  slot_tp_call
(TrainerActor pid=15637)     @        0x104f31348  (unknown)  _PyObject_MakeTpCall
(TrainerActor pid=15637)     @        0x105025580  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f4af60  (unknown)  gen_send_ex2
(TrainerActor pid=15637)     @        0x104cbcfac  (unknown)  task_step_impl
(TrainerActor pid=15637)     @        0x104cbcd84  (unknown)  task_step
(TrainerActor pid=15637)     @        0x104f31348  (unknown)  _PyObject_MakeTpCall
(TrainerActor pid=15637)     @        0x105044c04  (unknown)  context_run
(TrainerActor pid=15637)     @        0x104f824b8  (unknown)  cfunction_vectorcall_FASTCALL_KEYWORDS
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f34410  (unknown)  method_vectorcall
(TrainerActor pid=15637)     @        0x105026f20  (unknown)  _PyEval_EvalFrameDefault
(TrainerActor pid=15637)     @        0x104f34410  (unknown)  method_vectorcall
(TrainerActor pid=15637)     @        0x1050f6558  (unknown)  thread_run
(TrainerActor pid=15637)     @ ... and at least 3 more frames
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: *** SIGSEGV received at time=1724074372 ***
(TrainerActor pid=15637) [... the same stack trace is then repeated by the logging.cc:440 signal handler ...]
(TrainerActor pid=15637) Fatal Python error: Segmentation fault
(TrainerActor pid=15637)
(TrainerActor pid=15637) Stack (most recent call first):
(TrainerActor pid=15637)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/torch/_ops.py", line 854 in __call__
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 842 in _merge_sampler_outputs
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 948 in sample_one_hop
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 315 in node_sample
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 618 in edge_sample
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 193 in _sample_from
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/events.py", line 88 in _run
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 1987 in _run_once
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 641 in run_forever
(TrainerActor pid=15637)   File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/event_loop.py", line 108 in _run_loop
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1010 in run
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
(TrainerActor pid=15637)   File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030 in _bootstrap
(TrainerActor pid=15637)
(TrainerActor pid=15637) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_osx, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.cluster._vq, scipy.cluster._hierarchy, scipy.cluster._optimal_leaf_ordering, markupsafe._speedups, pyarrow.lib, pyarrow._json (total: 74)

Do you have a suggestion on how to debug this?

Thanks!
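Segfaults inside a C++ extension op like this are often triggered by an argument whose dtype, shape, or contiguity differs from what the kernel expects (e.g. an int32 tensor where int64 is assumed, or an empty tensor). As a cheap first step, a generic helper like the sketch below (names are my own, not part of pyg-lib; it only assumes each tensor-like argument exposes `dtype`, `shape`, and `is_contiguous()`, as a torch.Tensor does) can log every argument right before the crashing call:

```python
def describe(name, value):
    """One-line summary of an argument passed to the op.

    Duck-typed: anything exposing .dtype, .shape and .is_contiguous()
    (e.g. a torch.Tensor) gets a tensor-style summary; other values
    (bools, ints, lists) are shown via repr().
    """
    if hasattr(value, "dtype") and hasattr(value, "shape"):
        contig = value.is_contiguous() if hasattr(value, "is_contiguous") else "?"
        return f"{name}: dtype={value.dtype}, shape={tuple(value.shape)}, contiguous={contig}"
    return f"{name}: {value!r}"


def describe_args(**kwargs):
    """Print (and return) one summary line per keyword argument.

    Call this immediately before the suspect op so the last lines in the
    log show exactly what the C++ kernel received.
    """
    lines = [describe(k, v) for k, v in kwargs.items()]
    for line in lines:
        print(line)
    return lines
```

Calling `describe_args(sampled_nodes_with_dupl=sampled_nodes_with_dupl, edge_ids=edge_ids, ...)` just before `torch.ops.pyg.merge_sampler_outputs` and comparing the printed dtypes and shapes against what the operator expects should narrow down which input the kernel trips over.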

Environment

  • pyg-lib version:
  • PyTorch version:
  • OS:
  • Python version:
  • CUDA/cuDNN version:
  • How you installed PyTorch and pyg-lib (conda, pip, source):
  • Any other relevant information:
