Description
Hello, when I fine-tune VGGT on multiple GPUs, the following error occurs at random, at a different step in each run:
[rank1]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f94df59c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f94df551bb6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f94e97d2e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1914e (0x7f94e97a014e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1b7ed (0x7f94e97a27ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1bbc5 (0x7f94e97a2bc5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x483b00 (0x7f94de3f0b00 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f94df578419 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10d::Reducer::~Reducer() + 0x32a (0x7f94d7f8be1a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xbaf2a2 (0x7f94deb1c2a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3af29a (0x7f94de31c29a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xbb3a51 (0x7f94deb20a51 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x3b9cdd (0x7f94de326cdd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x3babc1 (0x7f94de327bc1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x1368b3 (0x55d67247f8b3 in /usr/bin/python)
frame #15: <unknown function> + 0x172aa0 (0x55d6724bbaa0 in /usr/bin/python)
frame #16: <unknown function> + 0x12815f (0x55d67247115f in /usr/bin/python)
frame #17: _PyObject_GenericSetAttrWithDict + 0x160 (0x55d6724768d0 in /usr/bin/python)
frame #18: PyObject_SetAttr + 0x70 (0x55d672476620 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x1041 (0x55d67248cbf1 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #21: <unknown function> + 0x159dc9 (0x55d6724a2dc9 in /usr/bin/python)
frame #22: PyObject_CallFunctionObjArgs + 0xa3 (0x55d6724a2c83 in /usr/bin/python)
frame #23: <unknown function> + 0x236d65 (0x55d67257fd65 in /usr/bin/python)
frame #24: _PyObject_GenericSetAttrWithDict + 0x73b (0x55d672476eab in /usr/bin/python)
frame #25: PyObject_SetAttr + 0x70 (0x55d672476620 in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x1041 (0x55d67248cbf1 in /usr/bin/python)
frame #27: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x8ac (0x55d67248c45c in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x8ac (0x55d67248c45c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x614a (0x55d672491cfa in /usr/bin/python)
frame #33: <unknown function> + 0x1687f1 (0x55d6724b17f1 in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #39: <unknown function> + 0x1687f1 (0x55d6724b17f1 in /usr/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #41: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #43: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #45: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #51: <unknown function> + 0x13f9c6 (0x55d6724889c6 in /usr/bin/python)
frame #52: PyEval_EvalCode + 0x86 (0x55d67257e256 in /usr/bin/python)
frame #53: <unknown function> + 0x23ae2d (0x55d672583e2d in /usr/bin/python)
frame #54: <unknown function> + 0x15ac59 (0x55d6724a3c59 in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #59: <unknown function> + 0x252c2d (0x55d67259bc2d in /usr/bin/python)
frame #60: Py_RunMain + 0x128 (0x55d67259a8c8 in /usr/bin/python)
frame #61: Py_BytesMain + 0x2d (0x55d67257102d in /usr/bin/python)
frame #62: <unknown function> + 0x29d90 (0x7f94eb5a0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #63: __libc_start_main + 0x80 (0x7f94eb5a0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
After checking, I found that the error is raised during loss.backward().
Have you encountered this problem during training? Note that because I want an input image resolution of 266×266, I interpolated aggregator.patch_embed.pos_embed when loading the pretrained model. Could this be the cause?
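Because CUDA kernels are launched asynchronously, the call that reports an illegal memory access is not necessarily the one that caused it. One common way to narrow this down is to rerun with synchronous kernel launches and more verbose distributed logging. A minimal sketch follows; the helper name `debug_env` and the launch command shown in the comment are hypothetical, while the three environment variables are the standard CUDA, PyTorch, and NCCL ones.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with debugging switches enabled.

    `base` defaults to the current process environment; the helper itself
    is an illustrative assumption, not part of any training script.
    """
    env = dict(os.environ if base is None else base)
    # Make kernel launches synchronous so errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask torch.distributed for extra consistency checks and logging.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    # Print NCCL initialization and communication details.
    env["NCCL_DEBUG"] = "INFO"
    return env


# Hypothetical usage (script name and GPU count are placeholders):
#   import subprocess
#   subprocess.run(["torchrun", "--nproc_per_node=2", "train.py"], env=debug_env())
```

Synchronous launches slow training down considerably, so this is only meant for reproducing the crash, not for regular runs.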
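For reference, a minimal sketch of this kind of positional-embedding interpolation, not the exact code used here. It assumes a DINOv2-style layout of shape (1, num_prefix_tokens + H*W, dim) with a square pretrained grid, a patch size of 14 (so 266×266 gives a 19×19 grid), and a hypothetical helper name `resize_pos_embed`. The shape check and the final `.contiguous()` are there to rule out a mismatched or non-contiguous parameter being loaded into the model.

```python
import math

import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, new_grid, num_prefix_tokens=1):
    """Bicubically resize the patch part of a ViT positional embedding.

    pos_embed: tensor of shape (1, num_prefix_tokens + old_side**2, dim).
    new_grid:  (height, width) of the target patch grid.
    """
    prefix = pos_embed[:, :num_prefix_tokens]   # e.g. class token, kept as-is
    patch = pos_embed[:, num_prefix_tokens:]    # (1, old_side**2, dim)
    old_n, dim = patch.shape[1], patch.shape[2]
    old_side = math.isqrt(old_n)
    if old_side * old_side != old_n:
        raise ValueError(f"patch token count {old_n} is not a square grid")

    # (1, N, dim) -> (1, dim, old_side, old_side) for 2D interpolation.
    grid = patch.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid.float(), size=new_grid, mode="bicubic",
                         align_corners=False)
    # Back to (1, new_h * new_w, dim), in the original dtype.
    new_h, new_w = new_grid
    patch = grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    patch = patch.to(pos_embed.dtype)
    return torch.cat([prefix, patch], dim=1).contiguous()


# Assumed example: pretrained 37x37 grid (518 / 14) resized to 19x19 (266 / 14).
pretrained = torch.randn(1, 1 + 37 * 37, 8)
resized = resize_pos_embed(pretrained, (19, 19))
assert resized.shape == (1, 1 + 19 * 19, 8)
```

If the resized tensor is wrapped in a new `torch.nn.Parameter` after the model has been wrapped in DistributedDataParallel, the reducer may still hold references to the old parameter, so doing the replacement before wrapping is the safer order.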