Problem about finetuning VGGT #210

@YitongD

Hello, when I finetune VGGT on multiple GPUs, the following error occurs at random, at a different training step each time:

[rank1]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f94df59c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f94df551bb6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f94e97d2e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1914e (0x7f94e97a014e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x1b7ed (0x7f94e97a27ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x1bbc5 (0x7f94e97a2bc5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x483b00 (0x7f94de3f0b00 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f94df578419 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10d::Reducer::~Reducer() + 0x32a (0x7f94d7f8be1a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0xbaf2a2 (0x7f94deb1c2a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: + 0x3af29a (0x7f94de31c29a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: + 0xbb3a51 (0x7f94deb20a51 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: + 0x3b9cdd (0x7f94de326cdd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #13: + 0x3babc1 (0x7f94de327bc1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #14: + 0x1368b3 (0x55d67247f8b3 in /usr/bin/python)
frame #15: + 0x172aa0 (0x55d6724bbaa0 in /usr/bin/python)
frame #16: + 0x12815f (0x55d67247115f in /usr/bin/python)
frame #17: _PyObject_GenericSetAttrWithDict + 0x160 (0x55d6724768d0 in /usr/bin/python)
frame #18: PyObject_SetAttr + 0x70 (0x55d672476620 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x1041 (0x55d67248cbf1 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #21: + 0x159dc9 (0x55d6724a2dc9 in /usr/bin/python)
frame #22: PyObject_CallFunctionObjArgs + 0xa3 (0x55d6724a2c83 in /usr/bin/python)
frame #23: + 0x236d65 (0x55d67257fd65 in /usr/bin/python)
frame #24: _PyObject_GenericSetAttrWithDict + 0x73b (0x55d672476eab in /usr/bin/python)
frame #25: PyObject_SetAttr + 0x70 (0x55d672476620 in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x1041 (0x55d67248cbf1 in /usr/bin/python)
frame #27: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x8ac (0x55d67248c45c in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x8ac (0x55d67248c45c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x614a (0x55d672491cfa in /usr/bin/python)
frame #33: + 0x1687f1 (0x55d6724b17f1 in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #39: + 0x1687f1 (0x55d6724b17f1 in /usr/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #41: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #43: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #45: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x198c (0x55d67248d53c in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #51: + 0x13f9c6 (0x55d6724889c6 in /usr/bin/python)
frame #52: PyEval_EvalCode + 0x86 (0x55d67257e256 in /usr/bin/python)
frame #53: + 0x23ae2d (0x55d672583e2d in /usr/bin/python)
frame #54: + 0x15ac59 (0x55d6724a3c59 in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x6bd (0x55d67248c26d in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x55d6724a39fc in /usr/bin/python)
frame #59: + 0x252c2d (0x55d67259bc2d in /usr/bin/python)
frame #60: Py_RunMain + 0x128 (0x55d67259a8c8 in /usr/bin/python)
frame #61: Py_BytesMain + 0x2d (0x55d67257102d in /usr/bin/python)
frame #62: + 0x29d90 (0x7f94eb5a0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #63: __libc_start_main + 0x80 (0x7f94eb5a0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Upon checking, the error occurs in loss.backward().
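
Because CUDA errors are reported asynchronously, the Python line where the failure surfaces may not be the call that launched the faulty kernel. A minimal sketch of one way to narrow this down, assuming the standard `CUDA_LAUNCH_BLOCKING` and `TORCH_DISTRIBUTED_DEBUG` environment variables; `enable_sync_cuda_debug` is a hypothetical helper name, not part of VGGT:

```python
import os


def enable_sync_cuda_debug(env=os.environ):
    """Set debug variables without overriding values the user already chose."""
    # Make kernel launches synchronous so the error is raised at the
    # offending call rather than at a later, unrelated one.
    env.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    # Ask torch.distributed for more detailed diagnostics.
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    return env


# Call this at the very top of the training script, before CUDA is
# initialized (in practice, before `import torch`).
enable_sync_cuda_debug()
```

Synchronous launches slow training considerably, so this is only meant for reproducing the crash, not for regular runs.
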
Have you encountered this problem during training? Note that because I want an input image resolution of 266*266, I interpolated aggregator.patch_embed.pos_embed when loading the pretrained model. Could this be related to the error?
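
For reference, the kind of interpolation described above can be sketched as follows. This is an illustration under assumptions, not the code actually used: it assumes a `pos_embed` of shape `(1, num_prefix_tokens + N*N, C)` with one leading class token and a patch size of 14 (so 266 / 14 = 19 patches per side); `resize_pos_embed` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, new_grid, num_prefix_tokens=1):
    """Bicubically resize the patch part of a (1, P + N*N, C) embedding."""
    prefix = pos_embed[:, :num_prefix_tokens]   # e.g. class token, kept as-is
    patch = pos_embed[:, num_prefix_tokens:]    # (1, N*N, C)
    old_grid = int(round(patch.shape[1] ** 0.5))
    if old_grid * old_grid != patch.shape[1]:
        raise ValueError("patch embeddings do not form a square grid")
    dim = patch.shape[-1]
    # (1, N*N, C) -> (1, C, N, N), the layout F.interpolate expects
    patch = patch.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch = F.interpolate(
        patch.float(), size=(new_grid, new_grid),
        mode="bicubic", align_corners=False,
    )
    # Back to (1, new_grid*new_grid, C)
    patch = patch.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    # contiguous() so the resulting parameter has a plain memory layout
    return torch.cat([prefix, patch.to(pos_embed.dtype)], dim=1).contiguous()


# Usage: a 37x37 grid plus one class token, resized to a 19x19 grid.
old = torch.randn(1, 1 + 37 * 37, 8)
new = resize_pos_embed(old, new_grid=266 // 14)
print(new.shape)  # 19 * 19 + 1 = 362 tokens
```

After such a resize, it is worth checking on every rank that the new tensor's shape matches the number of tokens the model actually produces for a 266*266 input.
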
