Fix: Fix crash in Debug build with multi-GPU due to forced cudaSetDevice(0) #6498

wangtianxiang · 2025-09-10T02:13:46Z

🐞 Bug Behavior

When ABACUS is compiled with CMAKE_BUILD_TYPE=Debug and any GPU test case is run on 2 GPUs via mpirun -np 2 abacus, the program crashes at a cudaMemset call with cudaErrorInvalidValue (invalid argument).

In builds without CMAKE_BUILD_TYPE=Debug the program runs without crashing, but no device information is written to device.log.

🔍 Root Cause

Through debugging, we found that the constructor Psi<T, Device>::Psi calls base_device::information::print_device_info, which internally invokes cudaSetDevice(0) — forcing all MPI ranks to use GPU 0.

This creates a critical inconsistency in multi-GPU runs:

1. Initially, rank 0 and rank 1 are correctly bound to GPU 0 and GPU 1 respectively, and each allocates device memory on its assigned GPU.
2. When print_device_info is called (e.g., for logging), rank 1 is forcibly switched to GPU 0 via cudaSetDevice(0).
3. Later, when cudaMemset is called on rank 1, it attempts to set memory that resides on GPU 1 while the current device is GPU 0, which results in cudaErrorInvalidValue.

The root issue: Hard-coded cudaSetDevice(0) inside a shared utility function breaks multi-GPU context isolation in MPI environments.

⚠️ To legally access memory across different GPUs (e.g., GPU 1 accessing GPU 0’s memory), Peer-to-Peer access must be explicitly enabled via cudaDeviceEnablePeerAccess.

In non-Debug builds (Release with -O3), the program does not crash but writes nothing to device.log. The cause is a missing declaration of a template specialization:

- A specialization print_device_info<DEVICE_GPU> is defined in output_device.cpp but not declared in device.h.
- Translation units that include only device.h cannot see the specialization and may implicitly instantiate the empty primary template. The C++ standard makes such a program ill-formed, no diagnostic required, so which version ends up being called is not guaranteed.
- In Debug mode (-O0), the linker often resolves to the specialization, so cudaSetDevice(0) is called and the program crashes.
- In Release mode (-O3), the compiler aggressively inlines and optimizes and often picks the empty primary template, so cudaSetDevice(0) is never called: no crash, but no logging output either.
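The declaration rule can be illustrated with a minimal single-file C++ sketch. The names DEVICE_CPU, DEVICE_GPU, and print_device_info mirror the PR text, but the signature (a string return value instead of writing to a log) and the bodies are hypothetical stand-ins chosen to keep the example self-contained; this is not the ABACUS source.

```cpp
#include <string>

// Hypothetical stand-ins for the device tag types named in the PR.
struct DEVICE_CPU {};
struct DEVICE_GPU {};

// --- what would live in the header (device.h in the PR) ---

// Primary template: the "empty" default that reports nothing.
template <typename Device>
std::string print_device_info(const Device*) {
    return "";
}

// Declaration of the explicit specialization. With this line in the header,
// every translation unit that includes it knows a GPU-specific definition
// exists and will not implicitly instantiate the primary template for it.
template <>
std::string print_device_info<DEVICE_GPU>(const DEVICE_GPU*);

// --- what would live in the source file (output_device.cpp in the PR) ---

// Definition of the explicit specialization (placeholder body).
template <>
std::string print_device_info<DEVICE_GPU>(const DEVICE_GPU*) {
    return "GPU info";
}
```

With the declaration in place, a call such as print_device_info<DEVICE_GPU>(nullptr) is bound to the specialization in every translation unit, independent of optimization level.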

🛠️ Solution

To fix this issue, we apply two key changes:

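1. Replace cudaSetDevice(0) with cudaGetDevice() in print_device_info, so that the function queries the device the rank is already bound to instead of changing it.
2. Declare the template specialization in device.h, so that every translation unit uses the GPU version.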
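The effect of the first change can be sketched without a GPU. The snippet below is a simulation under stated assumptions: mock_runtime is a hypothetical stand-in for the per-process "current device" state kept by the CUDA runtime, and the two print_device_info_* functions are illustrative names, not ABACUS code. It only shows why a logging helper should query the current device rather than set it.

```cpp
// Hypothetical stand-in for the runtime's per-process "current device".
// set_device / get_device mimic the roles of cudaSetDevice / cudaGetDevice.
namespace mock_runtime {
int current_device = 0;
inline int set_device(int id) { current_device = id; return 0; }
inline int get_device(int* id) { *id = current_device; return 0; }
} // namespace mock_runtime

// Before the fix: the logging helper rebinds the caller to device 0
// as a side effect, whatever device the rank was using.
inline int print_device_info_forced() {
    mock_runtime::set_device(0);
    return 0; // index of the device whose properties would be reported
}

// After the fix: the helper only asks which device is current
// and leaves the binding untouched.
inline int print_device_info_queried() {
    int id = -1;
    mock_runtime::get_device(&id);
    return id; // index of the device whose properties would be reported
}
```

In the real runtime the corresponding calls are cudaGetDevice(int*) and cudaSetDevice(int). A rank bound to device 1 stays on device 1 after the querying variant, whereas the forcing variant leaves it on device 0, which is the state that made the later cudaMemset fail.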


mohanchen · 2025-09-10T07:43:32Z

Good catch! Thanks a lot for your contribution!


…w. (#6490)
* Feature: add DFT-1/2 and shell DFT-1/2, currently only support PW esolver_ks_pw.
* Added Sep, Sep_Cell, and VSep to organize the self-energy potential of DFT-1/2
* Added a new effective potential pot_sep for calculating the self-energy potential
* Added initialization of the self-energy potential in the esolver_ks_pw control flow
* Added the keyword SEP_FILES in the STRU file for reading self-energy potential files
* Added the dfthalf_type keyword in INPUT to enable DFT-1/2 and shell DFT-1/2
* Fix: Compilation error in DeepKS unit tests after adding DFT-1/2
* Fix: Add the additional files to Makefile.Objects
* Build(deps): Bump actions/setup-python from 5 to 6 (#6492)
  Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
  - [Release notes](https://github.com/actions/setup-python/releases)
  - [Commits](actions/setup-python@v5...v6)
  ---
  updated-dependencies:
  - dependency-name: actions/setup-python
    dependency-version: '6'
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* [Refactor] Move hardware initializer out from esolver code (#6494)
* Move hardware initializer out from esolver
* Remove useless codes
* Remove finalize code out
* Feature: support NVTX profiling via timer_enable_nvtx flag (#6495)
* Feature: support NVTX profiling via timer_enable_nvtx flag Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Add timer_enable_nvtx section in markdown Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Fix: Use __USE_NVTX macro to avoid NVTX linking errors in tests. Clarify in docs that timer_enable_nvtx parameter only takes effect on CUDA platforms. Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Perf: Optimize Davidson by fusing operators, offloading CPU computation to GPU, and reducing memory transfers (#6493)
* Perf: Optimize Diago_DavSubspace with GPU operators by adding and fusing custom kernels. Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Perf: reduce memory allocation and copy in Diago_DavSubspace::diag_zhegvx Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Perf: Replace loop-based 2D copy and memset with memcpy_2d_op, memset_2d_op Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Perf: use warp reduce instead of shared memory for better efficiency Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Fix compilation error Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Fix: resolve compile error with USE_ELPA=OFF + BUILD_TESTING=ON and switch to nvtx3 headers when CUDA_VERSION >= 12090 (#6497)
* Fix: switch to nvtx3 headers when CUDA_VERSION >= 12090 Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Fix: resolve compile error with USE_ELPA=OFF + BUILD_TESTING=ON Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Fix dsp compilation problem (#6499)
* Fix: Fix crash in Debug build with multi-GPU due to forced cudaSetDevice(0) (#6498) Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
* Removed the temporary variable DMRGint_full when transitioning from 2D block parallelism to serial in Hcontainer(develop) (#6489)
* delete tem Hcontainer to reduce memory usage
* simplify the compute code
* change DM2D_tmp to dm2d_tmp, use vector instead of new
* Update version to 3.9.0.14 (#6504)
* Refactor: Remove the GlobalC from sep_cell and vsep_cell
* Removed GlobalC::sep_cell and GlobalC::vsep_cell from GlobalC
* Integrated sep_cell into UnitCell
* Integrated vsep_cell into esolver_ks_pw
* Added empty constructors and destructors for Sep_Pot and Sep_Cell to facilitate unit testing compilation

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Critsium <[email protected]>
Co-authored-by: Tianxiang Wang <[email protected]>
Co-authored-by: zgn-26714 <[email protected]>
Co-authored-by: Erjie Wu <[email protected]>
Co-authored-by: Mohan Chen <[email protected]>


Commit 0425eb8: Fix: Fix crash in Debug build with multi-GPU due to forced cudaSetDevice(0)

Signed-off-by: Tianxiang Wang <[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

mohanchen approved these changes Sep 10, 2025


mohanchen added the labels Bugs (Bugs that only solvable with sufficient knowledge of DFT), GPU & DCU & HPC (GPU and DCU and HPC related any issues), and Refactor (Refactor ABACUS codes) on Sep 10, 2025

mohanchen merged commit 305bbf6 into deepmodeling:develop Sep 10, 2025
14 checks passed

wangtianxiang deleted the fix_print_device_info_bug branch October 10, 2025 08:31
