@wangtianxiang wangtianxiang commented Sep 10, 2025

🐞 Bug Behavior

When compiling with CMAKE_BUILD_TYPE=Debug and running any GPU test case with mpirun -np 2 abacus on 2 GPUs, the program crashes at a cudaMemset call with cudaErrorInvalidValue (invalid argument).
[Image: screenshot of the error output]

Additionally, even when not using CMAKE_BUILD_TYPE=Debug, although the program runs without crashing, no device information is printed in device.log.

🔍 Root Cause

Through debugging, we found that the constructor Psi<T, Device>::Psi calls base_device::information::print_device_info, which internally invokes cudaSetDevice(0) — forcing all MPI ranks to use GPU 0.

This creates a critical inconsistency in multi-GPU runs:

  • Initially, rank 0 and rank 1 are correctly bound to GPU 0 and GPU 1 respectively, and allocate device memory on their assigned GPUs.
  • However, when print_device_info is called (e.g., for logging), rank 1 is forcibly switched to GPU 0 via cudaSetDevice(0).
  • Later, when cudaMemset is called on rank 1, it attempts to set memory that resides on GPU 1 while the current device is GPU 0 → resulting in cudaErrorInvalidValue.

The root issue: Hard-coded cudaSetDevice(0) inside a shared utility function breaks multi-GPU context isolation in MPI environments.
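The sequence above can be replayed with a small host-only model. This is a sketch, not CUDA code: `FakeRuntime`, `Status`, `print_device_info_buggy`, and `replay_rank` are hypothetical names introduced for illustration, and the rule that a memset on another device's buffer fails is taken from the description above rather than from the real runtime.

```cpp
#include <cassert>
#include <map>

// Host-only model; FakeRuntime and Status are hypothetical stand-ins, not CUDA APIs.
enum class Status { Success, InvalidValue };

struct FakeRuntime {
    int current_device = 0;        // the calling rank's "current device"
    std::map<int, int> owner_of;   // buffer handle -> device that owns it
    int next_handle = 1;

    void set_device(int dev) { current_device = dev; }

    // Allocate on whichever device is current and return a handle.
    int alloc_on_current() {
        owner_of[next_handle] = current_device;
        return next_handle++;
    }

    // Modeled rule (from the description above): without peer access,
    // setting memory owned by another device is rejected.
    Status memset_buffer(int handle) const {
        auto it = owner_of.find(handle);
        if (it == owner_of.end() || it->second != current_device) {
            return Status::InvalidValue;
        }
        return Status::Success;
    }
};

// Mirrors the problematic helper: switches to device 0 and never switches back.
inline void print_device_info_buggy(FakeRuntime& rt) { rt.set_device(0); }

// Replays the sequence for one rank bound to `rank_device`.
inline Status replay_rank(int rank_device) {
    FakeRuntime rt;
    rt.set_device(rank_device);             // rank binds to its GPU
    const int buf = rt.alloc_on_current();  // buffer lives on that GPU
    print_device_info_buggy(rt);            // logging silently switches to GPU 0
    return rt.memset_buffer(buf);           // runs under the wrong current device
}

// Rank 0 is unaffected because it was already on device 0; rank 1 fails.
inline void self_check() {
    assert(replay_rank(0) == Status::Success);
    assert(replay_rank(1) == Status::InvalidValue);
}
```

In this model only the rank bound to a non-zero device is affected, which matches the observation that the failure appears with 2 GPUs.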

⚠️ To legally access memory across different GPUs (e.g., GPU 1 accessing GPU 0’s memory), Peer-to-Peer access must be explicitly enabled via cudaDeviceEnablePeerAccess.

Additionally, in non-Debug builds (Release with -O3), the program does not crash but produces no output in device.log. This is caused by a missing declaration of a template specialization:

  • A specialization print_device_info<DEVICE_GPU> is defined in output_device.cpp, but not declared in device.h.
  • As a result, other translation units that call print_device_info<DEVICE_GPU> may implicitly instantiate the empty primary template. C++ requires an explicit specialization to be declared before its first use in every translation unit that uses it; otherwise the program is ill-formed with no diagnostic required, and which version actually runs is not guaranteed.
  • In Debug mode (-O0), the linker often resolves to the specialization → cudaSetDevice(0) is called → crash.
  • In Release mode (-O3), the compiler aggressively inlines/optimizes and often picks the empty primary template → no cudaSetDevice(0) → no crash, but no logging output.
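The declaration pattern that avoids this can be sketched in a single file. The signature here is simplified for illustration (it returns a string so the result can be checked); the real print_device_info signature in device.h may differ, and `log_gpu` is a hypothetical caller.

```cpp
#include <cassert>
#include <string>

struct DEVICE_CPU {};
struct DEVICE_GPU {};

// --- what a header such as device.h would contain ---

// Primary template: intentionally does nothing.
template <typename Device>
std::string print_device_info(const Device*) { return ""; }

// Declaration of the explicit specialization. Any translation unit that
// includes the header now knows not to instantiate the primary template
// for DEVICE_GPU.
template <>
std::string print_device_info<DEVICE_GPU>(const DEVICE_GPU*);

// --- a caller that appears before the definition (hypothetical) ---
inline std::string log_gpu() {
    return print_device_info<DEVICE_GPU>(static_cast<const DEVICE_GPU*>(nullptr));
}

// --- what a source file such as output_device.cpp would contain ---
template <>
std::string print_device_info<DEVICE_GPU>(const DEVICE_GPU*) {
    return "gpu info";
}

// The caller reaches the specialization; other devices use the primary template.
inline void self_check() {
    assert(log_gpu() == "gpu info");
    assert(print_device_info<DEVICE_CPU>(nullptr).empty());
}
```

Without the declaration in the header part, the call in `log_gpu` would implicitly instantiate the primary template, and the later specialization would conflict with it.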

🛠️ Solution

To fix this issue, we apply two key changes:

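  1. Replace cudaSetDevice(0) with cudaGetDevice() in print_device_info, so that the function queries the device the rank is already bound to instead of switching devices
  2. Declare the print_device_info<DEVICE_GPU> specialization in device.h, so that every translation unit uses the specialized version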
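A minimal sketch of the first change follows; it is not the merged code. `describe_current_device` and `demo_rank` are hypothetical names. To let the sketch build without the CUDA toolkit, it falls back to host-only stand-ins that copy the signatures of cudaSetDevice and cudaGetDevice; defining the assumed macro USE_REAL_CUDA switches to the real runtime header instead.

```cpp
#include <cassert>
#include <sstream>
#include <string>

#ifdef USE_REAL_CUDA  // assumed macro name, for this sketch only
#include <cuda_runtime.h>
#else
// Host-only stand-ins with the same signatures as the CUDA runtime calls.
// They only track an integer and exist so the sketch compiles without CUDA.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;
static int g_current_device = 0;
static inline cudaError_t cudaSetDevice(int device) {
    g_current_device = device;
    return cudaSuccess;
}
static inline cudaError_t cudaGetDevice(int* device) {
    *device = g_current_device;
    return cudaSuccess;
}
#endif

// Logging helper that only queries: it reports on the device the calling
// rank is already bound to and never changes the current device.
inline std::string describe_current_device() {
    int dev = -1;
    if (cudaGetDevice(&dev) != cudaSuccess) {
        return "device query failed";
    }
    std::ostringstream os;
    os << "current device id: " << dev;
    return os.str();
}

// Binds to `rank_device`, logs, and returns the device that is current afterwards.
inline int demo_rank(int rank_device) {
    cudaSetDevice(rank_device);
    describe_current_device();
    int after = -1;
    cudaGetDevice(&after);
    return after;
}

// With the stand-ins, the binding survives the logging call.
inline void self_check() {
    assert(demo_rank(1) == 1);
    assert(describe_current_device() == "current device id: 1");
}
```

If a helper ever does need to switch devices temporarily, saving the result of cudaGetDevice and restoring it with cudaSetDevice before returning would be one way to keep the caller's binding intact.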

…ice(0)

Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
@mohanchen
Collaborator

Good catch! Thanks a lot for your contribution!

@mohanchen mohanchen added the following labels Sep 10, 2025:
  • Bugs (Bugs that only solvable with sufficient knowledge of DFT)
  • GPU & DCU & HPC (GPU and DCU and HPC related any issues)
  • Refactor (Refactor ABACUS codes)
@mohanchen mohanchen merged commit 305bbf6 into deepmodeling:develop Sep 10, 2025
14 checks passed
Wuming-HUST pushed a commit to Wuming-HUST/abacus-develop that referenced this pull request Sep 12, 2025
…ice(0) (deepmodeling#6498)

Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
mohanchen added a commit that referenced this pull request Sep 20, 2025
…w. (#6490)

* Feature: add DFT-1/2 and shell DFT-1/2, currently only support PW esolver_ks_pw.

* Added Sep, Sep_Cell, and VSep to organize the self-energy potential of
DFT-1/2

* Added a new effective potential pot_sep for calculating the
self-energy potential

* Added initialization of the self-energy potential in the esolver_ks_pw
control flow

* Added the keyword SEP_FILES in the STRU file for reading self-energy
potential files

* Added the dfthalf_type keyword in INPUT to enable DFT-1/2 and shell
DFT-1/2

* Fix: Compilation error in DeepKS unit tests after adding DFT-1/2

* Fix: Add the additional files to Makefile.Objects

* Build(deps): Bump actions/setup-python from 5 to 6 (#6492)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Refactor] Move hardware initializer out from esolver code (#6494)

* Move hardware initializer out from esolver

* Remove useless codes

* Remove finalize code out

* Feature: support NVTX profiling via timer_enable_nvtx flag (#6495)

* Feature: support NVTX profiling via timer_enable_nvtx flag
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Add timer_enable_nvtx section in markdown
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Fix: Use __USE_NVTX macro to avoid NVTX linking errors in tests.
Clarify in docs that timer_enable_nvtx parameter only takes effect on CUDA platforms.
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Perf: Optimize Davidson by fusing operators, offloading CPU computation to GPU, and reducing memory transfers (#6493)

* Perf: Optimize Diago_DavSubspace with GPU operators by adding and fusing custom kernels.
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Perf: reduce memory allocation and copy in Diago_DavSubspace::diag_zhegvx
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Perf: Replace loop-based 2D copy and memset with memcpy_2d_op, memset_2d_op
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Perf: use warp reduce instead of shared memory for better efficiency
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Fix compilation error
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Fix: resolve compile error with USE_ELPA=OFF + BUILD_TESTING=ON and switch to nvtx3 headers when CUDA_VERSION >= 12090 (#6497)

* Fix: switch to nvtx3 headers when CUDA_VERSION >= 12090
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Fix: resolve compile error with USE_ELPA=OFF + BUILD_TESTING=ON
Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Fix dsp compilation problem (#6499)

* Fix: Fix crash in Debug build with multi-GPU due to forced cudaSetDevice(0) (#6498)

Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.

* Removed the temporary variable DMRGint_full when transitioning from 2D block parallelism to serial in Hcontainer(develop) (#6489)

* delete tem Hcontainer to reduce memory usage

* simplify the compute code

* change DM2D_tmp to dm2d_tmp, use vector instead of new

* Update version to 3.9.0.14 (#6504)

* Refactor: Remove the GlobalC from sep_cell and vsep_cell

* Removed GlobalC::sep_cell and GlobalC::vsep_cell from GlobalC

* Integrated sep_cell into UnitCell

* Integrated vsep_cell into esolver_ks_pw

* Added empty constructors and destructors for Sep_Pot and Sep_Cell to
facilitate unit testing compilation

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Critsium <[email protected]>
Co-authored-by: Tianxiang Wang <[email protected]>
Co-authored-by: zgn-26714 <[email protected]>
Co-authored-by: Erjie Wu <[email protected]>
Co-authored-by: Mohan Chen <[email protected]>
kluonj pushed a commit to kluonj/abacus-develop that referenced this pull request Sep 28, 2025
…ice(0) (deepmodeling#6498)

Signed-off-by:Tianxiang Wang<[email protected]>, Contributed under MetaX Integrated Circuits (Shanghai) Co., Ltd.
kluonj pushed a commit to kluonj/abacus-develop that referenced this pull request Sep 28, 2025
…w. (deepmodeling#6490)

@wangtianxiang wangtianxiang deleted the fix_print_device_info_bug branch October 10, 2025 08:31