RT-TDDFT GPU Acceleration: RT-TD now fully supports GPU computation #5773
Conversation
LGTM 👍, a good example showing the possibility of using `Tensor`.
Squashed commits (deepmodeling#5773):

* Phase 1 of RT-TDDFT GPU Acceleration: Rewriting existing code using Tensor
* [pre-commit.ci lite] apply automatic fixes
* Initialize int info in bandenergy.cpp
* Initialize double aa, bb in bandenergy.cpp
* Fix a bug where CopyFrom caused shared data between tensors, using = (assignment operator overload) instead
* RT-TDDFT GPU Acceleration (Phase 2): Adding needed BLAS and LAPACK support for Tensor on CPU and refactoring linear algebra operations in TDDFT
* LAPACK wrapper functions: change const basic-type input parameters from pass-by-reference to pass-by-value
* Did nothing, just formatting esolver.cpp
* Core algorithm: RT-TD now has preliminary support for GPU computation
* Fix GitHub CI CUDA build bug due to deleted variable
* Refactor some files
* Getting ready for gathering MPI processes
* MPI multi-process compatibility
* Fix GitHub CI MPI compilation bug
* Minor fix and refactor
* Initialize double aa, bb and one line for one variable
* Rename bandenergy.cpp to band_energy.cpp and corresponding adjustments
* Fix compile error and change CMakeLists accordingly
* Initialize int naroc
* Initialize MPI related variables: myid, num_procs and root_proc
* Refactor Propagator class implementation into multiple files for better code organization
* Remove all GlobalV::ofs_running from RT-TDDFT core algorithms and pass it as an input parameter instead
* Add assert in some places and optimize redundant index calculations in nested loops

Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>

Phase 1: Rewriting existing code using `Tensor` (complete)

This is merely a draft and does not represent the final code. Since `Tensor` can effectively support heterogeneous computing, the goal of the first phase is to rewrite the existing algorithms using `Tensor`. Currently, all memory is still explicitly allocated on the CPU (the parameter of the `Tensor` constructor is `container::DeviceType::CpuDevice`).

Phase 2: Adding needed BLAS and LAPACK support for `Tensor` on CPU and refactoring linear algebra operations in TDDFT (complete)

Key Changes:
- Added `lapack_getrf` and `lapack_getri` in `module_base/module_container/ATen/kernels/lapack.h` to support matrix LU factorization (getrf) and matrix inversion (getri) operations for `Tensor` objects.
- Updated the LAPACK function (`zgetrf_` and `zgetri_`) declarations in `module_base/lapack_connector.h` to comply with standard conventions.
- Refactored linear algebra operations in TDDFT into `Tensor` operations. These linear algebra operations in the `container::kernels` module from `module_base/module_container/ATen` include a `Device` parameter, enabling seamless support for heterogeneous computing (GPU acceleration in future phases).

Phase 3: RT-TDDFT GPU acceleration core algorithm (complete)
Added linear solver interfaces:
- CPU: linear solver (`getrs`) using LAPACK.
- GPU: LU factorization (`getrf`) and linear solver (`getrs`) using cuSOLVER.

Refactored RT-TDDFT I/O and parameters:
- Moved RT-TDDFT input parameters (`td_force_dt`, `td_vext`, `td_vext_dire_case`, `out_dipole`, `out_efield`) from the `Evolve_elec` class to the `PARAM.inp` input interface to simplify template class usage with the `Device` parameter.

Heterogeneous computing support:
- Added a `Device` template parameter to RT-TDDFT core algorithm classes and functions.
- Used memory synchronization operations (`base_device::memory::synchronize_memory_op`) to ensure proper data handling across devices.
- Replaced `BlasConnector::copy` operations with memory synchronization functions.

GPU acceleration for RT-TDDFT:
Phase 4: MPI multi-process compatibility (complete)
- Fixed the `ctx` parameters in memory synchronization operations.
- RT-TDDFT now supports multi-process runs with `device=gpu`.