Fix: EXX PW occupation number issue with nspin==2 && support for CUDA-Aware MPI macro (#6750)

Flying-dragon-boxing · mohanchen · WHUweiqingzhou · web-flow · commit a4f46782ebb2 · 2025-12-01T12:59:44.000+08:00
* correct nspin related behaviors
* add efermi passthrough
* Revert "add efermi passthrough"
  This reverts commit d7b402d.
* commits to resolve conversations related to codes
* DM and EDM pointers in pexsi now handled by diagopexsi, and copying h s matrices no longer needed
* add pexsi examples
* fix pexsi unit test (original version shouldn't run)
* add building docs for pexsi
* set cxx standard to c++14, which is required in make_unique
* Fix: Fix typo related to pexsi
* update to PPEXSIDFTDriver2
* default npoints to 1, so single core pexsi will work
* Feature: exx operator for pw basis, single kpt
* apply pexsi changes(?)
* q-e style exx_div
* Correct exxdiv
* Fix Compile errors
* refactor to abandon `pdiagh`
* Fix mu_buffer and nspin
* HSE examples
* Feature: Multi-K exx
* Feature: Multi-K exx
* Updates with latest
* Remove redundant global vars
* Update to v3.9.0
* Update to v3.9.0, now code works
* Remove Redundant cal_exx_energy in esolver_ks_pw.cpp
* Some mess
* Minor Fixes
* Fix separate loop and screening
* Add EXX stress
* EXX Energy???
* Multi-K is broken???
* Fix: Multi-K and stress
* Feature: ACE for single-K
* Feature: ACE should work for multi-K, but not for sure
* Feature: ACE works. Next step is ACE energy.
* Fix: adapt to the latest instruction for variable `conv_esolver`
* Reconstruct: move exx_helper to hamilt_pwdft
* Refactor: in ESolver_KS_PW, calculate deband in iter_finish, not in hamilt2density
* Fix: make files in consistent with upstream
* Fix: Now EXX PW doesn't depend on LibRI
* Fix: Add input constraints for EXX PW
* Fix: Remove redundant mpi barrier
* Fix: Clean irrelevant files
* Fix: Clean irrelevant files
* Feature: add ace flag, exit on using gpu
* Refactor: Phase 1 for refactoring exx energy
* Feature: now ace calculates energy
* Feature: enable exx energy
* Fix: fix makefile compilation error
* Fix: One minor fix for a segmentation fault
* Tests: one integrate test for exx pw, only for verifying whether exx pw works
* Revert "Tests: one integrate test for exx pw, only for verifying whether exx pw works"
  This reverts commit e7b606f.
* Fix: EXX PW ACE open only when separate_loop is on
* add timer
* Feature: Double Grid method of EXX PW
* Feature: Double Grid method of EXX PW Stress
* Fix: Double Grid method of EXX PW Stress
* Feature: add double grid variable
* Feature: add double grid variable
* Fis: HSE stress
* Fix: HSE Stress
* Fix: Timer
* Fix: Timer
* For non mp sampling, disable extrapolation
* Modify test
* Modify mp
* Format
* Format
* Feature: nspin == 2 scf
* Fix: nspin == 2 scf
* Docs: EXX PW Docs
* Feature: EXX PW for nspin=2
* Docs: EXX PW Docs
* Docs: EXX PW Docs
* Docs: EXX PW Docs, minor fixes
* Refactor
* Refactor
* Refactor
* Refactor
* Refactor
* Refactor: fix unit test
* Refactor: fix unit test
* Refactor: fix unit test
* Refactor: fix unit test
* Bump version v3.9.0.7
* Refactor: Remove set kvec funcs in `K_Vectors`
* Refactor: Remove final_scf
* Refactor: Fix kvecc2d/d2c
* Fix: Tests
* Fix: Tests
* Fix: Tests
* Fix: Tests
* Refactor: Final?
* Fix
* Fix
* Fix
* Fix
* GPU EXX PW Support
* Fix: Compile Error on CUDA > 12.9
* Fix: Compile Error on CUDA > 12.9
* NVTX3
* F***ing new version
* Feature: Support linear combination of coulomb_param for EXX PW
* Fix: Fix compile issue
* F***ing new version
* F***ing new version
* F***ing new version
* Uploading hybrid gauge tddft (#6369)
  * hybrid gague
  * update tests
  * update
  * update
  * update
  * update
  * update unit test
  * fix tests
  * update tests
  * fix read_wfc
  * fix catch_properties.sh
  * fix restart
  * update gpu test
  * update tests
  * fix
  * fix input_conv
* Improve md calculation stress output in running log (#6366)
  * Improve md calculation stress output in running log
  * Module_IO Unittest modify
  * ModuleMD Unittests modify
  * modify code comment in fire_test.cpp
  * maintain setprecision(8) for md stress output
* Refactor: Remove redundant Input_para from ESolver Class (#6370)
  * Refactor: Replace PARAM.inp with inp in ESolver classes for consistency
  * Refactor: Replace local input parameters with PARAM.inp in ESolver classes for consistency
  * Refactor: Use PARAM.inp.scf_ene_thr in ESolver_KS_LCAO iter_finish method
  * Revert "Refactor: Use PARAM.inp.scf_ene_thr in ESolver_KS_LCAO iter_finish method"
    This reverts commit b1bd0fd.
  * Revert "Refactor: Replace local input parameters with PARAM.inp in ESolver classes for consistency"
    This reverts commit f4f81e3.
* Fix: Fix memory leak introduced by new gint module (#6375)
  * fix memory leak
  * delete copy assignment
* refactor Exx_Opt_Orb (#6378)
  Co-authored-by: linpz <linpz@mail.ustc.edu.cn>
* Add use sw and fix Floating point exception (#6372)
  * remove float error in sunway
  * fix ig=0
  * add the sw
  * change the make_dir
  * unify the gg use
  * fix compile bug
  * add init
  * temporarily remove the sunway define
  * add the pesduo
  * fix compile bug
  * fix bug in the betar
  * modify the test
* Update the output formats of rt-TDDFT (#6381)
  * update the output formats of rt-TDDFT
  * update the output formats of rt-TDDFT
  * fix a bug
  * update initialized velocities
  * found some output information is still lacking in MD module
* [Refactor] Rename grid to module_grid and genelpa to module_genelpa (#6386)
  * Rename grid to module_grid
  * Rename genelpa to module_genelpa
  * Fix cmake
* Update the outputs of geometry relaxation (#6387)
  * update the output formats of rt-TDDFT
  * update the output formats of rt-TDDFT
  * fix a bug
  * update initialized velocities
  * found some output information is still lacking in MD module
  * update output information
  * remove some global variables in relax_driver
  * update outputs
  * update relaxation outputs
  * update relaxation output messages
  * update tests of print info
  * fix a test
  * fix cg outputs
  * udpate cg test
  * update relax tests
  * update LCAO output stress format
  * change update_cell.cpp algorithm, when the ion move is larger than the cell length, it is fine to proceed the relaxation calculations
  * fix tests for unitcells
  * update cell
* Feature: support the output of matrix representation of symm_ops (#6390)
  * Feature: support output the matrix representation of symmetry operation
  * Feature: support the output of matrix representation of symm_ops
  * update the document
* Feature: Output real space wavefunction and partial charge density when `device=gpu` (#6391)
  * Fix GPU output of out_pchg and out_wfc_norm, out_wfc_re_im
  * GPU integrate test is functional again
* Optimize RT-TDDFT dipole output (#6393)
* Perf: support GPU version of cal_force_cc with LCAO basis (#6392)
  * support GPU version of cal_force_cc with LCAO basis
  * fix a bug
* [Refactor] Move module_lr to source_lcao and add a new folder module_external in source_base (#6388)
  * Move module_lr to source_lcao
  * Fix test build
  * Move blas_connector to module_external
  * Fix header use
  * Fix internal header use
  * A fierce battle with Makefile😡
  * Move blacs_connector.h to module_external
  * Move lapack_connector.h and lapack_wrapper.h to module_external
  * Fix header usage
  * Move scalapack_connector.h to module_external
* Fix a bug for the output information after relaxation (#6395)
  * update the output formats of rt-TDDFT
  * update the output formats of rt-TDDFT
  * fix a bug
  * update initialized velocities
  * found some output information is still lacking in MD module
  * update output information
  * remove some global variables in relax_driver
  * update outputs
  * update relaxation outputs
  * update relaxation output messages
  * update tests of print info
  * fix a test
  * fix cg outputs
  * udpate cg test
  * update relax tests
  * update LCAO output stress format
  * change update_cell.cpp algorithm, when the ion move is larger than the cell length, it is fine to proceed the relaxation calculations
  * fix tests for unitcells
  * update cell
  * update some function names, update output A to Angstrom
  * change eV/A to eV/Angstrom
* bump version to 3.9.0.10 (#6397)
  Co-authored-by: Liang Sun <50293369+sunliang98@users.noreply.github.com>
* Fix: fix exx_gamma_extrapolation error in MPI
* Fix: fix exx_gamma_extrapolation error in MPI
* Update lapack.cu
* Refactor: Use LAPACK interfaces from ATen
* Fix: Integrate test
* Fix: implement devinfo for potrf
* Fix: MPI and Makefile
* Fix: get_potential
* Fix: ace
* Refactor
* Refactor
* Refactor
* Refactor
* Refactor
* Fix: conv
* Revert "Fix: conv"
  This reverts commit d2da506.
* Fix: conv
* Fix: conv hard code thr for now
* Fix: conv hard code thr for now
* Fix: conv hard code thr for now
* Fix: conv hard code thr for now
* Refactor
* Refactor
* Refactor
* Refactor
* Refactor
* Mod
* Begin EXX KPAR
* Begin EXX KPAR
* Begin EXX KPAR
* Begin EXX KPAR
* Begin EXX KPAR
* EXX KPAR WORKS
* EXX KPAR WORKS Alternative
* Fix GPU, but so ugly...
* Undo cuda aware mpi
* Undo cuda aware mpi
* Revert "Undo cuda aware mpi"
  This reverts commit a8d71b2.
* EXX KPAR WORKS on NSPIN=2
* Fix without MPI
* Fix header
* Separate ecut for exx pw
* Documents related
* Documents related
* EXX KPAR BUG FIX
* EXX KPAR BUG FIX
* CUDA-Aware MPI
* Fix: nspin=2

---------

Co-authored-by: Mohan Chen <mohan.chen.chen.mohan@gmail.com>
Co-authored-by: wqzhou <33364058+WHUweiqingzhou@users.noreply.github.com>
Co-authored-by: HTZhao <104255052+ESROAMER@users.noreply.github.com>
Co-authored-by: lanshuyue <140165754+lanshuyue@users.noreply.github.com>
Co-authored-by: Liang Sun <50293369+sunliang98@users.noreply.github.com>
Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
Co-authored-by: linpeize <linpeize2024@163.com>
Co-authored-by: linpz <linpz@mail.ustc.edu.cn>
Co-authored-by: liiutao <74701833+A-006@users.noreply.github.com>
Co-authored-by: Mohan Chen <mohanchen@pku.edu.cn>
Co-authored-by: Critsium <tsfxwbbzxy@163.com>
Co-authored-by: kirk0830 <67682086+kirk0830@users.noreply.github.com>
Co-authored-by: Taoni Bao <baotaoni@pku.edu.cn>
Co-authored-by: Chen Nuo <49788094+Cstandardlib@users.noreply.github.com>
diff --git a/source/source_pw/module_pwdft/operator_pw/exx_pw_ace.cpp b/source/source_pw/module_pwdft/operator_pw/exx_pw_ace.cpp
@@ -182,7 +182,27 @@ void OperatorEXXPW<T, Device>::construct_ace() const
                         // if (iq == 0)
                         //     std::cout << "Bcast psi_mq_real" << std::endl;
 #ifdef __MPI
+#ifdef __CUDA_MPI
                         MPI_Bcast(psi_mq_real, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+#else
+                        if (PARAM.inp.device == "cpu")
+                        {
+                            MPI_Bcast(psi_mq_real, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+                        }
+                        else if (PARAM.inp.device == "gpu")
+                        {
+                            // need to copy to cpu first
+                            T* psi_mq_real_cpu = new T[wfcpw->nrxx];
+                            syncmem_complex_d2c_op()(psi_mq_real_cpu, psi_mq_real, wfcpw->nrxx);
+                            MPI_Bcast(psi_mq_real_cpu, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+                            syncmem_complex_c2d_op()(psi_mq_real, psi_mq_real_cpu, wfcpw->nrxx);
+                            delete[] psi_mq_real_cpu;
+                        }
+                        else
+                        {
+                            ModuleBase::WARNING_QUIT("OperatorEXXPW", "construct_ace: unknown device");
+                        }
+#endif
 #endif
 
                     } // end of iq
@@ -287,7 +307,7 @@ template <typename T, typename Device>
 double OperatorEXXPW<T, Device>::cal_exx_energy_ace(psi::Psi<T, Device>* ppsi_) const
 {
     double Eexx = 0;
-
+    int nspin_fac = PARAM.inp.nspin == 2 ? 2 : 1;
     psi::Psi<T, Device> psi_ = *ppsi_;
     int* ik_ = const_cast<int*>(&this->ik);
     int ik_save = this->ik;
diff --git a/source/source_pw/module_pwdft/operator_pw/exx_pw_pot.cpp b/source/source_pw/module_pwdft/operator_pw/exx_pw_pot.cpp
@@ -392,7 +392,7 @@ double exx_divergence(Conv_Coulomb_Pot_K::Coulomb_Type coulomb_type,
                       double ucell_omega)
 {
     double exx_div = 0;
-
+    // return exx_div;
     double nqs_half1 = 0.5 * kv->nmp[0];
     double nqs_half2 = 0.5 * kv->nmp[1];
     double nqs_half3 = 0.5 * kv->nmp[2];
diff --git a/source/source_pw/module_pwdft/operator_pw/op_exx_pw.cpp b/source/source_pw/module_pwdft/operator_pw/op_exx_pw.cpp
@@ -174,6 +174,9 @@ void OperatorEXXPW<T, Device>::act(const int nbands,
                                    const bool is_first_node) const
 {
     if (first_iter) return;
+    // std::cout << cal_exx_energy_ace(&psi) << " EXX energy" << std::endl;
+    // MPI_Abort(MPI_COMM_WORLD, 0);
+    // return;
 
     if (is_first_node)
     {
@@ -213,6 +216,8 @@ void OperatorEXXPW<T, Device>::act_op(const int nbands,
     // for (auto iq: q_points)
     //     std::cout << iq << ", ";
     // std::cout << std::endl;
+    int nk_fac = PARAM.inp.nspin == 2 ? 2 : 1;
+    int nk = wfcpw->nks / nk_fac;
 
     // ik fixed here, select band n
     for (int n_iband = 0; n_iband < nbands; n_iband++)
@@ -226,7 +231,7 @@ void OperatorEXXPW<T, Device>::act_op(const int nbands,
         Real nqs = q_points.size();
         for (int iq: q_points)
         {
-            get_exx_potential<Real, Device>(kv, wfcpw, rhopw_dev, pot, tpiba, gamma_extrapolation, ucell->omega, this->ik, iq);
+            get_exx_potential<Real, Device>(kv, wfcpw, rhopw_dev, pot, tpiba, gamma_extrapolation, ucell->omega, this->ik, iq % nk);
             for (int m_iband = 0; m_iband < psi.get_nbands(); m_iband++)
             {
                 // double wg_mqb_real = GlobalC::exx_helper.wg(iq, m_iband);
@@ -259,7 +264,6 @@ void OperatorEXXPW<T, Device>::act_op(const int nbands,
                 }
 
                 T wk_iq = kv->wk[iq];
-                T wk_ik = kv->wk[this->ik];
 
                 T tmp_scalar = wg_mqb / wk_iq / nqs;
                 axpy_complex_op()(wfcpw->nrxx,
@@ -318,6 +322,10 @@ void OperatorEXXPW<T, Device>::act_op_kpar(const int nbands,
         // decide which pool does the iq belong to
         int iq_pool = kv->para_k.whichpool[iq];
         int iq_loc  = iq - kv->para_k.startk_pool[iq_pool];
+        if (ispin == 1)
+        {
+            iq_loc += wfcpw->nks / nspin_fac;
+        }
 
         for (int m_iband = 0; m_iband < psi.get_nbands(); m_iband++)
         {
@@ -339,11 +347,30 @@ void OperatorEXXPW<T, Device>::act_op_kpar(const int nbands,
                 // send
             }
 #ifdef __MPI
+#ifdef __CUDA_MPI
             MPI_Bcast(psi_mq_real, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+#else
+            if (PARAM.inp.device == "cpu")
+            {
+                MPI_Bcast(psi_mq_real, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+            }
+            else if (PARAM.inp.device == "gpu")
+            {
+                // need to copy to cpu first
+                T* psi_mq_real_cpu = new T[wfcpw->nrxx];
+                syncmem_complex_d2c_op()(psi_mq_real_cpu, psi_mq_real, wfcpw->nrxx);
+                MPI_Bcast(psi_mq_real_cpu, wfcpw->nrxx, MPI_DOUBLE_COMPLEX, iq_pool, KP_WORLD);
+                syncmem_complex_c2d_op()(psi_mq_real, psi_mq_real_cpu, wfcpw->nrxx);
+                delete[] psi_mq_real_cpu;
+            }
+            else
+            {
+                ModuleBase::WARNING_QUIT("OperatorEXXPW", "act_op_kpar: unknown device");
+            }
+#endif
 #endif
             for (int n_iband = 0; n_iband < nbands; n_iband++)
             {
-                double wg_nkb = (*wg)(this->ik, n_iband);
                 const T* psi_nk = tmpsi_in + n_iband * nbasis;
                 // retrieve \psi_nk in real space
                 wfcpw->recip_to_real(ctx, psi_nk, psi_nk_real, this->ik);
@@ -369,9 +396,8 @@ void OperatorEXXPW<T, Device>::act_op_kpar(const int nbands,
 
                 Real wk_iq = kv->wk[iq];
                 Real wk_ik = kv->wk[this->ik];
-                // std::cout << "wk_iq: " << wk_iq << " wk_ik: " << wk_ik << std::endl;
 
-                Real tmp_scalar = wg_mqb / wk_ik / nqs;
+                Real tmp_scalar = wg_mqb / wk_ik / nqs; // wk_ik works for now, but wrong for symmetry.
 
                 T* h_psi_nk = tmhpsi + n_iband * nbasis;
                 Real hybrid_alpha = GlobalC::exx_info.info_global.hybrid_alpha;
@@ -569,7 +595,8 @@ double OperatorEXXPW<T, Device>::cal_exx_energy_op(psi::Psi<T, Device> *ppsi_) c
 
             for (int iq: q_points)
             {
-                get_exx_potential<Real, Device>(kv, wfcpw, rhopw_dev, pot, tpiba, gamma_extrapolation, ucell->omega, ik, iq);
+                int nk = wfcpw->nks / nk_fac;
+                get_exx_potential<Real, Device>(kv, wfcpw, rhopw_dev, pot, tpiba, gamma_extrapolation, ucell->omega, ik, iq % nk);
                 for (int m_iband = 0; m_iband < psi.get_nbands(); m_iband++)
                 {
                     // double wg_f = GlobalC::exx_helper.wg(iq, m_iband);
@@ -589,7 +616,7 @@ double OperatorEXXPW<T, Device>::cal_exx_energy_op(psi::Psi<T, Device> *ppsi_) c
 
                     int nks = wfcpw->nks;
                     int npw = rhopw_dev->npw;
-                    int nk = nks / nk_fac;
+                    // int nk = nks / nk_fac;
                     Eexx_ik_real += exx_cal_energy_op<T, Device>()(density_recip, pot, wg_iqb_real / nqs * wg_ikb_real / kv->wk[ik], npw);
 
                 } // m_iband
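Much of the `nspin == 2` change in this diff is index bookkeeping. With spin polarization, `wfcpw->nks` counts each k-point once per spin, so the diff divides it by 2 to get the number of spatial k-points, passes `iq % nk` to `get_exx_potential`, and in `act_op_kpar` shifts `iq_loc` by `wfcpw->nks / nspin_fac` when `ispin == 1`. The sketch below assumes the spin-down entries are stored as a second block after the spin-up ones (indices `nk..2*nk-1`); `SpinKIndex` and its methods are hypothetical names for illustration, not ABACUS API.

```cpp
#include <cassert>

// Hypothetical helper, not part of ABACUS.
// Assumed layout: entries 0..nk-1 are spin-up k-points and entries
// nk..2*nk-1 are the spin-down copies of the same spatial k-points.
struct SpinKIndex
{
    int nks;    // number of (spin, k) entries, e.g. wfcpw->nks
    int nspin;  // 1 or 2 (nspin == 4 is not handled by this sketch)

    int nspin_fac() const { return nspin == 2 ? 2 : 1; }

    // Number of spatial k-points; mirrors `wfcpw->nks / nk_fac` in the diff.
    int nk() const
    {
        assert(nks % nspin_fac() == 0);  // spin-doubled list must split evenly
        return nks / nspin_fac();
    }

    // Spatial k-point of a spin-doubled index; mirrors `iq % nk` in act_op.
    int spatial(int ik) const { return ik % nk(); }

    // Spin channel (0 = up, 1 = down) of a spin-doubled index.
    int spin(int ik) const { return ik / nk(); }

    // Move a spin-up index into the requested spin block; mirrors
    // `if (ispin == 1) iq_loc += wfcpw->nks / nspin_fac;` in act_op_kpar.
    int in_spin_block(int ik_up, int ispin) const
    {
        return ispin == 1 ? ik_up + nk() : ik_up;
    }
};
```

For example, with 4 spatial k-points and `nspin = 2` (so `nks = 8`), index 5 is the spin-down copy of spatial k-point 1, and spin-up index 1 maps to 5 in the spin-down block.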

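Without `__CUDA_MPI`, the diff stages a GPU-resident `psi_mq_real` through host memory before `MPI_Bcast`, and the same block is repeated in `construct_ace` and `act_op_kpar`. A minimal sketch of factoring that block into one helper follows, under stated assumptions: `bcast_possibly_device` is a hypothetical name, `MPI_Bcast` and the `syncmem_complex_d2c_op`/`syncmem_complex_c2d_op` copies are passed in as callables so the example runs without MPI or CUDA, and the runtime `cuda_aware_mpi` flag stands in for the compile-time `__CUDA_MPI` macro. A `std::vector` replaces the `new[]`/`delete[]` pair so the staging buffer is released even if a callable throws.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not an ABACUS function). The callables stand in for:
//   bcast(ptr, n)                 ~ MPI_Bcast(ptr, n, <datatype>, root, comm)
//   device_to_host(dst, src, n)   ~ syncmem_complex_d2c_op()(dst, src, n)
//   host_to_device(dst, src, n)   ~ syncmem_complex_c2d_op()(dst, src, n)
// cuda_aware_mpi models the compile-time __CUDA_MPI switch as a runtime flag.
template <typename T, typename Bcast, typename D2H, typename H2D>
void bcast_possibly_device(T* buf, std::size_t n, const std::string& device,
                           bool cuda_aware_mpi, const Bcast& bcast,
                           const D2H& device_to_host, const H2D& host_to_device)
{
    assert(buf != nullptr || n == 0);
    if (cuda_aware_mpi || device == "cpu")
    {
        bcast(buf, n);  // MPI reads/writes buf directly
        return;
    }
    if (device != "gpu")
    {
        throw std::invalid_argument("bcast_possibly_device: unknown device " + device);
    }
    std::vector<T> staging(n);               // host copy, freed automatically (also on throw)
    device_to_host(staging.data(), buf, n);  // GPU -> CPU
    bcast(staging.data(), n);                // broadcast from host memory
    host_to_device(buf, staging.data(), n);  // CPU -> GPU
}
```

On the `gpu` path the broadcast only ever sees the host staging buffer, with one copy on each side of it; on the `cpu` path, or with CUDA-aware MPI, the buffer is broadcast in place. If `T` can also be `std::complex<float>`, the datatype handed to `MPI_Bcast` would need to follow `T` instead of being fixed to `MPI_DOUBLE_COMPLEX`.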