@AsTonyshment
Collaborator

Linked Issue

Fix #6229

Description

When sparse matrices were generated with get_R_range, different MPI processes could produce inconsistent sets of R-coordinates. The per-process data sizes then no longer matched, so MPI_Allreduce operations failed with MPI_ERR_TRUNCATE.

The fix has two parts:

  1. Synchronize R-coordinates globally:
    • Added sync_all_R_coor, which gathers the R-coordinates from all processes via MPI_Allgatherv so that every rank holds the same all_R_coor set.

  2. Fix MPI buffer size handling:
    • Corrected the buffer size calculations for MPI_Allgatherv to account for the 3 integers (x, y, z) per R-coordinate, resolving MPI_ERR_TRUNCATE.
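
The two steps above can be illustrated with a minimal sketch of the buffer bookkeeping. This is not the code from this PR: RCoor, GatherLayout, make_layout, flatten and unflatten are hypothetical names, and the MPI calls are left out so the snippet compiles without an MPI installation. It only shows why every count and displacement must be expressed in integers (3 per R-coordinate) rather than in coordinates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for an R-coordinate (x, y, z); not the ABACUS type.
using RCoor = std::array<int, 3>;

// Receive layout for a variable-size gather, measured in ints.
struct GatherLayout {
    std::vector<int> recvcounts; // number of ints contributed by each rank
    std::vector<int> displs;     // offset of each rank's block, in ints
    int total = 0;               // required receive buffer length, in ints
};

// Build the layout from the number of R-coordinates held by each rank.
// Every coordinate occupies 3 ints, so counts and offsets are scaled by 3.
GatherLayout make_layout(const std::vector<int>& n_coor_per_rank) {
    GatherLayout layout;
    for (int n : n_coor_per_rank) {
        layout.displs.push_back(layout.total);
        layout.recvcounts.push_back(3 * n);
        layout.total += 3 * n;
    }
    return layout;
}

// Pack a set of coordinates into a flat int buffer (the send buffer).
std::vector<int> flatten(const std::set<RCoor>& coords) {
    std::vector<int> buf;
    buf.reserve(3 * coords.size());
    for (const RCoor& r : coords) {
        buf.insert(buf.end(), r.begin(), r.end());
    }
    return buf;
}

// Rebuild a duplicate-free coordinate set from a gathered flat buffer.
std::set<RCoor> unflatten(const std::vector<int>& buf) {
    assert(buf.size() % 3 == 0); // a partial coordinate indicates a size bug
    std::set<RCoor> coords;
    for (std::size_t i = 0; i < buf.size(); i += 3) {
        coords.insert(RCoor{buf[i], buf[i + 1], buf[i + 2]});
    }
    return coords;
}
```

In an actual MPI run, the per-rank coordinate counts would presumably come from an MPI_Allgather of each rank's local count, and the flat buffers would be exchanged with MPI_Allgatherv using MPI_INT together with recvcounts and displs from the layout. Because every rank then unpacks the same gathered buffer, all ranks end up with an identical set.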

@AsTonyshment AsTonyshment requested a review from mohanchen May 23, 2025 10:13
@mohanchen mohanchen added the Bugs (Bugs that only solvable with sufficient knowledge of DFT) and Refactor (Refactor ABACUS codes) labels May 24, 2025
@dyzheng dyzheng self-requested a review May 27, 2025 06:11
@mohanchen mohanchen merged commit 5daf5d9 into deepmodeling:develop May 29, 2025
14 checks passed
@AsTonyshment AsTonyshment deleted the fix_scf_MPI_ERR_TRUNCATE branch May 29, 2025 02:21
dyzheng pushed a commit that referenced this pull request Sep 30, 2025
…atch problem during sparse matrix generation) (#6555)

* Fixed the bug in memory statistics

* Fix: MPI communication errors due to inconsistent R-coordinates in sparse matrix generation (#6233)

* Fix MPI_ERR_TRUNCATE error

* Add MPI compilation macro

* Temp debug info print

* Move sync operation into get_R_range

---------

Co-authored-by: Taoni Bao <[email protected]>

Development

Successfully merging this pull request may close these issues.

Bug: Segmentation fault during SCF calculation in v3.9.0.3+ for specific structure (MPI_ERR_TRUNCATE)

3 participants