fix(c++): fix NULL type in custom op #4889
Conversation
Replaces usage of lmp_list send/recv arrays with new vectors that map indices using fwd_map and synchronize counts via MPI. Updates tensor construction to use these new vectors, improving correctness and flexibility in distributed communication.
for more information, see https://pre-commit.ci
📝 Walkthrough
Calls a new helper to remap LAMMPS sendlists using a forward map before constructing MPI/send tensors in DeepPotPT, DeepPotPD, and DeepSpinPT, and adds the helper declaration/implementation and a small test. No public API signature changes beyond the new helper.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Compute as DeepPot*::compute
    participant Select as select_real_atoms_sendlist
    participant Map as fwd_map
    participant Tensors as Tensor & comm_dict build
    Compute->>Select: call select_real_atoms_sendlist(lmp_list, fwd_map)
    Select->>Map: map each send index -> forwarded index
    Map-->>Select: remapped sendlist (invalids removed)
    Select-->>Compute: lmp_list updated (sendlist/sendnum/recvnum)
    Compute->>Tensors: build send/recv tensors and comm_dict from updated lmp_list
```
Actionable comments posted: 3
🧹 Nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)

251-257: Remove stale commented-out code

Dead commented code obscures the current data path and makes maintenance harder.

Apply this diff:

```diff
-// torch::Tensor firstrecv_tensor =
-//     torch::from_blob(lmp_list.firstrecv, {nswap}, int32_option);
-// torch::Tensor recvnum_tensor =
-//     torch::from_blob(lmp_list.recvnum, {nswap}, int32_option);
-// torch::Tensor sendnum_tensor =
-//     torch::from_blob(lmp_list.sendnum, {nswap}, int32_option);
```
266-269: Remove redundant commented-out legacy code

Same reasoning; the commented legacy path is preserved in git history.

Apply this diff:

```diff
-// int total_send =
-//     std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-// torch::Tensor sendlist_tensor =
-//     torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
```
📒 Files selected for processing (1)
- source/api_cc/src/DeepPotPT.cc (3 hunks)
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)
185-204: Remapping logic LGTM

Correctly rebuilds per-swap send counts and a dense send list using fwd_map, with bounds checks and filtering. Reserving capacity via the accumulated legacy counts is a good optimization.
226-232: firstrecv_new is unused and not required; the original comment is incorrect

Short: deepmd/pt/model/descriptor/repflows.py builds comm_dict and calls torch.ops.deepmd.border_op with send_list, send_proc, recv_proc, send_num, recv_num, and communicator (no first_recv). The computed firstrecv_new/firstrecv_tensor in the PT wrappers is dead code; remove it or document why it is kept.

Files to update:
- source/api_cc/src/DeepPotPT.cc: remove the firstrecv_new prefix-sum computation (around lines 226-231) and the unused firstrecv_tensor creation (around line 238).
- source/api_cc/src/DeepSpinPT.cc: same pattern; firstrecv_tensor is created around lines 187-191 but never used or inserted.
Suggested change (remove unused code); example diff for DeepPotPT.cc:

```diff
 @@
-std::vector<int> firstrecv_new(nswap, 0);
-int acc = 0;
-for (int s = 0; s < nswap; ++s) {
-  firstrecv_new[s] = acc;
-  acc += recvnum_new[s];
-}
+/* firstrecv computation removed; not used by border_op */
 @@
-torch::Tensor firstrecv_tensor =
-    torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
+/* firstrecv tensor omitted; border_op expects recv_num, not first_recv */
```
If you prefer to keep the computation for clarity, add a short comment explaining it's intentionally unused.
Likely an incorrect or invalid review comment.
Pull Request Overview
This PR fixes a bug in the C++ custom operations for distributed MPI communication by implementing proper handling of NULL types in the send/receive lists. The fix ensures that atom indices are correctly mapped and synchronized across MPI processes by filtering out invalid indices and updating counts accordingly.
Key changes:
- Added a new function to select and remap real atoms in send lists using forward mapping
- Integrated the send list filtering into PyTorch, Paddle, and DeepSpin compute functions
- Added a test case to verify NULL type handling in LAMMPS pair coefficients
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| source/api_cc/src/common.cc | Implements select_real_atoms_sendlist function to filter and remap atom indices in MPI send lists |
| source/api_cc/include/common.h | Adds function declaration for the new send list selection function |
| source/api_cc/src/DeepPotPT.cc | Integrates send list filtering into PyTorch backend compute function |
| source/api_cc/src/DeepPotPD.cc | Integrates send list filtering into Paddle backend compute function |
| source/api_cc/src/DeepSpinPT.cc | Integrates send list filtering into DeepSpin PyTorch compute function |
| source/lmp/tests/test_lammps_dpa_pt.py | Adds test case for NULL type handling in LAMMPS pair coefficients |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
source/api_cc/src/DeepSpinPT.cc (1)

205-213: Flatten the sendlist pointer-of-pointers into a contiguous buffer before calling from_blob

InputNlist::sendlist is declared as int** (an array of pointers to per-swap arrays) in source/api_c/include/deepmd.hpp, so passing it directly to torch::from_blob(lmp_list.sendlist, {total_send}, int32_option) will interpret the array of pointers as a contiguous block of ints, which is invalid and will lead to incorrect data being read (or a crash). You must first flatten all per-swap subarrays into a single contiguous int buffer.

Please update the following files:
- source/api_cc/src/DeepSpinPT.cc (around line 207)
- source/api_cc/src/DeepPotPT.cc (around line 200)

Proposed refactoring to safely flatten (ensuring the buffer outlives the call to run_method):

```diff
-torch::Tensor sendlist_tensor =
-    torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
+// Flatten the int** sendlist into a single contiguous vector
+std::vector<int> flat_sendlist;
+flat_sendlist.reserve(total_send);
+for (int s = 0; s < nswap; ++s) {
+  flat_sendlist.insert(
+      flat_sendlist.end(),
+      lmp_list.sendlist[s],
+      lmp_list.sendlist[s] + lmp_list.sendnum[s]);
+}
+// Create a tensor from the contiguous buffer
+torch::Tensor sendlist_tensor =
+    torch::from_blob(flat_sendlist.data(), {total_send}, int32_option);
```
source/api_cc/src/DeepPotPD.cc (1)

421-439: Shape bug: send_list should carry nswap pointer addresses, not total_send elements

You reshape send_list to total_send but copy only nswap pointer addresses. This is a mismatched shape and can corrupt reads downstream.

```diff
-int total_send =
-    std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-sendlist_tensor->Reshape({total_send});
+// send_list carries one pointer per swap
+sendlist_tensor->Reshape({nswap});
 /**
  ** NOTE: paddle do not support construct a Tensor with from_blob(T**, ...)
  ** from a double pointer, so we convert int* pointer to indptr_t for each
  ** entry and wrap it into int64 Tensor as a workaround.
  */
 std::vector<std::intptr_t> pointer_addresses;
 pointer_addresses.reserve(nswap);
```
🧹 Nitpick comments (1)
source/api_cc/include/common.h (1)

105-107: API const-correctness: this mutates inlist; avoid const on the parameter

The helper writes through inlist pointers (sendnum/recvnum/sendlist). Marking the parameter const is misleading.

```diff
-void select_real_atoms_sendlist(const deepmd::InputNlist& inlist,
-                                const std::vector<int>& fwd_map);
+// In-place: trims and remaps sendlist; updates sendnum/recvnum/(firstrecv if present)
+void select_real_atoms_sendlist(deepmd::InputNlist& inlist,
+                                const std::vector<int>& fwd_map);
```

If changing the signature ripples too far now, at least document the in-place side effects in a comment.
📒 Files selected for processing (6)
- source/api_cc/include/common.h (1 hunks)
- source/api_cc/src/DeepPotPD.cc (1 hunks)
- source/api_cc/src/DeepPotPT.cc (1 hunks)
- source/api_cc/src/DeepSpinPT.cc (1 hunks)
- source/api_cc/src/common.cc (1 hunks)
- source/lmp/tests/test_lammps_dpa_pt.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- source/api_cc/src/DeepPotPT.cc
🧰 Additional context used
🧬 Code graph analysis (5)
- source/lmp/tests/test_lammps_dpa_pt.py (1)
  - source/lmp/tests/test_lammps.py: lammps_type_map (289-292)
- source/api_cc/src/common.cc (1)
  - source/op/tf/prod_env_mat_multi_device.cc: inlist (1590-1590)
- source/api_cc/include/common.h (2)
  - source/api_cc/src/common.cc: select_real_atoms_sendlist (235-258), select_real_atoms_sendlist (235-236)
  - source/api_c/include/deepmd.hpp: InputNlist (807-814), InputNlist (815-822), InputNlist (823-850), InputNlist (851-851)
- source/api_cc/src/DeepSpinPT.cc (1)
  - source/api_cc/src/common.cc: select_real_atoms_sendlist (235-258), select_real_atoms_sendlist (235-236)
- source/api_cc/src/DeepPotPD.cc (1)
  - source/api_cc/src/common.cc: select_real_atoms_sendlist (235-258), select_real_atoms_sendlist (235-236)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Analyze (c-cpp)
🔇 Additional comments (1)
source/api_cc/src/common.cc (1)

235-258: Recompute firstrecv after trimming and verify symmetry

You need to keep the receive-offset array in sync whenever you shrink recvnum, and also confirm that the assumption recvnum[s] == sendnum[s] holds across all ranks/types.

• After your loop that sets inlist.sendnum[s] = cnt; inlist.recvnum[s] = cnt;, add:

```diff
+// Recompute receive offsets to match updated recvnum[]
+if (inlist.firstrecv) {
+  int offset = 0;
+  for (int s = 0; s < nswap; ++s) {
+    inlist.firstrecv[s] = offset;
+    offset += inlist.recvnum[s];
+  }
+}
```

This keeps the firstrecv prefix sums correct (firstrecv is "where to put 1st recv atom in each swap") (docs.deepmodeling.com).

• Because you're setting recvnum[s] = sendnum[s], you're assuming perfect send/recv symmetry per swap. Please verify this holds under all type-map filters and MPI ranks. Consider adding a debug-only assertion or a collective check (e.g. an MPI_Allreduce on the arrays) to catch any imbalance early.
LGTM for source/api_cc/src/DeepPotPD.cc