
Conversation


@iProzd iProzd commented Aug 14, 2025

Replaces the lmp_list send/recv arrays with new vectors that remap indices through fwd_map and synchronizes the counts via MPI. Tensor construction now uses these vectors, improving correctness and flexibility in distributed communication.

Summary by CodeRabbit

  • Refactor
    • Message-passing now remaps send lists to actual atoms before building communication tensors, improving consistency for distributed runs.
  • Bug Fixes
    • Invalid or out-of-range send indices are filtered and counts updated to prevent communication mismatches and related errors.
  • Tests
    • Added a test covering a type-map scenario with a NULL mapping to exercise the updated handling.

@iProzd iProzd marked this pull request as draft August 14, 2025 13:26
@github-actions github-actions bot added the C++ label Aug 14, 2025

coderabbitai bot commented Aug 14, 2025

📝 Walkthrough

Calls a new helper to remap LAMMPS sendlists using a forward-map before constructing MPI/send tensors in DeepPotPT/DeepPotPD/DeepSpinPT, and adds the helper declaration/implementation and a small test. No public API signature changes beyond the new helper.

Changes

Sendlist remapping helper (source/api_cc/src/common.cc, source/api_cc/include/common.h)
  • Adds the deepmd::select_real_atoms_sendlist(const deepmd::InputNlist& inlist, const std::vector<int>& fwd_map) declaration and implementation. Filters and remaps per-swap sendlist indices via fwd_map, discards invalid mappings, and updates sendlist, sendnum, and recvnum.
DeepPot message-passing callers (source/api_cc/src/DeepPotPT.cc, source/api_cc/src/DeepPotPD.cc, source/api_cc/src/DeepSpinPT.cc)
  • After obtaining nswap in the message-passing branch, calls select_real_atoms_sendlist(lmp_list, fwd_map) so that subsequent tensor/comm_dict construction uses the remapped sendlists. No other control-flow or API changes.
Tests (source/lmp/tests/test_lammps_dpa_pt.py)
  • Adds test_pair_deepmd_type_map_with_null, which sets up a DeepMD type map with pair_coeff("* * H NULL") and runs 0 steps; no assertions added.
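For illustration, the filtering behavior described above can be sketched on a simplified neighbor-list structure. The struct and function names below are hypothetical stand-ins, not the actual deepmd::InputNlist layout or API:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Simplified stand-in for a LAMMPS-style neighbor list: one send list per
// swap, plus per-swap send/recv counts. Illustrative only.
struct SimpleNlist {
  int nswap;
  std::vector<std::vector<int>> sendlist;  // send indices, one list per swap
  std::vector<int> sendnum;                // send count per swap
  std::vector<int> recvnum;                // recv count per swap
};

// Remap each send index through fwd_map, dropping out-of-range indices and
// entries that map to -1 (atoms removed by the type map), then shrink the
// per-swap counts to match the surviving entries.
inline void remap_sendlist(SimpleNlist& nlist, const std::vector<int>& fwd_map) {
  for (int s = 0; s < nlist.nswap; ++s) {
    std::vector<int> kept;
    kept.reserve(nlist.sendnum[s]);
    for (int idx : nlist.sendlist[s]) {
      if (idx >= 0 && idx < static_cast<int>(fwd_map.size()) &&
          fwd_map[idx] >= 0) {
        kept.push_back(fwd_map[idx]);  // store the remapped index
      }
    }
    nlist.sendnum[s] = static_cast<int>(kept.size());
    nlist.recvnum[s] = static_cast<int>(kept.size());  // assumes symmetry
    nlist.sendlist[s] = std::move(kept);
  }
}
```

Setting recvnum equal to the trimmed sendnum assumes per-swap send/recv symmetry, a point the review below raises separately.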

Sequence Diagram(s)

sequenceDiagram
  participant Compute as DeepPot*::compute
  participant Select as select_real_atoms_sendlist
  participant Map as fwd_map
  participant Tensors as Tensor & comm_dict build

  Compute->>Select: call select_real_atoms_sendlist(lmp_list, fwd_map)
  Select->>Map: map each send index -> forwarded index
  Map-->>Select: remapped sendlist (invalids removed)
  Select-->>Compute: lmp_list updated (sendlist/sendnum/recvnum)
  Compute->>Tensors: build send/recv tensors and comm_dict from updated lmp_list
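The flow above ends by building send/recv tensors from the updated lists. Tensor constructors such as torch::from_blob expect contiguous int data rather than an array of per-swap pointers, so a flattening step is needed; a minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Flatten per-swap send lists into one contiguous buffer of total_send ints,
// suitable for wrapping in a tensor. Assumes each per-swap list holds exactly
// sendnum[s] entries. Illustrative only, not the actual API.
inline std::vector<int> flatten_sendlist(
    const std::vector<std::vector<int>>& sendlist,
    const std::vector<int>& sendnum) {
  const int total_send = std::accumulate(sendnum.begin(), sendnum.end(), 0);
  std::vector<int> flat;
  flat.reserve(total_send);
  for (const auto& per_swap : sendlist) {
    flat.insert(flat.end(), per_swap.begin(), per_swap.end());
  }
  return flat;
}
```

Note that the flat buffer must outlive any tensor view created over it, since from_blob-style constructors do not copy or own the memory.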

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • wanghan-iapcm
  • njzjz
  • CaRoLZhangxy


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)

251-257: Remove stale commented-out code

Dead commented code obscures the current data path and makes maintenance harder.

Apply this diff:

-//      torch::Tensor firstrecv_tensor =
-//          torch::from_blob(lmp_list.firstrecv, {nswap}, int32_option);
-//      torch::Tensor recvnum_tensor =
-//          torch::from_blob(lmp_list.recvnum, {nswap}, int32_option);
-//      torch::Tensor sendnum_tensor =
-//          torch::from_blob(lmp_list.sendnum, {nswap}, int32_option);

266-269: Remove redundant commented-out legacy code

Same reasoning; commented legacy path is preserved in git history.

Apply this diff:

-//      int total_send =
-//          std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-//      torch::Tensor sendlist_tensor =
-//          torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between accc331 and f99ad4d.

📒 Files selected for processing (1)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)

185-204: Remapping logic LGTM

Correctly rebuilds per-swap send counts and a dense send list using fwd_map with bounds checks and filtering. Reserving capacity via the accumulated legacy counts is a good optimization.


226-232: firstrecv_new is unused and not required — original comment is incorrect

Short: deepmd/pt/model/descriptor/repflows.py builds comm_dict and calls torch.ops.deepmd.border_op with send_list, send_proc, recv_proc, send_num, recv_num, communicator (no first_recv). The computed firstrecv_new/firstrecv_tensor in the PT wrappers is dead code — remove it or document why it is kept.

Files to update:

  • source/api_cc/src/DeepPotPT.cc
    • Remove the firstrecv_new prefix-sum computation (around lines 226–231) and the unused firstrecv_tensor creation (around line 238).
  • source/api_cc/src/DeepSpinPT.cc
    • Same pattern: firstrecv_tensor is created around lines 187–191 but never used/inserted.

Suggested change (remove unused code) — example diff for DeepPotPT.cc:
@@
-      std::vector<int> firstrecv_new(nswap, 0);
-      int acc = 0;
-      for (int s = 0; s < nswap; ++s) {
-        firstrecv_new[s] = acc;
-        acc += recvnum_new[s];
-      }
+      /* firstrecv computation removed — not used by border_op */

@@
-      torch::Tensor firstrecv_tensor =
-          torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
+      /* firstrecv tensor omitted — border_op expects recv_num, not first_recv */
If you prefer to keep the computation for clarity, add a short comment explaining it's intentionally unused.

Likely an incorrect or invalid review comment.

@caic99 caic99 linked an issue Aug 22, 2025 that may be closed by this pull request
@iProzd iProzd marked this pull request as ready for review August 27, 2025 10:25
@caic99 caic99 requested a review from Copilot August 27, 2025 10:28

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a bug in the C++ custom operations for distributed MPI communication by implementing proper handling of NULL types in the send/receive lists. The fix ensures that atom indices are correctly mapped and synchronized across MPI processes by filtering out invalid indices and updating counts accordingly.

Key changes:

  • Added a new function to select and remap real atoms in send lists using forward mapping
  • Integrated the send list filtering into PyTorch, Paddle, and DeepSpin compute functions
  • Added a test case to verify NULL type handling in LAMMPS pair coefficients

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:
  • source/api_cc/src/common.cc: Implements the select_real_atoms_sendlist function to filter and remap atom indices in MPI send lists
  • source/api_cc/include/common.h: Adds the declaration for the new send-list selection function
  • source/api_cc/src/DeepPotPT.cc: Integrates send-list filtering into the PyTorch backend compute function
  • source/api_cc/src/DeepPotPD.cc: Integrates send-list filtering into the Paddle backend compute function
  • source/api_cc/src/DeepSpinPT.cc: Integrates send-list filtering into the DeepSpin PyTorch compute function
  • source/lmp/tests/test_lammps_dpa_pt.py: Adds a test case for NULL type handling in LAMMPS pair coefficients


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
source/api_cc/src/DeepSpinPT.cc (1)

205-213: Flatten the sendlist pointer-of-pointers into a contiguous buffer before calling from_blob

InputNlist::sendlist is declared as int ** (an array of pointers to per-swap arrays) in api_c/include/deepmd.hpp, so passing it directly to

torch::from_blob(lmp_list.sendlist, {total_send}, int32_option)

will interpret the array‐of‐pointers as a contiguous block of ints, which is invalid and will lead to incorrect data being read (or a crash). You must first flatten all per-swap subarrays into a single contiguous int buffer.

Please update the following files:

  • source/api_cc/src/DeepSpinPT.cc (around line 207)
  • source/api_cc/src/DeepPotPT.cc (around line 200)

Proposed refactoring to safely flatten (ensuring the buffer outlives the call to run_method):

-      torch::Tensor sendlist_tensor =
-          torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
+      // Flatten the int** sendlist into a single contiguous vector
+      std::vector<int> flat_sendlist;
+      flat_sendlist.reserve(total_send);
+      for (int s = 0; s < nswap; ++s) {
+        flat_sendlist.insert(
+            flat_sendlist.end(),
+            lmp_list.sendlist[s],
+            lmp_list.sendlist[s] + lmp_list.sendnum[s]);
+      }
+      // Create a tensor from the contiguous buffer
+      torch::Tensor sendlist_tensor =
+          torch::from_blob(flat_sendlist.data(), {total_send}, int32_option);


source/api_cc/src/DeepPotPD.cc (1)

421-439: Shape bug: send_list should carry nswap pointer addresses, not total_send elements

You reshape send_list to total_send but copy only nswap pointer addresses. This is a mismatched shape and can corrupt reads downstream.

-      int total_send =
-          std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-      sendlist_tensor->Reshape({total_send});
+      // send_list carries one pointer per swap
+      sendlist_tensor->Reshape({nswap});
       /**
       ** NOTE: paddle do not support construct a Tensor with from_blob(T**, ...)
       ** from a double pointer, so we convert int* pointer to indptr_t for each
       ** entry and wrap it into int64 Tensor as a workaround.
       */
       std::vector<std::intptr_t> pointer_addresses;
-      pointer_addresses.reserve(nswap);
+      pointer_addresses.reserve(nswap);
🧹 Nitpick comments (1)
source/api_cc/include/common.h (1)

105-107: API const-correctness: this mutates inlist; avoid const on parameter

The helper writes through inlist pointers (sendnum/recvnum/sendlist). Marking the parameter const is misleading.

-void select_real_atoms_sendlist(const deepmd::InputNlist& inlist,
-                                const std::vector<int>& fwd_map);
+// In-place: trims and remaps sendlist; updates sendnum/recvnum/(firstrecv if present)
+void select_real_atoms_sendlist(deepmd::InputNlist& inlist,
+                                const std::vector<int>& fwd_map);

If changing the signature ripples too far now, at least document the in-place side effects in a comment.

📜 Review details

Configuration used: CodeRabbit UI


📥 Commits

Reviewing files that changed from the base of the PR and between f99ad4d and af2411d.

📒 Files selected for processing (6)
  • source/api_cc/include/common.h (1 hunks)
  • source/api_cc/src/DeepPotPD.cc (1 hunks)
  • source/api_cc/src/DeepPotPT.cc (1 hunks)
  • source/api_cc/src/DeepSpinPT.cc (1 hunks)
  • source/api_cc/src/common.cc (1 hunks)
  • source/lmp/tests/test_lammps_dpa_pt.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/api_cc/src/DeepPotPT.cc
🧰 Additional context used
🧬 Code graph analysis (5)
source/lmp/tests/test_lammps_dpa_pt.py (1)
source/lmp/tests/test_lammps.py (1)
  • lammps_type_map (289-292)
source/api_cc/src/common.cc (1)
source/op/tf/prod_env_mat_multi_device.cc (1)
  • inlist (1590-1590)
source/api_cc/include/common.h (2)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
source/api_c/include/deepmd.hpp (4)
  • InputNlist (807-814)
  • InputNlist (815-822)
  • InputNlist (823-850)
  • InputNlist (851-851)
source/api_cc/src/DeepSpinPT.cc (1)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
source/api_cc/src/DeepPotPD.cc (1)
source/api_cc/src/common.cc (2)
  • select_real_atoms_sendlist (235-258)
  • select_real_atoms_sendlist (235-236)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (1)
source/api_cc/src/common.cc (1)

235-258: Recompute firstrecv after trimming and verify symmetry

You need to keep the receive‐offset array in sync whenever you shrink recvnum, and also confirm that the assumption recvnum[s] == sendnum[s] holds across all ranks/types.

• After your loop that sets

inlist.sendnum[s] = cnt;
inlist.recvnum[s] = cnt;

add:

+  // Recompute receive‐offsets to match updated recvnum[]
+  if (inlist.firstrecv) {
+    int offset = 0;
+    for (int s = 0; s < nswap; ++s) {
+      inlist.firstrecv[s] = offset;
+      offset += inlist.recvnum[s];
+    }
+  }

This keeps the firstrecv prefix sums correct ( firstrecv is “where to put 1st recv atom in each swap” ) (docs.deepmodeling.com).

• Because you’re setting recvnum[s] = sendnum[s], you’re assuming perfect send/recv symmetry per swap. Please verify this holds under all type‐map filters and MPI ranks. Consider adding a debug‐only assertion or a collective check (e.g. an MPI_Allreduce on the arrays) to catch any imbalance early.
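Alongside such a collective check, the firstrecv prefix-sum recomputation suggested above can be factored into a small standalone helper; the name and signature below are illustrative, not the deepmd API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recompute receive offsets as a prefix sum over the (possibly trimmed)
// per-swap recv counts, so firstrecv[s] gives the slot where the first
// received atom of swap s should be placed.
inline std::vector<int> recompute_firstrecv(const std::vector<int>& recvnum) {
  std::vector<int> firstrecv(recvnum.size(), 0);
  int offset = 0;
  for (std::size_t s = 0; s < recvnum.size(); ++s) {
    firstrecv[s] = offset;
    offset += recvnum[s];
  }
  return firstrecv;
}
```

Running this immediately after trimming recvnum keeps the offsets consistent with the shrunken counts, regardless of how many entries each swap lost.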


@HydrogenSulfate HydrogenSulfate left a comment


LGTM for source/api_cc/src/DeepPotPD.cc

Successfully merging this pull request may close these issues.

NULL atom type can not be used with deepmd 3.1.0a0 pth
4 participants