Skip to content

Conversation

@jameskermode
Copy link
Contributor

New implementation with multi-theading and GPU support using KernalAbstractions.jl, AcceleratedKernels.jl and Atomix.jl

The sort-based approach was proposed by Teemu Järvinen and Timon Gutleb.

Benchmarks on NVIDIA RTX A4500 (cutoff = 5.0 Å, density = 0.05 atoms/ų):

Atoms Pairs CPU (1T) CPU (16T) GPU GPU Speedup
1,000 26k 8 ms 3.3 ms 2.4 ms 3.3x
5,000 130k 41 ms 4.3 ms 2.4 ms 17x
10,000 261k 82 ms 6.7 ms 2.6 ms 32x
50,000 1.3M 430 ms 26 ms 4.4 ms 98x
100,000 2.6M 880 ms 37 ms 7.4 ms 119x

jameskermode and others added 20 commits December 11, 2025 10:10
Major changes:
- Add new SortedCellList type with sort-based cell list construction
- Implement GPU kernels for cell ID computation, neighbour counting, and pair filling
- Add CUDA extension (ext/NeighbourListsCUDAExt.jl) with GPU-optimized methods
- Add lazy neighbour iteration via for_each_neighbour for efficient in-kernel access
- Update mapreduce operations with parallel implementations

New files:
- src/gpu_kernels.jl: KernelAbstractions-based GPU kernels
- ext/NeighbourListsCUDAExt.jl: CUDA-specific implementations
- benchmark/: Performance benchmarking tools
- CLAUDE.md: Development documentation

Performance improvements:
- ~450x speedup on GPU vs CPU for 50k atoms (NVIDIA RTX A4500)
- ~8x speedup with CPU multi-threading on lazy iteration

Dependencies added:
- Atomix (for atomic operations)
- CUDA as optional weak dependency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update GitHub Actions to latest versions (v4, v2)
- Remove CUDA from test targets (not used in tests, saves CI time)
- CUDA remains available as a weak dependency for users

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU Support section with usage example
- Remove JuLIP.jl references (no longer supported)
- Fix typo in intro paragraph

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- test_sortbased.jl: 783 tests for CPU sort-based cell list
  - Validates against legacy implementation
  - Tests all PBC combinations, non-cubic cells, edge cases
  - Tests for_each_neighbour lazy iteration
  - Tests map_sites! and map_pairs! functions
  - Stress tests with larger systems and high/low density

- test_gpu.jl: GPU tests (conditional on CUDA availability)
  - CPU vs GPU correctness validation
  - Tests cell list construction and pair materialization
  - Tests all PBC combinations and edge cases

- Fix CUDA extension type signatures for proper dispatch

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test/test_matscipy.jl with comprehensive validation against Python matscipy neighbourlist
- Tests include: cubic/non-cubic cells, all PBC combinations, edge cases, larger systems
- Add PythonCall and CondaPkg as test dependencies
- Add CondaPkg.toml to specify matscipy and numpy Python dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU vs matscipy validation tests in test_matscipy.jl
  - Tests skip gracefully when CUDA is unavailable
  - Includes basic validation, multiple configurations, and larger systems
- Fix CI workflow to ensure matscipy is available:
  - Create separate test environment at /tmp/test_env
  - Include all test dependencies in Project.toml
  - Copy CondaPkg.toml for matscipy installation
  - Use Pkg.develop() to include NeighbourLists
  - Run CondaPkg.resolve() before tests
- Add compat bounds for PythonCall and CondaPkg

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test/Project.toml with all test dependencies and [sources] for NeighbourLists
- Simplify CI.yml to use --project=test instead of inline Project.toml
- Test Julia 1.11 and 1.12 only (drop 1.9 and nightly)
- Update julia compat to 1.11 in main Project.toml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated import statement to include LineaAlgebra.
…ations

- Use KernelAbstractions kernels for CPU materialize_pairlist (enables threading)
- Add GPU dispatch for map_pairs! and map_pairs_d! with atomic accumulation
- Add map_pairs_d_vec! for GPU-friendly 3D vector force computation
- Fix Julia 1.12 threading bug (use maxthreadid() for thread-local arrays)
- Document map_sites! GPU limitation (requires dynamic allocation)
- Update README benchmarks with multi-threaded CPU (16T) vs GPU results
- Add benchmark script to scripts/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…API)

- Remove CUDA extension entirely - no more vendor-specific code
- Use Atomix.jl for portable atomic operations in GPU kernels
- Use AcceleratedKernels.jl for portable sorting and prefix sums
- Auto-detect backend from array type in build_cell_list and PairList
- Fix scalar indexing for GPU arrays in _scalar_getindex

The package now works with any GPU backend supported by
KernelAbstractions.jl without requiring vendor-specific extensions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add CPU example showing multi-threaded sort-based API
- Add GPU example showing the same API with CuArray
- Document what's common (API) and different (array type only)
- Update to reflect portable implementation (CUDA, ROCm, Metal, oneAPI)
- Mention KernelAbstractions.jl and AcceleratedKernels.jl

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add clear introduction describing package purpose and history
- Add table comparing legacy (linked-list) vs sort-based implementations
- Add Quick Start section with AtomsBase and raw array examples
- Reorganize for better flow from basics to advanced GPU usage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cortner cortner self-requested a review December 12, 2025 16:08
Copy link
Member

@cortner cortner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of my comments are blocking. I think we could just merge. But there are a number of things I'd like to discuss:

  • there is still a huge amount of repeated code also in the tests, makes it very hard to read the tests.
  • separate test scripts that basically to the same thing. Maybe we can write a small testing sub-module that implements the testing utility functions to make the main test scripts easier to read
  • it would be good to sketch a unified interface for the two implementations - and then we need to rewrite various dependencies
  • should AtomsBase be moved into an extension?
  • we need a GPU compatible structure - maybe this can be done with DecoratedParticles, which I anyhow want to integrate into ET asap.
  • should we deprecate the legacy code right away?

a few other minor points. None of them blocking, but if we merge now then I'd like to make a list of all comments and post them into an issue.

@cortner
Copy link
Member

cortner commented Dec 12, 2025

@jameskermode - based on the comments shall we just go ahead, merge and register or are there any points you'd like to discuss before we go ahead?

@jameskermode
Copy link
Contributor Author

Let me do another pass before merging

@cortner
Copy link
Member

cortner commented Dec 12, 2025

Can you bully Claude into streamlining the test suite?

@jameskermode
Copy link
Contributor Author

Can you bully Claude into streamlining the test suite?

"Got it - I'll be more aggressive about consolidating the tests. Let me update the plan and make more substantial changes."

- Reduce test code duplication: Created test/test_utils.jl with shared
  utilities (rand_config, comparison helpers, parametric test runners).
  Reduced test lines by ~60% while maintaining coverage.

- Add unified high-level API: neighbour_list(), neighbours(), num_neighbours()
  that work with both PairList and SortedCellList

- Move AtomsBase to extension: ext/NeighbourListsAtomsBaseExt.jl
  (loaded automatically when AtomsBase is imported)

- Add CUDA extension stub: ext/NeighbourListsCUDAExt.jl
  (main GPU code remains portable via KernelAbstractions)

- Add deprecation warnings to legacy PairList constructor with
  migration guidance to new API

- Create DEPRECATIONS.md documenting v0.6→v0.7 migration path

- Update benchmark script to use PrettyTables for cleaner output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@jameskermode
Copy link
Contributor Author

Next iteration, here's what's changed:

  • Test duplication: Created test/test_utils.jl with shared utilities, ~60% reduction in test code
  • Unified API: Added neighbour_list(), neighbours(), num_neighbours() working with both PairList and SortedCellList
  • Extensions: Moved AtomsBase to ext/NeighbourListsAtomsBaseExt.jl, added CUDA extension stub (main GPU code stays portable via KernelAbstractions)
  • Dual storage: Kept as-is—both sorted and original positions needed for coalesced GPU access + correct displacement calculations
  • GPU testing: Infrastructure now backend-agnostic via test_cpu_vs_gpu(to_gpu_fn)
  • Benchmarks: Updated to use PrettyTables
  • Deprecation: Added Base.depwarn() to legacy constructor + DEPRECATIONS.md for v0.6→v0.7 migration

jameskermode and others added 2 commits December 12, 2025 20:57
Addresses feedback: "would be great to have legacy, MT, GPU all in a single benchmark"

Now shows: Legacy (linked-list) | CPU (nT) | GPU | Speedup vs legacy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Shows full comparison across all implementations:
- Legacy (linked-list, single-threaded)
- CPU 1T (sort-based, single-threaded)
- CPU 8T (sort-based, multi-threaded)
- GPU (sort-based, CUDA)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
claude and others added 2 commits December 13, 2025 11:21
Replace two-step build_cell_list + materialize_pairlist with the
new neighbour_list() unified entry point introduced in dd2bf33.

- Use neighbour_list() in Quick Start and CPU/GPU examples
- Rename section from "Sort-Based API" to "Unified API"
- Add new "Lazy Mode" section showing lazy=true for memory efficiency
- Use neighbours() instead of neigs() for consistency
@jameskermode
Copy link
Contributor Author

Thanks for the detailed review @cortner! Here's how each point has been addressed:

1. Code Duplication in Tests

"there is still a huge amount of repeated code also in the tests"

Created test/test_utils.jl with shared utilities (rand_config, compare_pairlists, test_legacy_vs_sortbased, etc.). Reduced test code by ~60%. Tests now use parametric helpers like test_all_pbc_legacy_vs_sortbased() instead of copy-pasted loops.

2. Unified Interface

Suggested sketching a unified interface for both implementations

Added neighbour_list(X, cutoff, cell, pbc; lazy=false) as the unified entry point. Returns PairList by default, or SortedCellList with lazy=true for memory-efficient iteration. Works identically on CPU and GPU—backend auto-detected from array type.

3. AtomsBase Extension

Should AtomsBase be moved into an extension?

Done. Moved to ext/NeighbourListsAtomsBaseExt.jl. Loads automatically when user imports AtomsBase+Unitful. Also added ext/NeighbourListsCUDAExt.jl stub (main GPU code remains in core via KernelAbstractions for portability).

4. Legacy Code Deprecation

Should legacy code be deprecated immediately?

Added Base.depwarn() to legacy PairList(X::Vector{SVec}, ...) constructor pointing users to neighbour_list(). Created DEPRECATIONS.md documenting the v0.6→v0.7 migration path. Legacy code remains functional—just warns.

5. API Design (neighbour_list, neighbours)

Proposed neighbour_list() and neighbours() as high-level functions

Implemented exactly as suggested:

  • neighbour_list(X, cutoff, cell, pbc) → builds list
  • neighbours(nlist, i) → returns (js, Rs, Ss), works with both PairList and SortedCellList
  • num_neighbours(nlist, i) → count neighbours

6. Benchmark with PrettyTables

Requested single benchmark showing legacy, MT, GPU together

Updated scripts/benchmark.jl to show all implementations in one table using PrettyTables. README now shows Legacy | CPU(1T) | CPU(8T) | GPU | Speedup columns.

7. SortedCellList Dual Storage

Asked if both sorted and unsorted positions are necessary

Yes, both are needed. X (sorted by cell ID) enables coalesced GPU memory access during cell traversal. X_orig (original order) is required for displacement calculations since neighbor indices reference original positions. The permutation array maps between them.

8. GPU Testing

Referenced alternative GPU testing patterns

Test infrastructure now uses backend-agnostic helpers (test_cpu_vs_gpu(to_gpu_fn), test_all_pbc_cpu_vs_gpu). Currently tests CUDA when available; same helpers work for any KernelAbstractions backend.

- Remove dead code (_adjacent_cells, _wrap_cell_index functions)
- Unify _sub2ind/_sub2ind_scalar into single GPU-compatible version
- Unify position_to_cell_index variants into single function
- Consolidate bin_wrap_or_trunc overloads
- Extract _for_each_neighbor_pair helper to reduce GPU kernel duplication
- Unify CPU/GPU materialize_pairlist into single dispatch
- Add threaded_accumulate! helper for repeated threading pattern
- Standardize on get_array_backend throughout

Net reduction: ~120 lines while preserving all functionality.
Benchmarks show no performance regression.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@jameskermode
Copy link
Contributor Author

Maintainability Improvements

Addressed concerns about code duplication and maintainability with a refactoring commit that consolidates duplicate code patterns.

Summary of Changes

Change Impact
Remove dead code (_adjacent_cells, _wrap_cell_index) -22 lines
Unify _sub2ind / _sub2ind_scalar Single GPU-compatible version
Unify position_to_cell_index variants Single GPU-compatible version
Consolidate bin_wrap_or_trunc overloads Cleaner API
Extract _for_each_neighbor_pair helper Reduces GPU kernel duplication by ~50 lines
Unify CPU/GPU materialize_pairlist Single implementation with dispatch
Add threaded_accumulate! helper Removes repeated threading pattern (~30 lines)
Standardize on get_array_backend Consistent API

Net result: -122 lines (313 removed, 191 added) while preserving all functionality.

Benchmark Verification

Ran benchmarks on NVIDIA RTX A4500 - no performance regression:

Atoms CPU (8T) GPU
1,000 3.41 ms 2.23 ms
5,000 5.40 ms 2.20 ms
10,000 7.98 ms 2.35 ms
50,000 32.37 ms 4.12 ms
100,000 61.43 ms 6.95 ms

GPU throughput: 375 million pairs/second (consistent with previous 370M).

jameskermode and others added 6 commits December 13, 2025 21:54
- Move Unified API section to top as the recommended entry point
- Add deprecation notice with link to DEPRECATIONS.md
- Add dedicated Deprecation Notice section with migration summary
- Reorder implementations table: sort-based first with status column
- Merge AtomsBase example into Unified API section
- Promote Acknowledgements to top-level heading

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace neigs() with neighbours() in the AtomsBase.jl integration
example for consistency with the CPU example and the recommended
unified API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add FULL_PBC and NO_PBC constants to replace ~25 hardcoded SVec instances
- Add parametric test runners: test_edge_cases(), test_edge_cases_cpu_vs_gpu(),
  test_large_systems(), test_large_systems_cpu_vs_gpu()
- Update test_sortbased.jl, test_gpu.jl, test_matscipy.jl to use shared utilities
- Remove dead code: unused rand_config_aux() function from test_aux.jl

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU backend detection system that tries CUDA, then AMDGPU, then Metal
- New functions: init_gpu_backend!(), gpu_available(), gpu_backend(),
  to_gpu_array(), gpu_array_type()
- Use Base.invokelatest to handle world age issues with dynamically loaded modules
- Update test_gpu.jl to be backend-agnostic using to_gpu_array()
- Update test_matscipy.jl GPU tests to use multi-backend detection
- Tests now show which GPU backend is being used in test set names

Based on pattern from EquivariantTensors.jl

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add neighbour_list() method for AtomsBase systems in extension
- Create test_unified_api.jl with tests for:
  - neighbour_list() materialized and lazy modes
  - neighbours() function for PairList and SortedCellList
  - num_neighbours() function
  - max_neighbours() function
  - GPU support with unified API
- Expand test_atoms_base.jl to use unified API:
  - neighbour_list(), neighbours(), num_neighbours()
  - Lazy mode with AtomsBase systems
  - build_cell_list() direct usage
  - Unit handling tests
- Update README to show unified API for AtomsBase integration
- Clarify API selection between sort-based and legacy implementations

Test count: 222 → 668

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add count_lazy_pairs(), count_manual_neighbours(), verify_neighbours_consistency() to test_utils.jl
- Update test_unified_api.jl and test_atoms_base.jl to use new utilities
- Fix matscipy_available soft scope warning with global keyword

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@jameskermode
Copy link
Contributor Author

We can discuss when we meet, but I think all feedback has now been addressed and this should be OK to merge

jameskermode and others added 2 commits December 15, 2025 16:09
Add bool_to_val() helper to convert Bool to Val for type-stable dispatch.
The public API accepts Union{Bool, Val} so users can call with lazy=true
(convenient) or lazy=Val(true) (type-stable for performance-critical code).

Internal _neighbour_list() and _neighbour_list_ab() functions use Val{L}
dispatch so Julia can infer the return type at compile time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update documentation to reflect that the legacy linked-list implementation
will be retained indefinitely as a reference implementation for validating
correctness in tests. The depwarn is kept to encourage users to use the
new API, but removal is no longer planned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cortner
Copy link
Member

cortner commented Dec 15, 2025

I want to do a final cleanup of Project.toml and the AtomsBase extension. Can I merge and then make another PR before we register please?

@jameskermode
Copy link
Contributor Author

Yes of course - I believe I have also given you acess to this branch in my fork if you want to do it in one go

@cortner
Copy link
Member

cortner commented Dec 15, 2025

Ok, I'll do it there.

@cortner
Copy link
Member

cortner commented Dec 15, 2025

neighbours was behaving inconsistently. At first I thought it was only the documentation but it was also the implementation. I did the following:

  • point neighbours to neigss instead of neigs to return j, R, S instead of only j, R
  • remove get_neighbours; no point in having both. This resulted in a few global changes and one test removal

Furthermore:

  • I moved export statements to the top of files
  • I removed a few export statements for symbols that I think should be internal
  • moved the @testset outside a file for the atomsbase tests, makes it easier to run tests interactively

@cortner
Copy link
Member

cortner commented Dec 15, 2025

@jameskermode - if you are happy with my final edits, since tests pass, we can merge and register. (ignore the test_atoms_base.jl this is just an indentation shift which looks nasty but otherwise there is no real change; I should have made that a separate commit, sorry.)

@jameskermode
Copy link
Contributor Author

Yes, very happy, thanks for checking the API more carefully than me. Go ahead, we might need to cleanup in patch releases but I think this is good for 0.6.0.

@cortner cortner merged commit 1732252 into JuliaMolSim:master Dec 15, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants