-
Notifications
You must be signed in to change notification settings - Fork 4
Multi-threading and GPU support #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Major changes: - Add new SortedCellList type with sort-based cell list construction - Implement GPU kernels for cell ID computation, neighbour counting, and pair filling - Add CUDA extension (ext/NeighbourListsCUDAExt.jl) with GPU-optimized methods - Add lazy neighbour iteration via for_each_neighbour for efficient in-kernel access - Update mapreduce operations with parallel implementations New files: - src/gpu_kernels.jl: KernelAbstractions-based GPU kernels - ext/NeighbourListsCUDAExt.jl: CUDA-specific implementations - benchmark/: Performance benchmarking tools - CLAUDE.md: Development documentation Performance improvements: - ~450x speedup on GPU vs CPU for 50k atoms (NVIDIA RTX A4500) - ~8x speedup with CPU multi-threading on lazy iteration Dependencies added: - Atomix (for atomic operations) - CUDA as optional weak dependency 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update GitHub Actions to latest versions (v4, v2) - Remove CUDA from test targets (not used in tests, saves CI time) - CUDA remains available as a weak dependency for users 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU Support section with usage example - Remove JuLIP.jl references (no longer supported) - Fix typo in intro paragraph 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- test_sortbased.jl: 783 tests for CPU sort-based cell list - Validates against legacy implementation - Tests all PBC combinations, non-cubic cells, edge cases - Tests for_each_neighbour lazy iteration - Tests map_sites! and map_pairs! functions - Stress tests with larger systems and high/low density - test_gpu.jl: GPU tests (conditional on CUDA availability) - CPU vs GPU correctness validation - Tests cell list construction and pair materialization - Tests all PBC combinations and edge cases - Fix CUDA extension type signatures for proper dispatch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test/test_matscipy.jl with comprehensive validation against Python matscipy neighbourlist - Tests include: cubic/non-cubic cells, all PBC combinations, edge cases, larger systems - Add PythonCall and CondaPkg as test dependencies - Add CondaPkg.toml to specify matscipy and numpy Python dependencies 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU vs matscipy validation tests in test_matscipy.jl - Tests skip gracefully when CUDA is unavailable - Includes basic validation, multiple configurations, and larger systems - Fix CI workflow to ensure matscipy is available: - Create separate test environment at /tmp/test_env - Include all test dependencies in Project.toml - Copy CondaPkg.toml for matscipy installation - Use Pkg.develop() to include NeighbourLists - Run CondaPkg.resolve() before tests - Add compat bounds for PythonCall and CondaPkg 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test/Project.toml with all test dependencies and [sources] for NeighbourLists - Simplify CI.yml to use --project=test instead of inline Project.toml - Test Julia 1.11 and 1.12 only (drop 1.9 and nightly) - Update julia compat to 1.11 in main Project.toml 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated import statement to include LineaAlgebra.
…ations - Use KernelAbstractions kernels for CPU materialize_pairlist (enables threading) - Add GPU dispatch for map_pairs! and map_pairs_d! with atomic accumulation - Add map_pairs_d_vec! for GPU-friendly 3D vector force computation - Fix Julia 1.12 threading bug (use maxthreadid() for thread-local arrays) - Document map_sites! GPU limitation (requires dynamic allocation) - Update README benchmarks with multi-threaded CPU (16T) vs GPU results - Add benchmark script to scripts/ 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…API) - Remove CUDA extension entirely - no more vendor-specific code - Use Atomix.jl for portable atomic operations in GPU kernels - Use AcceleratedKernels.jl for portable sorting and prefix sums - Auto-detect backend from array type in build_cell_list and PairList - Fix scalar indexing for GPU arrays in _scalar_getindex The package now works with any GPU backend supported by KernelAbstractions.jl without requiring vendor-specific extensions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add CPU example showing multi-threaded sort-based API - Add GPU example showing the same API with CuArray - Document what's common (API) and different (array type only) - Update to reflect portable implementation (CUDA, ROCm, Metal, oneAPI) - Mention KernelAbstractions.jl and AcceleratedKernels.jl 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add clear introduction describing package purpose and history - Add table comparing legacy (linked-list) vs sort-based implementations - Add Quick Start section with AtomsBase and raw array examples - Reorganize for better flow from basics to advanced GPU usage 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cortner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of my comments are blocking. I think we could just merge. But there are a number of things I'd like to discuss:
- there is still a huge amount of repeated code also in the tests, makes it very hard to read the tests.
- separate test scripts that basically to the same thing. Maybe we can write a small testing sub-module that implements the testing utility functions to make the main test scripts easier to read
- it would be good to sketch a unified interface for the two implementations - and then we need to rewrite various dependencies
- should AtomsBase be moved into an extension?
- we need a GPU compatible structure - maybe this can be done with DecoratedParticles, which I anyhow want to integrate into ET asap.
- should we deprecate the legacy code right away?
a few other minor points. None of them blocking, but if we merge now then I'd like to make a list of all comments and post them into an issue.
|
@jameskermode - based on the comments shall we just go ahead, merge and register or are there any points you'd like to discuss before we go ahead? |
|
Let me do another pass before merging |
|
Can you bully Claude into streamlining the test suite? |
"Got it - I'll be more aggressive about consolidating the tests. Let me update the plan and make more substantial changes." |
- Reduce test code duplication: Created test/test_utils.jl with shared utilities (rand_config, comparison helpers, parametric test runners). Reduced test lines by ~60% while maintaining coverage. - Add unified high-level API: neighbour_list(), neighbours(), num_neighbours() that work with both PairList and SortedCellList - Move AtomsBase to extension: ext/NeighbourListsAtomsBaseExt.jl (loaded automatically when AtomsBase is imported) - Add CUDA extension stub: ext/NeighbourListsCUDAExt.jl (main GPU code remains portable via KernelAbstractions) - Add deprecation warnings to legacy PairList constructor with migration guidance to new API - Create DEPRECATIONS.md documenting v0.6→v0.7 migration path - Update benchmark script to use PrettyTables for cleaner output 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Next iteration, here's what's changed:
|
Addresses feedback: "would be great to have legacy, MT, GPU all in a single benchmark" Now shows: Legacy (linked-list) | CPU (nT) | GPU | Speedup vs legacy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Shows full comparison across all implementations: - Legacy (linked-list, single-threaded) - CPU 1T (sort-based, single-threaded) - CPU 8T (sort-based, multi-threaded) - GPU (sort-based, CUDA) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace two-step build_cell_list + materialize_pairlist with the new neighbour_list() unified entry point introduced in dd2bf33. - Use neighbour_list() in Quick Start and CPU/GPU examples - Rename section from "Sort-Based API" to "Unified API" - Add new "Lazy Mode" section showing lazy=true for memory efficiency - Use neighbours() instead of neigs() for consistency
…api-01KQqdVJVo9iAHaDztUysw2S
|
Thanks for the detailed review @cortner! Here's how each point has been addressed: 1. Code Duplication in Tests
Created 2. Unified Interface
Added 3. AtomsBase Extension
Done. Moved to 4. Legacy Code Deprecation
Added 5. API Design (neighbour_list, neighbours)
Implemented exactly as suggested:
6. Benchmark with PrettyTables
Updated 7. SortedCellList Dual Storage
Yes, both are needed. 8. GPU Testing
Test infrastructure now uses backend-agnostic helpers ( |
- Remove dead code (_adjacent_cells, _wrap_cell_index functions) - Unify _sub2ind/_sub2ind_scalar into single GPU-compatible version - Unify position_to_cell_index variants into single function - Consolidate bin_wrap_or_trunc overloads - Extract _for_each_neighbor_pair helper to reduce GPU kernel duplication - Unify CPU/GPU materialize_pairlist into single dispatch - Add threaded_accumulate! helper for repeated threading pattern - Standardize on get_array_backend throughout Net reduction: ~120 lines while preserving all functionality. Benchmarks show no performance regression. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Maintainability ImprovementsAddressed concerns about code duplication and maintainability with a refactoring commit that consolidates duplicate code patterns. Summary of Changes
Net result: -122 lines (313 removed, 191 added) while preserving all functionality. Benchmark VerificationRan benchmarks on NVIDIA RTX A4500 - no performance regression:
GPU throughput: 375 million pairs/second (consistent with previous 370M). |
- Move Unified API section to top as the recommended entry point - Add deprecation notice with link to DEPRECATIONS.md - Add dedicated Deprecation Notice section with migration summary - Reorder implementations table: sort-based first with status column - Merge AtomsBase example into Unified API section - Promote Acknowledgements to top-level heading 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace neigs() with neighbours() in the AtomsBase.jl integration example for consistency with the CPU example and the recommended unified API. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add FULL_PBC and NO_PBC constants to replace ~25 hardcoded SVec instances - Add parametric test runners: test_edge_cases(), test_edge_cases_cpu_vs_gpu(), test_large_systems(), test_large_systems_cpu_vs_gpu() - Update test_sortbased.jl, test_gpu.jl, test_matscipy.jl to use shared utilities - Remove dead code: unused rand_config_aux() function from test_aux.jl 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GPU backend detection system that tries CUDA, then AMDGPU, then Metal - New functions: init_gpu_backend!(), gpu_available(), gpu_backend(), to_gpu_array(), gpu_array_type() - Use Base.invokelatest to handle world age issues with dynamically loaded modules - Update test_gpu.jl to be backend-agnostic using to_gpu_array() - Update test_matscipy.jl GPU tests to use multi-backend detection - Tests now show which GPU backend is being used in test set names Based on pattern from EquivariantTensors.jl 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add neighbour_list() method for AtomsBase systems in extension - Create test_unified_api.jl with tests for: - neighbour_list() materialized and lazy modes - neighbours() function for PairList and SortedCellList - num_neighbours() function - max_neighbours() function - GPU support with unified API - Expand test_atoms_base.jl to use unified API: - neighbour_list(), neighbours(), num_neighbours() - Lazy mode with AtomsBase systems - build_cell_list() direct usage - Unit handling tests - Update README to show unified API for AtomsBase integration - Clarify API selection between sort-based and legacy implementations Test count: 222 → 668 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add count_lazy_pairs(), count_manual_neighbours(), verify_neighbours_consistency() to test_utils.jl - Update test_unified_api.jl and test_atoms_base.jl to use new utilities - Fix matscipy_available soft scope warning with global keyword 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
We can discuss when we meet, but I think all feedback has now been addressed and this should be OK to merge |
Add bool_to_val() helper to convert Bool to Val for type-stable dispatch.
The public API accepts Union{Bool, Val} so users can call with lazy=true
(convenient) or lazy=Val(true) (type-stable for performance-critical code).
Internal _neighbour_list() and _neighbour_list_ab() functions use Val{L}
dispatch so Julia can infer the return type at compile time.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update documentation to reflect that the legacy linked-list implementation will be retained indefinitely as a reference implementation for validating correctness in tests. The depwarn is kept to encourage users to use the new API, but removal is no longer planned. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
I want to do a final cleanup of Project.toml and the AtomsBase extension. Can I merge and then make another PR before we register please? |
|
Yes of course - I believe I have also given you acess to this branch in my fork if you want to do it in one go |
|
Ok, I'll do it there. |
|
Furthermore:
|
|
@jameskermode - if you are happy with my final edits, since tests pass, we can merge and register. (ignore the |
|
Yes, very happy, thanks for checking the API more carefully than me. Go ahead, we might need to cleanup in patch releases but I think this is good for 0.6.0. |
New implementation with multi-theading and GPU support using KernalAbstractions.jl, AcceleratedKernels.jl and Atomix.jl
The sort-based approach was proposed by Teemu Järvinen and Timon Gutleb.
Benchmarks on NVIDIA RTX A4500 (cutoff = 5.0 Å, density = 0.05 atoms/ų):