JuliaMolSim
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
Lines changed: 18 additions & 8 deletions
diff --git a/CondaPkg.toml b/CondaPkg.toml
Lines changed: 3 additions & 0 deletions
diff --git a/DEPRECATIONS.md b/DEPRECATIONS.md
Lines changed: 119 additions & 0 deletions
diff --git a/Project.toml b/Project.toml
Lines changed: 31 additions & 5 deletions
diff --git a/README.md b/README.md
Lines changed: 126 additions & 26 deletions
@@ -5,6 +5,7 @@ on:
       - master
     tags: ['*']
   pull_request:
+  workflow_dispatch:
 concurrency:
   # Skip intermediate builds: always.
   # Cancel intermediate builds: only if it is a pull request build.
@@ -18,20 +19,29 @@ jobs:
       fail-fast: false
       matrix:
         version:
-          - '1.9'
-          - '1'
-          - 'nightly'
+          - '1.11'
+          - '1.12'
         os:
           - ubuntu-latest
         arch:
           - x64
     steps:
-      - uses: actions/checkout@v2
-      - uses: julia-actions/setup-julia@v1
+      - uses: actions/checkout@v4
+      - uses: julia-actions/setup-julia@v2
         with:
           version: ${{ matrix.version }}
           arch: ${{ matrix.arch }}
-      - uses: julia-actions/cache@v1
-      # - run: julia -e 'using Pkg; pkg"registry add https://github.com/ACEsuit/ACEregistry.git"'
+      - uses: julia-actions/cache@v2
       - uses: julia-actions/julia-buildpkg@v1
-      - uses: julia-actions/julia-runtest@v1
+      - name: Run tests
+        shell: bash
+        run: |
+          julia --project=test --color=yes -e '
+            using Pkg
+            Pkg.instantiate()
+            # Resolve CondaPkg Python environment (installs matscipy)
+            using CondaPkg
+            CondaPkg.resolve()
+            # Run tests
+            include("test/runtests.jl")
+          '
@@ -0,0 +1,3 @@
+[deps]
+matscipy = ""
+numpy = ""
@@ -0,0 +1,119 @@
+# API Migration Guide
+
+This document outlines the transition from the legacy linked-list algorithm to the unified sort-based implementation in NeighbourLists.jl.
+
+## Version 0.6.x (Current)
+
+The sort-based implementation is now the recommended API. The legacy linked-list implementation remains available as a reference implementation used internally for testing correctness.
+
+### Legacy vs New API
+
+| Legacy API | New API | Notes |
+|------------|---------|-------|
+| `PairList(X::Vector{SVec}, cutoff, cell, pbc)` | `neighbour_list(X, cutoff, cell, pbc)` | Sort-based, parallelizable |
+| `CellList` struct | `SortedCellList` | Used internally by new API |
+| `_celllist_` | `build_cell_list` | Internal function |
+| `_pairlist_` | `materialize_pairlist` | Internal function |
+
+> **Note:** The legacy implementation will be retained indefinitely as a reference implementation for validating correctness in tests. However, new code should use the unified `neighbour_list()` API.
+
+### New Unified API
+
+```julia
+# High-level entry point (recommended)
+nlist = neighbour_list(X, cutoff, cell, pbc)
+
+# Lazy iteration (memory efficient)
+clist = neighbour_list(X, cutoff, cell, pbc; lazy=true)
+for_each_neighbour(clist, i) do j, R, S
+    # process neighbour
+end
+
+# Unified accessors (work with both PairList and SortedCellList)
+js, Rs, Ss = neighbours(nlist, i)
+n = num_neighbours(nlist, i)
+```
+
+### AtomsBase Support
+
+AtomsBase integration has been moved to a package extension. To use it:
+
+```julia
+using NeighbourLists
+using AtomsBase, Unitful
+
+# Extension loads automatically
+nlist = PairList(system, 5.0u"Å")
+```
+
+## Why Use the New API?
+
+### Benefits of Sort-Based Algorithm
+
+1. **GPU support**: Works on CUDA, ROCm, Metal, and oneAPI via KernelAbstractions.jl
+2. **Multi-threaded CPU**: Parallel construction and pair enumeration
+3. **Memory efficiency**: Option for lazy iteration without materializing all pairs
+4. **Consistent API**: Same code works on CPU and GPU
+
+## Migration Guide
+
+### Before (v0.5.x and earlier)
+
+```julia
+using NeighbourLists
+
+# Legacy linked-list constructor
+nlist = PairList(X, cutoff, cell, pbc)
+
+# Access neighbours
+j, R = neigs(nlist, i)
+```
+
+### After (v0.6.x+)
+
+```julia
+using NeighbourLists
+
+# New unified API (recommended)
+nlist = neighbour_list(X, cutoff, cell, pbc)
+
+# Or explicitly with backend
+nlist = neighbour_list(X, cutoff, cell, pbc; 
+                       backend=NeighbourLists.CPU())
+
+# GPU support
+using CUDA
+X_gpu = CuArray(X)
+nlist_gpu = neighbour_list(X_gpu, cutoff, cell, pbc)
+
+# Access neighbours (unchanged)
+j, R = neigs(nlist, i)
+# Or using unified accessor
+j, R, S = neighbours(nlist, i)
+
+# Lazy iteration (new, memory efficient)
+clist = neighbour_list(X, cutoff, cell, pbc; lazy=true)
+for_each_neighbour(clist, i) do j, R, S
+    # process each neighbour
+end
+```
+
+### AtomsBase Users
+
+```julia
+# Before: AtomsBase was always loaded
+using NeighbourLists
+
+# After: Load AtomsBase explicitly to enable extension
+using NeighbourLists
+using AtomsBase, Unitful
+
+# Then use as before
+nlist = PairList(system, 5.0u"Å")
+clist = build_cell_list(system, 5.0u"Å")
+```
+
+## Questions?
+
+If you have questions about migrating to the new API, please open an issue at:
+https://github.com/JuliaMolSim/NeighbourLists.jl/issues
@@ -1,26 +1,52 @@
 name = "NeighbourLists"
 uuid = "2fcf5ba9-9ed4-57cf-b73f-ff513e316b9c"
-version = "0.5.10"
+version = "0.6.0"
 
 [deps]
-AtomsBase = "a963bdd2-2df7-4f54-a1ee-49d51e6be12a"
+AcceleratedKernels = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
+Atomix = "a9b6321e-bd34-4604-b9c9-b65b8de01458"
+BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
+KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
+
+[weakdeps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+AtomsBase = "a963bdd2-2df7-4f54-a1ee-49d51e6be12a"
 Unitful = "1986cc42-f94f-5a68-af5c-568840ba703d"
 
+[extensions]
+NeighbourListsAtomsBaseExt = ["AtomsBase", "Unitful"]
+NeighbourListsCUDAExt = "CUDA"
+
 [compat]
-julia = "1"
-StaticArrays = "1"
+AcceleratedKernels = "0.4"
+Atomix = "0.1, 1"
 AtomsBase = "0.5"
+AtomsBuilder = "0.2.2"
+BenchmarkTools = "1.6.3"
+CUDA = "5"
+CondaPkg = "0.2"
+KernelAbstractions = "0.9"
 LinearAlgebra = "1"
+PythonCall = "0.9"
+StaticArrays = "1"
 Unitful = "1"
+julia = "1.11"
 
 [extras]
+AtomsBase = "a963bdd2-2df7-4f54-a1ee-49d51e6be12a"
+AtomsBuilder = "f5cc8831-eeb7-4288-8d9f-d6c1ddb77004"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+CondaPkg = "992eb4ea-22a4-4c89-a5bb-47a3300528ab"
 Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
 ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
 NearestNeighbors = "b8a86587-4115-5ab1-83bc-aa920d37bbce"
+PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
+PythonCall = "6099a3de-0909-46bc-b1f4-468b9a2dfc0d"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
+Unitful = "1986cc42-f94f-5a68-af5c-568840ba703d"
 
 [targets]
-test = ["Test", "Distances", "ForwardDiff", "NearestNeighbors", "Printf"]
+test = ["Test", "Distances", "ForwardDiff", "NearestNeighbors", "Printf", "AtomsBase", "AtomsBuilder", "PythonCall", "CondaPkg", "CUDA", "Unitful"]
@@ -1,40 +1,140 @@
 # NeighbourLists.jl
 
-A Julia port and restructuring of the neighbourlist implemented in
-[matscipy](https://github.com/libAtoms/matscipy) (with the authors' permission).
-Single-threaded, the Julia version is faster than matscipy for small systems,
-probably due  to the overhead of dealing with Python, but on large systems it is
-tends to be slower (up to around a factor 2 for 100,000 atoms). 
+A Julia package for computing neighbour lists in molecular simulations. Originally a port of the neighbourlist from [matscipy](https://github.com/libAtoms/matscipy), now extended with multi-threaded CPU and portable GPU support.
 
-The package is can be used stand-alone, with
-[JuLIP.jl](https://github.com/libAtoms/JuLIP.jl), or with [AtomsBase.jl](https://github.com/JuliaMolSim/AtomsBase.jl). 
+The package can be used stand-alone or with [AtomsBase.jl](https://github.com/JuliaMolSim/AtomsBase.jl).
 
-## Getting Started
+## Installation
 
-```Julia
+```julia
+using Pkg
 Pkg.add("NeighbourLists")
-using NeighbourLists
-?PairList
 ```
 
-### Usage via `AtomsBase.jl` 
+## Unified API (Recommended)
+
+The `neighbour_list()` function provides a unified entry point that works on both CPU and GPU with the same API. The backend is automatically detected from the array type.
+
+> **Note:** The legacy `PairList` constructor, which uses the linked-list algorithm, is retained as a reference implementation for testing. New code should use `neighbour_list()` instead. See [DEPRECATIONS.md](DEPRECATIONS.md) for migration details.
+
+### CPU Example (Multi-threaded)
+
+```julia
+using NeighbourLists, StaticArrays, LinearAlgebra
+
+# Create positions (CPU Vector)
+L = 10.0
+X = [SVector{3,Float64}(L*rand(), L*rand(), L*rand()) for _ in 1:10000]
+cell = SMatrix{3,3,Float64}(L*I)
+pbc = SVector{3,Bool}(true, true, true)
+
+# Build neighbour list (uses sort-based algorithm with multi-threading)
+nlist = neighbour_list(X, 3.0, cell, pbc)
+
+# Access neighbours of atom 1
+j, R, S = neighbours(nlist, 1)
+```
+
+### GPU Example (CUDA, ROCm, Metal, oneAPI)
 
 ```julia
-using ASEconvert, NeighbourLists, Unitful
-cu = ase.build.bulk("Cu") * pytuple((4, 2, 3))
-sys = pyconvert(AbstractSystem, cu)
-nlist = PairList(sys, 3.5u"Å")
-neigs_1, Rs_1 = neigs(nlist, 1)
+using NeighbourLists, StaticArrays, LinearAlgebra
+using CUDA  # or AMDGPU, Metal, oneAPI
+
+# Create positions on GPU (only difference: use CuArray)
+L = 10.0
+X = CuArray([SVector{3,Float64}(L*rand(), L*rand(), L*rand()) for _ in 1:10000])
+cell = SMatrix{3,3,Float64}(L*I)
+pbc = SVector{3,Bool}(true, true, true)
+
+# Same API - backend auto-detected from array type
+nlist = neighbour_list(X, 3.0, cell, pbc)
+```
+
+**What's the same:** The `neighbour_list()` API is identical on CPU and GPU. Cell matrix, cutoff, and boundary conditions work the same way.
+
+**What's different:** Only the array type changes (`Vector` vs `CuArray`/`ROCArray`/etc.). The backend is detected automatically from the array type, so it does not need to be specified manually.
+
+### Lazy Mode (Memory Efficient)
+
+For large systems where materializing all pairs is memory-intensive, use lazy mode:
+
+```julia
+# Returns a SortedCellList instead of materializing all pairs
+clist = neighbour_list(X, 3.0, cell, pbc; lazy=true)
+
+# Iterate without storing all pairs in memory
+for i in 1:nsites(clist)
+    for_each_neighbour(clist, i) do j, R, S
+        # process neighbour j with distance vector R and shift S
+    end
+end
 ```
 
-### Usage via `JuLIP.jl` 
+### AtomsBase.jl Integration
 
 ```julia
-using JuLIP 
-at = bulk(:Cu) * (4, 2, 3)
-nlist = neighbourlist(at, 3.5)
-neigs_1, Rs_1 = neigs(nlist, 1)
-``` 
-
-Please also look at the tests on how to use this package. Or just open an issue and
-ask.
+using AtomsBuilder, NeighbourLists, Unitful
+
+sys = bulk(:Cu, cubic=true) * (4, 4, 4)
+nlist = neighbour_list(sys, 5.0u"Å")
+j, R, S = neighbours(nlist, 1)  # neighbours of atom 1
+
+# Lazy mode also works with AtomsBase systems
+clist = neighbour_list(sys, 5.0u"Å"; lazy=true)
+for_each_neighbour(clist, 1) do j, R, S
+    # process neighbour
+end
+```
+
+The implementation uses [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) for portable parallelism and [AcceleratedKernels.jl](https://github.com/JuliaGPU/AcceleratedKernels.jl) for portable sorting. On CPU this enables multi-threading; on GPU it runs native parallel kernels.
+
+## Two Implementations
+
+The package provides two cell list implementations:
+
+| Implementation | Algorithm | Parallelism | Status |
+|---------------|-----------|-------------|--------|
+| **Sort-based** | Sort by cell ID | Multi-threaded CPU, GPU | Recommended |
+| **Legacy** | Linked-list | Single-threaded | Reference implementation for testing |
+
+Both produce identical results (validated in tests).
+
+**API Selection:**
+- `neighbour_list()` always uses the sort-based implementation (recommended)
+- `PairList(system::AbstractSystem, cutoff)` uses sort-based (for AtomsBase)
+- `PairList(X::Vector{SVec}, cutoff, cell, pbc)` uses legacy linked-list (reference implementation)
+
+## Migration Guide
+
+The legacy linked-list implementation (`CellList`, `_celllist_`, `_pairlist_`) is retained as a reference for testing correctness, but new code should use the unified API.
+
+**Recommended changes:**
+- Use `neighbour_list(X, cutoff, cell, pbc)` instead of `PairList(X, cutoff, cell, pbc)`
+- Use `neighbours(nlist, i)` instead of `neigs(nlist, i)` (both still work)
+- For memory-efficient iteration, use `neighbour_list(...; lazy=true)` with `for_each_neighbour`
+
+See [DEPRECATIONS.md](DEPRECATIONS.md) for the complete migration guide.
+
+## Benchmarks
+
+Benchmarks on NVIDIA RTX A4500 (cutoff = 5.0 Å, density = 0.05 atoms/Å³):
+
+| Atoms | Pairs | Legacy | CPU (1T) | CPU (8T) | GPU | Speedup |
+|------:|------:|-------:|---------:|---------:|--------:|--------:|
+| 1,000 | 26k | 8 ms | 3.6 ms | 3.4 ms | 2.3 ms | 3.5x |
+| 5,000 | 131k | 38 ms | 17 ms | 3.9 ms | 2.2 ms | 17x |
+| 10,000 | 262k | 84 ms | 35 ms | 7.8 ms | 2.4 ms | 36x |
+| 50,000 | 1.3M | 516 ms | 201 ms | 31 ms | 4.2 ms | 124x |
+| 100,000 | 2.6M | 1.1 s | 400 ms | 62 ms | 6.9 ms | 160x |
+
+GPU throughput: ~370 million pairs/second for large systems.
+
+*Note: Speedup is GPU vs Legacy. Run `julia --project -t N scripts/benchmark.jl` to reproduce.*
+
+
+## Acknowledgements
+
+- Originally inspired by the [matscipy](https://github.com/libAtoms/matscipy) neighbour list, written by Lars Pastewka
+- The linked-list approach was implemented by Christoph Ortner
+- The sort-based approach was proposed by Teemu Järvinen and Timon Gutleb, and implemented by James Kermode
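
The README above describes the recommended implementation only as "Sort by cell ID". As a rough illustration of that idea, here is a minimal, single-threaded sketch in plain Julia. It is not the NeighbourLists.jl implementation: the names `cell_coords`, `cell_id`, `min_image_dist2` and `sorted_cell_pairs` are hypothetical, and the sketch assumes a cubic, fully periodic box of side `L` with `L >= 3 * rc`, whereas the package accepts a general cell matrix and per-axis boundary conditions.

```julia
# Illustrative sketch only: NOT the NeighbourLists.jl implementation.
# Assumes a cubic, fully periodic box of side L with L >= 3 * rc.

# Map a position to 0-based integer cell coordinates, wrapping periodically.
cell_coords(x, L, nc) = ntuple(d -> mod(floor(Int, x[d] / L * nc), nc), 3)

# Flatten 0-based cell coordinates into a 1-based linear cell ID.
cell_id(c, nc) = 1 + c[1] + nc * (c[2] + nc * c[3])

# Squared minimum-image distance between two positions in the periodic box.
function min_image_dist2(xi, xj, L)
    r2 = 0.0
    for d in 1:3
        delta = xj[d] - xi[d]
        delta -= L * round(delta / L)   # wrap into [-L/2, L/2]
        r2 += delta^2
    end
    return r2
end

function sorted_cell_pairs(X, rc, L)
    nc = floor(Int, L / rc)             # cells per side, each at least rc wide
    nc >= 3 || error("this sketch needs L >= 3 * rc")
    ids = [cell_id(cell_coords(x, L, nc), nc) for x in X]
    perm = sortperm(ids)                # atom indices ordered by cell ID
    counts = zeros(Int, nc^3)
    for id in ids
        counts[id] += 1
    end
    # Cell c owns the contiguous slice perm[starts[c]:(starts[c + 1] - 1)].
    starts = cumsum([1; counts])
    pairs = Tuple{Int,Int}[]
    for i in eachindex(X)
        ci = cell_coords(X[i], L, nc)
        # Visit the 27 cells around atom i (distinct because nc >= 3).
        for dz in -1:1, dy in -1:1, dx in -1:1
            c = cell_id((mod(ci[1] + dx, nc), mod(ci[2] + dy, nc), mod(ci[3] + dz, nc)), nc)
            for k in starts[c]:(starts[c + 1] - 1)
                j = perm[k]
                j == i && continue
                min_image_dist2(X[i], X[j], L) < rc^2 && push!(pairs, (i, j))
            end
        end
    end
    return sort!(pairs)
end

# Two atoms that are neighbours only through the periodic boundary, plus one far away.
Xs = [(0.5, 0.5, 0.5), (19.5, 0.5, 0.5), (10.0, 10.0, 10.0)]
println(sorted_cell_pairs(Xs, 3.0, 20.0))   # prints [(1, 2), (2, 1)]
```

Because each cell's atoms occupy one contiguous slice of the sorted index array, the outer loop over atoms has no shared mutable state apart from the output buffer. That is presumably what makes this layout easier to parallelise than a linked list, although the sketch itself runs serially.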