Implement neighborhood search based on CellListMap.jl #8
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   86.78%   87.54%   +0.75%     
==========================================
  Files          15       16       +1     
  Lines         560      618      +58     
==========================================
+ Hits          486      541      +55     
- Misses         74       77       +3     
Force-pushed a2b1d55 to f7f48e8
@lmiq It would be great if you could have a look as well, if you find the time, to check that I'm using CellListMap.jl correctly here and that I'm getting the best performance out of it.
Can you point to an MWE of a running benchmark with one of the methods you implemented? It is not obvious to me how to run the benchmarks.
I should probably update the plotting script and add this in the docs somewhere.

include("benchmarks/benchmarks.jl"); plot_benchmarks(benchmark_n_body, (10, 10, 10), 9, parallel=true)

where 9 means 9 iterations. This is the benchmark plotting code:

function plot_benchmarks(benchmark, n_points_per_dimension, iterations;
                         parallel = true, title = "",
                         seed = 1, perturbation_factor_position = 1.0)
    neighborhood_searches_names = [#"TrivialNeighborhoodSearch";;
                                   "DictionaryCellList";;
                                   "FullGridCellList";;
                                #    "PrecomputedNeighborhoodSearch";;
                                   "CellListMapNeighborhoodSearch"]
    # Multiply number of points in each iteration (roughly) by this factor
    scaling_factor = 4
    per_dimension_factor = scaling_factor^(1 / length(n_points_per_dimension))
    sizes = [round.(Int, n_points_per_dimension .* per_dimension_factor^(iter - 1))
             for iter in 1:iterations]
    n_particles_vec = prod.(sizes)
    times = zeros(iterations, length(neighborhood_searches_names))
    for iter in 1:iterations
        coordinates = point_cloud(sizes[iter], seed = seed,
                                  perturbation_factor_position = perturbation_factor_position)
        search_radius = 3.0
        NDIMS = size(coordinates, 1)
        n_particles = size(coordinates, 2)
        neighborhood_searches = [
            # TrivialNeighborhoodSearch{NDIMS}(; search_radius, eachpoint = 1:n_particles),
            GridNeighborhoodSearch{NDIMS}(; search_radius, n_points = n_particles),
            GridNeighborhoodSearch{NDIMS}(; cell_list = FullGridCellList(; search_radius,
                                                                         min_corner = minimum(coordinates, dims = 2) .- search_radius,
                                                                         max_corner = maximum(coordinates, dims = 2) .+ search_radius),
                                          search_radius, n_points = n_particles),
            CellListMapNeighborhoodSearch(NDIMS; search_radius)
        ]
        for i in eachindex(neighborhood_searches)
            neighborhood_search = neighborhood_searches[i]
            initialize!(neighborhood_search, coordinates, coordinates)
            time = benchmark(neighborhood_search, coordinates, parallel = parallel)
            times[iter, i] = time
            time_string = BenchmarkTools.prettytime(time * 1e9)
            println("$(neighborhood_searches_names[i])")
            println("with $(join(sizes[iter], "x")) = $(prod(sizes[iter])) particles finished in $time_string\n")
        end
    end
    plot(n_particles_vec, times ./ n_particles_vec .* 1e9,
         xaxis = :log, yaxis = :log,
         xticks = (n_particles_vec, n_particles_vec), linewidth = 2,
         xlabel = "#particles", ylabel = "runtime per particle [ns]",
         legend = :outerright, size = (750, 400), dpi = 200, margin = 4 * Plots.mm,
         label = neighborhood_searches_names, title = title)
end
Ok, I managed to run a very simple case:

julia> using CellListMap, PointNeighbors, StaticArrays
julia> function test_celllistmap(x)
           sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               f[:,i] .+= fij
               f[:,j] .-= fij
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 1 method)
julia> function test_pointneighbors(x)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
           initialize!(neighborhood_search, x, x)
           foreach_point_neighbor(x, x, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 1 method)
julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)  # second run
  3.557435 seconds (468.93 M allocations: 21.058 GiB, 42.83% gc time, 9 lock conflicts)
3×100000 Matrix{Float64}:
  40.841    -140.986   12.1143   77.6912   11.3534   16.8916    10.9625  …  32.2969       43.5096   7.40524   17.4521   82.746     4.78501
  -4.60323    10.6294  28.8101  -11.418   -33.4972  -12.3769   -17.9138     -0.0016226  -222.654   -4.39804  -80.1207   37.4482   51.3636
 -57.4807     14.2068  15.9348   71.1715   68.6271   30.9155  -120.342      -2.29914     -55.8974   3.31074  117.948   -27.3466  -26.7153
julia> @time test_pointneighbors(xmat) # second run
  0.857072 seconds (967 allocations: 49.540 MiB, 0.49% gc time)
3×100000 Matrix{Float64}:
  40.841    -140.986   12.1143   77.6912   11.3534   16.8916    10.9625  …  32.2969       43.5096   7.40524   17.4521   82.746     4.78501
  -4.60323    10.6294  28.8101  -11.418   -33.4972  -12.3769   -17.9138     -0.0016226  -222.654   -4.39804  -80.1207   37.4482   51.3636
 -57.4807     14.2068  15.9348   71.1715   68.6271   30.9155  -120.342      -2.29914     -55.8974   3.31074  117.948   -27.3466  -26.7153
julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)
 40.969504 seconds (5.51 G allocations: 247.261 GiB, 45.60% gc time, 13 lock conflicts)
3×1000000 Matrix{Float64}:
    9.20059   -2.41152     0.114992  107.299   -23.107   16.395   -19.8264  …  135.155    -4.36814  -3.9865    5.18173  107.441    -24.7465
  -26.039     16.8446    -37.0278    -13.3514   74.7956  15.8967   34.0362     -57.17     19.5576   46.6715  -22.3012    26.9519  -156.584
 -168.725    -63.8354   -163.919     -18.8817   51.9869  19.4064   34.0843      58.2957  -15.1018   37.0779  -16.5614   -23.6634    -5.18014
julia> @time test_pointneighbors(xmat)
 23.620195 seconds (6.68 k allocations: 496.365 MiB, 0.02% gc time)
3×1000000 Matrix{Float64}:
    9.20059   -2.41152     0.114992  107.299   -23.107   16.395   -19.8264  …  135.155    -4.36814  -3.9865    5.18173  107.441    -24.7465
  -26.039     16.8446    -37.0278    -13.3514   74.7956  15.8967   34.0362     -57.17     19.5576   46.6715  -22.3012    26.9519  -156.584
 -168.725    -63.8354   -163.919     -18.8817   51.9869  19.4064   34.0843      58.2957  -15.1018   37.0779  -16.5614   -23.6634    -5.18014
I guess this is more or less consistent with what you see? If the two sets are different, I get:

julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> y, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> ymat = Matrix(reinterpret(reshape, Float64, y));
julia> function test_celllistmap(x,y)
           sys = ParticleSystem(xpositions=x, ypositions=y, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               f[:,i] .+= fij
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 2 methods)
julia> @time test_celllistmap(xmat,ymat)
  2.527272 seconds (469.03 M allocations: 21.066 GiB, 39.28% gc time, 12 lock conflicts, 15.07% compilation time)
3×100000 Matrix{Float64}:
  21.8077   -0.183917  -25.5396  -38.6599    -7.02715   -22.8335  -115.32    …  -10.6988   21.6762    12.2227   -1.46057    13.9872    7.5366
 -17.4007  -35.3851     14.0842    6.13937   30.6116    -10.2836   309.663      116.57     -6.51407  138.022     8.54818    25.051   155.299
 -52.4528   70.1513     26.3846   31.3363   -72.5124   -166.564     31.0876      29.6242  -15.704     -5.80962  11.3542   -157.471     4.72672
julia> function test_pointneighbors(x,y)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
           initialize!(neighborhood_search, x, y)
           foreach_point_neighbor(x, y, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 2 methods)
julia> @time test_pointneighbors(xmat,ymat)
  1.100204 seconds (942.08 k allocations: 92.813 MiB, 0.89% gc time, 21.68% compilation time)
3×100000 Matrix{Float64}:
  21.8077   -0.183917  -25.5396  -38.6599    -7.02715   -22.8335  -115.32    …  -10.6988   21.6762    12.2227   -1.46057    13.9872    7.5366
 -17.4007  -35.3851     14.0842    6.13937   30.6116    -10.2836   309.663      116.57     -6.51407  138.022     8.54818    25.051   155.299
 -52.4528   70.1513     26.3846   31.3363   -72.5124   -166.564     31.0876      29.6242  -15.704     -5.80962  11.3542   -157.471     4.72672
Let me know if (when) you are supporting periodic boundary conditions :-). edit: Seems that orthorhombic boxes are already supported? (It is unusual to provide minimum and maximum coordinates instead of lengths, though. Does it make any difference if the lengths are the same, but the corners are placed differently? I guess it doesn't.)
I don't think your benchmarks are fair.

This originated from the way we initialize simulations. We might have a pipe flow with open boundaries and then want to use periodic boundaries instead, and then we can just provide the bounding box for the domain. We don't have any plans to provide more complex periodic boundaries. But when this PR is merged, we can add support for complex periodic boundaries with the CellListMap.jl backend.
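For illustration, here is a minimal sketch of specifying such a bounding box in PointNeighbors.jl. This is my own sketch, assuming the PeriodicBox(; min_corner, max_corner) constructor and the periodic_box keyword of GridNeighborhoodSearch; the exact API may differ.

using PointNeighbors

# Assumed API: periodic (orthorhombic) domain given by its corners rather than by side lengths
periodic_box = PeriodicBox(min_corner = [0.0, 0.0, 0.0], max_corner = [1.0, 1.0, 1.0])

nhs = GridNeighborhoodSearch{3}(; search_radius = 0.1, n_points = 1000, periodic_box)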
Yes, good catch (although why it allocates I don't know - it should not). Unfortunately, everything I do lately I have to do in a rush... With that fixed I get:

julia> function test_celllistmap(x)
           sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
           map_pairwise!(sys) do x, y, i, j, d2, f
               d = sqrt(d2)
               fij = (y - x) / d^3
               for k in eachindex(fij)
                   f[k,i] += fij[k]
                   f[k,j] -= fij[k]
               end
               return f
           end
           return sys.output
       end
test_celllistmap (generic function with 2 methods)
julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)  # second run
  0.298172 seconds (13.65 k allocations: 97.775 MiB, 1.98% gc time, 12 lock conflicts)
3×100000 Matrix{Float64}:
 -88.6852   -129.101   -46.6896  -48.3546  147.853   -24.7797    18.0712  …   -6.32157    9.97575    -1.63363  10.3277   15.7659    -29.5342
   5.05557   284.657    42.8502  -10.2702  101.038    69.9032  -101.463       21.2009   -36.1706   -128.78     32.0599    6.22065     4.78016
 101.876      97.5458   80.0019    1.7858  147.629  -120.023    -20.9682     -19.8745   -38.8108     26.2592   -2.34381   0.131151  108.677
julia> @time test_pointneighbors(xmat) # second run
  0.937435 seconds (968 allocations: 49.540 MiB, 2.56% gc time)
3×100000 Matrix{Float64}:
 -88.6852   -129.101   -46.6896  -48.3546  147.853   -24.7797    18.0712  …   -6.32157    9.97575    -1.63363  10.3277   15.7659    -29.5342
   5.05557   284.657    42.8502  -10.2702  101.038    69.9032  -101.463       21.2009   -36.1706   -128.78     32.0599    6.22065     4.78016
 101.876      97.5458   80.0019    1.7858  147.629  -120.023    -20.9682     -19.8745   -38.8108     26.2592   -2.34381   0.131151  108.677

and for the larger case:

julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)
  3.535903 seconds (103.01 k allocations: 1.097 GiB, 5.36% gc time, 11 lock conflicts)
3×1000000 Matrix{Float64}:
 15.8458    6.61017  129.444    16.249   -36.1833  -47.6617   -4.15331  …   83.2507  -135.724   -8.00536    8.17124    -9.491   -13.0002
 -2.91166  17.8658   -28.7162   17.7136  -39.6242   44.6349  -14.1746       12.7022   -14.5613  -3.03081   -8.79777  -136.763     4.94949
 19.9325    2.57851  -18.5558  -17.7504  -31.7794  173.664    -9.83589     -13.8241    37.9389  -7.4949   175.836     -37.2366   35.1991
julia> @time test_pointneighbors(xmat)
 24.696261 seconds (6.68 k allocations: 496.365 MiB, 0.01% gc time)
3×1000000 Matrix{Float64}:
 15.8458    6.61017  129.444    16.249   -36.1833  -47.6617   -4.15331  …   83.2507  -135.724   -8.00536    8.17124    -9.491   -13.0002
 -2.91166  17.8658   -28.7162   17.7136  -39.6242   44.6349  -14.1746       12.7022   -14.5613  -3.03081   -8.79777  -136.763     4.94949
 19.9325    2.57851  -18.5558  -17.7504  -31.7794  173.664    -9.83589     -13.8241    37.9389  -7.4949   175.836     -37.2366   35.1991

(does this make sense?) I tried to run with the FullGridCellList and got an error:

julia> function test_pointneighbors(x)
           f = zero(x)
           neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2), 
                cell_list=FullGridCellList(search_radius=5.0, min_corner=[0.0, 0.0, 0.0], max_corner=[20.274, 20.274, 20.274]))
           initialize!(neighborhood_search, x, x)
           foreach_point_neighbor(x, y, neighborhood_search, parallel=true) do i, j, pos_diff, distance
               distance < sqrt(eps()) && return
               dv = -pos_diff / distance^3
               for dim in axes(f, 1)
                   @inbounds f[dim, i] += dv[dim]
               end
           end
           return f
       end
test_pointneighbors (generic function with 2 methods)
julia> @time test_pointneighbors(xmat)
ERROR: BoundsError: attempt to access 100×216 Matrix{Int32} at index [101, 115]
Stacktrace:
  [1] throw_boundserror(A::Matrix{Int32}, I::Tuple{Int64, Int64})
    @ Base ./essentials.jl:14
  [2] checkbounds
    @ ./abstractarray.jl:699 [inlined]
  [3] setindex!
    @ ./array.jl:993 [inlined]
  [4] pushat!
    @ ~/.julia/packages/PointNeighbors/vKy7i/src/vector_of_vectors.jl:64 [inlined]
  [5] push_cell!
    @ ~/.julia/packages/PointNeighbors/vKy7i/src/cell_lists/full_grid.jl:124 [inlined]
  [6] initialize_grid!(neighborhood_search::GridNeighborhoodSearch{…}, y::Matrix{…})
    @ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:176
  [7] initialize!(neighborhood_search::GridNeighborhoodSearch{…}, x::Matrix{…}, y::Matrix{…})
    @ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:162
  [8] test_pointneighbors(x::Matrix{Float64})
    @ Main ./REPL[85]:5
  [9] macro expansion
    @ ./timing.jl:581 [inlined]
 [10] top-level scope
    @ ./REPL[86]:1
Some type information was truncated. Use `show(err)` to see complete types.
ps: The allocations come from the combination of the slice and broadcasting: `f[:,i] .+= fij` gets interpreted as `f[:,i] .= f[:,i] .+ fij`, and then the `f[:,i]` on the right-hand side is a plain getindex, which allocates a copy.
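For completeness, a small sketch (my own, not from this PR) of two ways to avoid that allocation: making the right-hand-side slice a view, or using an explicit loop as in the fixed benchmark above.

using StaticArrays

f = zeros(3, 10)
fij = SVector(1.0, 2.0, 3.0)
i = 4

# Allocates: the f[:, i] on the right-hand side of the lowered expression is a copy
f[:, i] .+= fij

# Option 1: turn both slices into views, so the broadcast is fully in place
@views f[:, i] .+= fij

# Option 2: explicit loop over the components (as in the fixed benchmark above)
for k in eachindex(fij)
    f[k, i] += fij[k]
end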
CellListMap.map_pairwise!(0, box, cell_list,
                          parallel = parallel !== false) do x, y, i, j, d2, output
    # Skip all indices not in `points`
    i in points || return output
Doesn't this produce a notable overhead which affects the benchmark?
I tried. It's negligible.
Just out of interest. Am I missing something?
This check is done for each particle - so multiple times. For a large set, the overhead accumulates, no?
julia> f(i, points) = i in points
f (generic function with 1 method)
julia> points = rand(Int, 1_000_000);
julia> @benchmark f($5, $points)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  221.584 μs … 380.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     221.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   223.171 μs ±   4.460 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  █▅▂▄▁   ▅  ▁  ▁                                               ▁
  █████▇▆██▇██▆██▇▆▆▆▇▅▆▆▅▅▆▅▇▅▄▄▃▅▄▄▅▃▃▄▃▄▂▂▃▃▃▃▂▃▂▂▄▅▇▆▄▃▅▂▃▇ █
  222 μs        Histogram: log(frequency) by time        243 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
The difference is that this is about 1-2% of the actual computation when you have a real example where you are doing actual work per particle.
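For what it's worth, the cost of `i in points` also depends strongly on the container (my own micro-sketch): with a Vector, as in the micro-benchmark above, `in` is a linear scan, while with a UnitRange such as `eachpoint = 1:n_particles` it is a constant-time check.

using BenchmarkTools

points_vec   = rand(Int, 1_000_000)  # Vector{Int}: `in` scans linearly
points_range = 1:1_000_000           # UnitRange{Int}: `in` is an O(1) range check

@btime 5 in $points_vec    # on the order of the ~220 μs measured above
@btime 5 in $points_range  # a few nanoseconds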
@lmiq This is a bit tricky. You have to pad the bounding box with the search radius and increase the maximum number of particles per cell. The former we want to change (#80), the latter should have a more meaningful error message (#65). Here is a benchmark for only the interaction, not the initialization:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools
x, _ = CellListMap.xgalactic(10^6);
xmat = Matrix(reinterpret(reshape, Float64, x));
f = similar(xmat)
f2 = similar(xmat)
f3 = similar(xmat)
f4 = similar(xmat)
nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
initialize!(nhs1, xmat, xmat)
min_corner = minimum(xmat, dims=2) .- 5
max_corner = maximum(xmat, dims=2) .+ 5
nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
                                 cell_list=FullGridCellList(search_radius=5.0;
                                                            min_corner, max_corner,
                                                            max_points_per_cell=2000))
initialize!(nhs2, xmat, xmat)
sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)
function test_celllistmap(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end
        return f
    end
    return sys.output
end
function test_celllistmap_different(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end
        return f
    end
    return sys.output
end
function test_pointneighbors(x, f, nhs)
    f .= 0
    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end
I get this:

[benchmark timings not shown]

The last three seem reasonable based on my original benchmarks in the first post. The first one looks suspicious. Why is the first one so much faster? In my benchmarks in the first post, I found that the speedup comes only from using symmetry to skip half the pairs. But this should be a speedup of at most 2x.
Indeed, there's something not scaling well there in the two-set case. The parallelization mechanisms and splitting of tasks are different there, and the reason for that is empirical testing of different schemes at some point in the development. I will probably have to revisit the behavior of the two-set case at some point to check these issues. For the record, this is what I got, increasing the number of particles:

julia> run(;n=5*10^4)
CellListMap (t1): 0.163701008 s
CellListMap different (t2/t1): 1.8283861147635694
PointNeighbors nhs1 (t3/t1): 2.213968633595708
PointNeighbors nhs2 (t4/t1): 2.1056406384498256
Test Passed
julia> run(;n=10^5)
CellListMap (t1): 0.305206052 s
CellListMap different (t2/t1): 2.0926074362378633
PointNeighbors nhs1 (t3/t1): 2.8729478044557255
PointNeighbors nhs2 (t4/t1): 2.8898966918257574
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 1.581912255 s
CellListMap different (t2/t1): 2.4655888565703035
PointNeighbors nhs1 (t3/t1): 4.414566006380676
PointNeighbors nhs2 (t4/t1): 4.29086459855512
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 3.4457909320000004 s
CellListMap different (t2/t1): 3.6117841437862372
PointNeighbors nhs1 (t3/t1): 6.811157335183328
PointNeighbors nhs2 (t4/t1): 6.209188126681157
Test Passed

Here we can see that for a smallish number of particles the ratio between "one-set" and "two-sets" is roughly 2x, which is what we expect, but it is getting worse as the number of particles increases. This is the code I ran:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools, Chairmarks
using Test
function test_celllistmap(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end
        return f
    end
    return sys.output
end
function test_celllistmap_different(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end
        return f
    end
    return sys.output
end
function test_pointneighbors(x, f, nhs)
    f .= 0
    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end
function run(;n=10^5)
    x, _ = CellListMap.xgalactic(n);
    xmat = Matrix(reinterpret(reshape, Float64, x));
    f = similar(xmat)
    f2 = similar(xmat)
    f3 = similar(xmat)
    f4 = similar(xmat)
    
    nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
    initialize!(nhs1, xmat, xmat)
    
    min_corner = minimum(xmat, dims=2) .- 5
    max_corner = maximum(xmat, dims=2) .+ 5
    nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
                                    cell_list=FullGridCellList(search_radius=5.0;
                                                                min_corner, max_corner,
                                                                max_points_per_cell=2000))
    initialize!(nhs2, xmat, xmat)
    sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
    sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)
    b1 = @b test_celllistmap($(nothing), $sys)
    println("CellListMap (t1): ", b1.time, " s")
    b2 = @b test_celllistmap_different($(nothing), $sys2)
    println("CellListMap different (t2/t1): ", b2.time / b1.time) 
    b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
    println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
    b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
    println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)
    @test f ≈ f2 ≈ f3 ≈ f4
end
Many thanks for the discussion!

Interesting that I didn't see this in my original benchmark, even though I'm using much bigger problems there. Probably because our use cases, and therefore our benchmarks, are very different. You use a much larger search radius with almost 2000 particles per cell. In our simulations and benchmarks, we usually have fewer than 30 particles per cell and ~100 neighbor particles within the search radius.

Please report back when you find something interesting.
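For a rough sanity check of these densities, here is my own back-of-the-envelope estimate, assuming the cell edge equals the search radius h: the expected number of neighbors within the search radius is the particles per cell times the volume ratio of a sphere of radius h to a cube of edge h, i.e. 4π/3 ≈ 4.19.

# Back-of-the-envelope estimate (assumption: cell edge == search radius h):
# expected neighbors within the search radius from the particles per cell
neighbors_in_radius(n_per_cell) = n_per_cell * 4π / 3

neighbors_in_radius(30)    # ≈ 126, matching the "~100 neighbors" SPH-like setting above
neighbors_in_radius(2000)  # ≈ 8378 for the dense benchmark discussed here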
Indeed, the density relative to the cutoff makes a lot of difference. I have tried to optimize mostly for atomic-like densities. The atomic systems are less dense (relative to the cutoff). They are generated with the CellListMap.xatomic function:

julia> run(;n=10^5)
CellListMap (t1): 0.053651925 s
CellListMap different (t2/t1): 2.118421175009098
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 0.272764123 s
CellListMap different (t2/t1): 2.6336540088155216
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 0.5655830310000001 s
CellListMap different (t2/t1): 3.85268252505263
Test Passed

(I couldn't run the PointNeighbors benchmarks for this case.)
The difference in performance, though, has nothing to do with the scaling of the parallelization. It is the algorithm that is different. In fact, I had already noted that possibility and the code is commented. The thing is that when using a full cell-list approach, I'm using a projection approach to eliminate the calculation of unnecessary distances. It consists of projecting the positions of the particles onto the vector connecting the centers of the cells being considered, sorting the positions along that line, and only computing interactions if the distance along that line is smaller than the cutoff. This method was proposed in [1] and [2]. For the two-set case, this is clearly very advantageous if both sets are large, but if one of the sets is small the cost of computing the cell lists for the larger set becomes less favorable (at least that was the case when I benchmarked this). Probably I will need to test again whether computing the lists for all particles is worth the effort now, or at least switch the method automatically using some criterion. (A rough sketch of this projection idea is given after the benchmark code below.) Code I ran for the benchmark above:

using CellListMap, PointNeighbors, StaticArrays
using Chairmarks
using Test
function test_celllistmap(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
            @inbounds f[dim, j] -= fij[dim]
        end
        return f
    end
    return sys.output
end
function test_celllistmap_different(_, sys)
    sys.output .= 0
    map_pairwise!(sys) do x, y, i, j, d2, f
        d2 < eps() && return f
        d = sqrt(d2)
        fij = (y - x) / d^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += fij[dim]
        end
        return f
    end
    return sys.output
end
function test_pointneighbors(x, f, nhs)
    f .= 0
    foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
        distance < sqrt(eps()) && return
        dv = -pos_diff / distance^3
        for dim in axes(f, 1)
            @inbounds f[dim, i] += dv[dim]
        end
    end
    return f
end
function run(;n=10^5)
    x, box = CellListMap.xatomic(n);
    xmat = Matrix(reinterpret(reshape, Float64, x));
    f = similar(xmat)
    f2 = similar(xmat)
    f3 = similar(xmat)
    f4 = similar(xmat)
    
    nhs1 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2))
    initialize!(nhs1, xmat, xmat)
    
    min_corner = minimum(xmat, dims=2) .- 5
    max_corner = maximum(xmat, dims=2) .+ 5
    nhs2 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2),
                                    cell_list=FullGridCellList(search_radius=box.cutoff;
                                                                min_corner, max_corner,
                                                                max_points_per_cell=2000))
    initialize!(nhs2, xmat, xmat)
    sys = ParticleSystem(positions=x, cutoff=box.cutoff, output=f)
    sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=box.cutoff, output=f2)
    b1 = @b test_celllistmap($(nothing), $sys)
    println("CellListMap (t1): ", b1.time, " s")
    b2 = @b test_celllistmap_different($(nothing), $sys2)
    println("CellListMap different (t2/t1): ", b2.time / b1.time) 
#    b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
#    println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
#    b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
#    println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)
#    @test f ≈ f2 ≈ f3 ≈ f4
    @test f ≈ f2
end
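To make the projection idea described above concrete, here is a minimal toy sketch (my own illustration, not CellListMap.jl's actual implementation): for two neighboring cells, project the particle positions onto the axis connecting the cell centers, sort one cell's particles along that axis, and only evaluate full distances for pairs that are already within the cutoff along that axis.

using LinearAlgebra, StaticArrays

# Toy sketch of the projection trick (illustration only): count pairs within `cutoff`
# between particles of two neighboring cells, skipping most distance evaluations.
function count_pairs_projected(xs, ys, center_x, center_y, cutoff)
    axis = normalize(center_y - center_x)
    # Projected coordinate (and original index) of every particle in the second cell, sorted
    proj_y = sort!([(dot(y, axis), j) for (j, y) in enumerate(ys)])
    npairs = 0
    for x in xs
        px = dot(x, axis)
        # Only particles whose projection lies in [px - cutoff, px + cutoff] can be neighbors
        lo = searchsortedfirst(proj_y, (px - cutoff, 0))
        hi = searchsortedlast(proj_y, (px + cutoff, typemax(Int)))
        for k in lo:hi
            j = proj_y[k][2]
            npairs += norm(x - ys[j]) < cutoff
        end
    end
    return npairs
end

# Tiny usage example: two adjacent unit cells with 100 random particles each
xs = [rand(SVector{3, Float64}) for _ in 1:100]
ys = [rand(SVector{3, Float64}) + SVector(1.0, 0.0, 0.0) for _ in 1:100]
count_pairs_projected(xs, ys, SVector(0.5, 0.5, 0.5), SVector(1.5, 0.5, 0.5), 0.3)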
In this branch I have implemented the strategy of computing cell lists for both sets of particles, and then using the same method to avoid computation of distances as was used in the single-set problems: https://github.com/m3g/CellListMap.jl/tree/new_CellListPair

Now I get the following results from the benchmark in #8 (comment):

julia> run(;n=5*10^4)
CellListMap (t1): 0.152432735 s
CellListMap different (t2/t1): 1.6050779643886859
PointNeighbors nhs1 (t3/t1): 2.2481417918533046
PointNeighbors nhs2 (t4/t1): 2.258499921293153
Test Passed
julia> run(;n=10^5)
CellListMap (t1): 0.299085832 s
CellListMap different (t2/t1): 1.710319695116819
PointNeighbors nhs1 (t3/t1): 2.901262748547715
PointNeighbors nhs2 (t4/t1): 2.868669272170673
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 1.582251368 s
CellListMap different (t2/t1): 1.6601796706425727
PointNeighbors nhs1 (t3/t1): 4.411138771725176
PointNeighbors nhs2 (t4/t1): 4.24521378577819
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 3.4587672630000004 s
CellListMap different (t2/t1): 1.725844078859029
PointNeighbors nhs1 (t3/t1): 6.799690402586073
PointNeighbors nhs2 (t4/t1): 6.206536996473243
Test Passed

For these tests this is a clear improvement for the two-set case. Nevertheless, for smaller but relevant (maybe even more common for us) systems it becomes somewhat slower, and by constructing cell lists for both sets we use more memory, which can be a problem for very large systems. Thus, I'm still wondering if I'll merge that change as it is. But it is important to remark that this method to avoid unnecessary computations is relevant for relatively dense cells (many particles per cell), or if the computations to be performed for each pair of particles are expensive, such that avoiding any computation is important. For lower densities per cell and fast operations per pair it might be detrimental to performance.
Why do I get this on your branch?
Can you try now? I'm in the middle of updating other things and you might have reached an unusable state.
Very interesting. Even for my benchmark in the first post with much less dense cells, this is significantly faster than the original version: pretty much a factor of 2 now compared to the symmetric version.
Indeed, the strategy to avoid unnecessary computations is surprisingly effective. It is something I guess you could implement as well; it is a local computation (you just need a data structure that allows the projection and sorting of particles to be done locally).
Thanks! I will look into this.
Please update readme.md
I finally properly implemented the prototype from trixi-framework/TrixiParticles.jl#174. This doesn't have an advantage over existing implementations yet, but we can use the advanced periodicity features of CellListMap.jl in the future. For now, it's mostly useful for benchmarking against CellListMap.jl.
Here is a benchmark using the n-body benchmark on a single thread of an AMD Threadripper 3990X.

It's the usual plot, showing the runtime of a single interaction divided by the number of particles.
Lower is better, and a horizontal line means constant runtime per particle, i.e. perfect scaling with the problem size.
- The DictionaryCellList implementation and CellListMap.jl start out with equal performance, but CellListMap.jl scales better.
- FullGridCellList is now faster than CellListMap.jl.
- "CellListMap equal" uses points_equal_neighbors = true (a hypothetical construction sketch follows this list). For context, if this is set to false, a CellListPair is used, which works almost the same as the FullGridCellList (although storing coordinates together with indices, see [Proof of Concept] Store coordinates together with the points in a cell list #52). When the two coordinate matrices are identical (so we are looking for fluid neighbors of fluid particles), we can use a regular CellList, which makes use of symmetry to skip half the particle-neighbor pairs. This implementation is faster here.
- The last variant is the FullGridCellList, but modified to also use symmetry as explained above. I implemented this in [Proof of Concept] Use symmetry to speed up the main loop #85. This reaches the same performance as the CellListMap.jl implementation.
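As an illustration of the "CellListMap equal" point, a purely hypothetical construction sketch: it assumes points_equal_neighbors is a keyword of the CellListMapNeighborhoodSearch constructor added in this PR, which may not match the actual API.

using PointNeighbors

search_radius = 0.1

# Two distinct coordinate sets (e.g. fluid-boundary search): a CellListPair is used
nhs_pair = CellListMapNeighborhoodSearch(3; search_radius)

# Identical coordinate sets (fluid-fluid search): a regular CellList can exploit
# symmetry and skip half of the particle-neighbor pairs (assumed keyword below)
nhs_equal = CellListMapNeighborhoodSearch(3; search_radius, points_equal_neighbors = true)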
The same plot using 128 threads on the Threadripper:
The two CellListMap.jl implementations start quite slow because they are not using Polyester.jl. The regular CellListMap.jl implementation is also slower again for large problems for some reason.
For a fair comparison, I added "CellListMap Polyester.jl", which is the same as the "CellListMap" implementation but using Polyester.jl on the particle loop, just like we do in PointNeighbors.jl. This is implemented in m3g/CellListMap.jl#104.
This plot looks very similar to the serial plot.
Here are the same plots for a single WCSPH interaction loop:


Again, looking very similar to the N-Body benchmark, but "CellListMap equal" is a bit slower.
Here is a benchmark for the updates:

It is clear that one of our main features is: