@jishnub jishnub commented Nov 22, 2024

In copyto!(A, B), where both arrays use linear indexing, if one or both of the arrays are contiguous linear views, we may forward the copyto! call to the parents. This ensures that we hit optimized copyto! implementations for the parent, if any. The same applies to the 5-argument version. In fact, in this PR, the 2-argument copyto! calls the 5-argument version if both arrays use linear indexing.
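
As a rough illustration of the relationship between the two forms (this is not the PR's code, and the variable names are made up for the example), the 2-argument call on plain arrays should give the same result as the 5-argument call that starts at the first index of each array and copies length(B) elements:

```julia
# Illustration only: compare copyto!(A, B) with the explicit 5-argument form.
B = collect(1.0:4.0)

A1 = zeros(6)
copyto!(A1, B)                                            # 2-argument form

A2 = zeros(6)
copyto!(A2, firstindex(A2), B, firstindex(B), length(B))  # 5-argument form

same_result = A1 == A2
```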

An example of a performance improvement:

julia> v1 = view(rand(4000, 4000), :, :);

julia> B = similar(v1);

julia> @btime copyto!($B, $v1);
  14.257 ms (0 allocations: 0 bytes) # nightly v"1.12.0-DEV.1677"
  8.596 ms (0 allocations: 0 bytes) # this PR

In this case, we dispatch to the optimized copyto!(::Array, ::Integer, ::Array, ::Integer, ::Integer) method by unwrapping the view. This method uses memmove, which is faster than the fallback implementation that uses loops.
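
A minimal sketch of what unwrapping a view can look like, restricted to the simplest case of a 1-D view of a Vector indexed by a unit range. The helper name copyto_via_parent! is hypothetical and this is not the implementation in the PR; it only uses parent, parentindices, and the existing 5-argument copyto!:

```julia
# Hypothetical helper (not part of Base): copy a contiguous 1-D view into a
# Vector by forwarding to the 5-argument copyto! on the view's parent.
function copyto_via_parent!(dest::Vector{T},
        src::SubArray{T,1,Vector{T},<:Tuple{AbstractUnitRange{Int}}}) where {T}
    n = length(src)
    n == 0 && return dest
    # The view covers parent(src)[r] for a unit range r, so the first element
    # of the view lives at index first(r) in the parent.
    offset = first(only(parentindices(src)))
    # Bounds are checked by the 5-argument method itself.
    copyto!(dest, firstindex(dest), parent(src), offset, n)
    return dest
end

p = collect(1.0:10.0)
v = view(p, 3:7)
d = zeros(5)
copyto_via_parent!(d, v)
```

The general case in the PR covers more view types and both arguments; the sketch above only shows the idea of translating a view offset into a parent index.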

@jishnub jishnub added the performance and arrays labels Nov 22, 2024
@jishnub jishnub force-pushed the jishnub/copyto_subarray_linear branch from 9bbe697 to 47bac98 Compare November 26, 2024 09:42

jishnub commented Nov 26, 2024

@LilithHafner I wonder if you might have any insight into why the additional recompilation happens, leading to the test failure? I suspect the recursive copyto! calls might hinder inference.

The failing tests appear to be

Test Failed at /cache/build/tester-amdci5-8/julialang/julia-master/julia-b418cc25b7/share/julia/test/sorting.jl:986
  Expression: 2 >= #= /cache/build/tester-amdci5-8/julialang/julia-master/julia-b418cc25b7/share/julia/test/sorting.jl:986 =# @allocations(sortperm(v, rev = true))
   Evaluated: 2 >= 5
Test Failed at /cache/build/tester-amdci5-8/julialang/julia-master/julia-b418cc25b7/share/julia/test/sorting.jl:987
  Expression: 2 >= #= /cache/build/tester-amdci5-8/julialang/julia-master/julia-b418cc25b7/share/julia/test/sorting.jl:987 =# @allocations(sortperm(v, rev = false))
   Evaluated: 2 >= 5
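
For reference, a sketch of how such an allocation count can be checked locally, assuming a Julia version that provides @allocations (1.9 or later). The function name count_sortperm_allocs is made up for this example, and the number it returns depends on the Julia build:

```julia
# Hypothetical helper: count allocations of sortperm on a small vector.
function count_sortperm_allocs(v; rev)
    sortperm(v; rev=rev)                 # warm-up call so compilation is not counted
    return @allocations sortperm(v; rev=rev)
end

allocs_rev = count_sortperm_allocs(rand(10); rev=true)
allocs_fwd = count_sortperm_allocs(rand(10); rev=false)
println("rev=true: ", allocs_rev, " allocations, rev=false: ", allocs_fwd, " allocations")
```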

@LilithHafner

Ah, I likely added those tests.

It looks like this both adds an extra layer of indirection to unwrap views, which makes sense to me and explains the performance improvement in the OP, and also changes some other things around bounds checking and LinearIndexing that I haven't grokked yet.

@LilithHafner

Something reproducible is going on here:

julia> @b rand(10) sortperm(_, rev=true)
44.313 ns (2 allocs: 144 bytes) # master
418.945 ns (5 allocs: 272 bytes) # PR

@LilithHafner

According to profiling, most of the runtime is constructing a namedtuple on line 1894:

_sortperm(A; alg, order=ord(lt, by, true, order), scratch, dims...)

I think this is due to sorting performance depending on compiler heuristics, which means you can go ahead and merge this and loosen the sorting allocation tests if you want to.

This diff should fix the sorting test:

--- a/base/sort.jl
+++ b/base/sort.jl
@@ -1891,12 +1891,12 @@ function sortperm(A::AbstractArray;
                   scratch::Union{Vector{<:Integer}, Nothing}=nothing,
                   dims...) #to optionally specify dims argument
     if rev === true
-        _sortperm(A; alg, order=ord(lt, by, true, order), scratch, dims...)
+        _sortperm(A, alg, ord(lt, by, true, order), scratch, dims)
     else
-        _sortperm(A; alg, order=ord(lt, by, nothing, order), scratch, dims...)
+        _sortperm(A, alg, ord(lt, by, nothing, order), scratch, dims)
     end
 end
-function _sortperm(A::AbstractArray; alg, order, scratch, dims...)
+function _sortperm(A::AbstractArray, alg, order, scratch, dims)
     if order === Forward && isa(A,Vector) && eltype(A)<:Integer
         n = length(A)
         if n > 1

julia> @b rand(10) sortperm(_, rev=true)
44.313 ns (2 allocs: 144 bytes) # master
418.945 ns (5 allocs: 272 bytes) # PR
61.263 ns (2 allocs: 144 bytes) # PR with the diff
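
To make the shape of that change concrete, here is a small sketch with hypothetical stand-in functions (inner_kw, outer_kw, inner_pos, outer_pos are not in Base). Both variants return the same values; the only difference is whether the inner call receives its arguments as keywords, which have to be packed into a NamedTuple, or positionally:

```julia
# Hypothetical stand-ins for sortperm/_sortperm; only the calling convention matters.

# Keyword-forwarding variant: the inner call goes through keyword arguments.
inner_kw(A; alg, order, dims...) = (alg, order, (; dims...))
outer_kw(A; alg=:default, order=:forward, dims...) = inner_kw(A; alg, order, dims...)

# Positional variant: the collected keywords are passed along as one ordinary argument.
inner_pos(A, alg, order, dims) = (alg, order, (; dims...))
outer_pos(A; alg=:default, order=:forward, dims...) = inner_pos(A, alg, order, dims)

result_kw = outer_kw([3, 1, 2]; dims=1)
result_pos = outer_pos([3, 1, 2]; dims=1)
```

Whether the keyword variant is actually slower depends on what the compiler manages to inline, which is consistent with the heuristics-dependent behaviour described above.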

A takeaway for this PR is that it increases stress on the compiler, which can cause performance regressions in certain use cases. That can be said about any change, though, so I don't think it's blocking.

jishnub commented Nov 26, 2024

Thanks for looking into this!

@jishnub jishnub requested a review from vtjnash November 26, 2024 17:51

LilithHafner commented Nov 26, 2024

Sorry, sorting is being so finicky here. #56661 should at least make the types in errors more readable.

@vtjnash vtjnash left a comment

I don't think I really have the ability to review this

@jishnub jishnub requested a review from aviatesk December 2, 2024 07:16
@jishnub jishnub force-pushed the jishnub/copyto_subarray_linear branch from 0945185 to f571b27 Compare December 22, 2024 07:46
@jishnub jishnub force-pushed the jishnub/copyto_subarray_linear branch from f571b27 to 0a8333d Compare February 15, 2025 11:29
@jishnub jishnub force-pushed the jishnub/copyto_subarray_linear branch from e02bbe3 to 0e22216 Compare February 24, 2025 06:48

jishnub commented Feb 24, 2025

Looks like removing the bounds-checking changes has resolved the failures in the sorting tests. These weren't essential anyway.

@LilithHafner

I still see the sorting regression; strange that the tests now pass:

x@x:~$ julia +nightly -q --startup=no
A new `nightly` version is available. Install with `juliaup update`.
julia> versioninfo()
Julia Version 1.13.0-DEV.55
Commit b3198c9962c (2025-02-14 11:22 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)

julia> using Chairmarks

julia> @b rand(10) sortperm(_, rev=true)
44.063 ns (2 allocs: 144 bytes)

julia> 
x@x:~$ julia +pr56657 -q --startup=no
julia> versioninfo()
Julia Version 1.13.0-DEV.110
Commit 0e22216bb69 (2025-02-24 06:48 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)

julia> using Chairmarks

julia> @b rand(10) sortperm(_, rev=true)
401.800 ns (5 allocs: 272 bytes)

@jishnub jishnub force-pushed the jishnub/copyto_subarray_linear branch from d4cc9f3 to 9017e0e Compare March 13, 2025 11:00

jishnub commented Mar 13, 2025

Oddly, I don't see the regressions in allocations (which might explain why the tests pass now), although there appears to be a performance regression.
This PR:

julia> @b rand(10) sortperm(_, rev=true)
106.121 ns (2 allocs: 144 bytes)

julia> versioninfo()
Julia Version 1.13.0-DEV.216
Commit 9017e0ec7f (2025-03-13 10:19 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.7 (ORCJIT, skylake)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = subl

vs nightly:

julia> @b rand(10) sortperm(_, rev=true)
72.554 ns (2 allocs: 144 bytes)

julia> versioninfo()
Julia Version 1.13.0-DEV.207
Commit 96eb8762cab (2025-03-12 18:04 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.7 (ORCJIT, skylake)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = subl
