Make resize! run faster #2828

Open · wants to merge 6 commits into master
Conversation

huiyuxie (Contributor)

I need the resize! function to run faster when it is called frequently.

I'm not sure whether 2 is a good resize factor, but it seems a reasonable choice. The benchmarks below assume the resize length is uniformly distributed within a range.

Click for benchmark script
using CUDA
using BenchmarkTools
using Random

# we assume resize length is uniformly distributed within a range

# case 1: range 1 to 100
a1 = CUDA.rand(50)
function old1(a::CuArray, rng)
    size = rand(rng, 1:100)
    CUDA.resize!(a, size)
end
function new1(a::CuArray, rng)
    size = rand(rng, 1:100)
    CUDA.new_resize!(a, size)
end
seed = 1
rng1a = MersenneTwister(seed)
ben1a = @benchmark CUDA.@sync old1(a1, rng1a)
rng1b = MersenneTwister(seed)
ben1b = @benchmark CUDA.@sync new1(a1, rng1b)

# case 2: range 1 to 1000
a2 = CUDA.rand(500)
function old2(a::CuArray, rng)
    size = rand(rng, 1:1000)
    CUDA.resize!(a, size)
end
function new2(a::CuArray, rng)
    size = rand(rng, 1:1000)
    CUDA.new_resize!(a, size)
end
seed = 12
rng2a = MersenneTwister(seed)
ben2a = @benchmark CUDA.@sync old2(a2, rng2a)
rng2b = MersenneTwister(seed)
ben2b = @benchmark CUDA.@sync new2(a2, rng2b)

# case 3: range 1 to 10000
a3 = CUDA.rand(5000)
function old3(a::CuArray, rng)
    size = rand(rng, 1:10000)
    CUDA.resize!(a, size)
end
function new3(a::CuArray, rng)
    size = rand(rng, 1:10000)
    CUDA.new_resize!(a, size)
end
seed = 123
rng3a = MersenneTwister(seed)
ben3a = @benchmark CUDA.@sync old3(a3, rng3a)
rng3b = MersenneTwister(seed)
ben3b = @benchmark CUDA.@sync new3(a3, rng3b)

# case 4: range 1 to 100000
a4 = CUDA.rand(50000)
function old4(a::CuArray, rng)
    size = rand(rng, 1:100000)
    CUDA.resize!(a, size)
end
function new4(a::CuArray, rng)
    size = rand(rng, 1:100000)
    CUDA.new_resize!(a, size)
end
seed = 1234
rng4a = MersenneTwister(seed)
ben4a = @benchmark CUDA.@sync old4(a4, rng4a)
rng4b = MersenneTwister(seed)
ben4b = @benchmark CUDA.@sync new4(a4, rng4b)

# case 5: range 1 to 1000000
a5 = CUDA.rand(500000)
function old5(a::CuArray, rng)
    size = rand(rng, 1:1000000)
    CUDA.resize!(a, size)
end
function new5(a::CuArray, rng)
    size = rand(rng, 1:1000000)
    CUDA.new_resize!(a, size)
end
seed = 12345
rng5a = MersenneTwister(seed)
ben5a = @benchmark CUDA.@sync old5(a5, rng5a)
rng5b = MersenneTwister(seed)
ben5b = @benchmark CUDA.@sync new5(a5, rng5b)
Click for benchmark results
# range 1 to 100

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  400.000 ns   2.403 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      19.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    27.475 μs ± 41.275 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂              ██▅▂▂▂▁      ▁▅▅▂▁▁    ▄▃▂▁                   ▂
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█████████▇██████████▇███████████▇▆▅▅▅▄▆▅▅▅▅▃▄ █
  400 ns        Histogram: log(frequency) by time      70.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# new
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min  max):  366.667 ns  897.633 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      17.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    17.700 μs ±  17.207 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂      █      ▆
  ▃▁▁▁▁▁▂█▃▂▂▂▂▃█▅▃▃▂▃▄█▇▄▃▃▃▃▄▄▃▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  367 ns           Histogram: frequency by time         51.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 1000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  400.000 ns   1.352 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      20.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    28.770 μs ± 43.310 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ██▆▄▄▃▂▁▁▁▂▅▄▃▂▁▁▁▄▁▁ ▁▁▁                         ▂
  ▆▁▁▁▁▁▁▁▁▁▁██████████████████████████▇▆▇▆▇▆▇▆▆▆▅▅▅▅▅▆▅▄▆▁▅▃▄ █
  400 ns        Histogram: log(frequency) by time      93.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# new
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min  max):   3.650 μs  324.167 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     19.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.643 μs ±  16.182 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▂█▃▆▁▄
  ▂▃▄▆▆███████▇█▅▄▅▃▃▂▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  3.65 μs         Histogram: frequency by time          107 μs <

 Memory estimate: 85 bytes, allocs estimate: 4.
# range 1 to 10000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  500.000 ns   1.365 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      21.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    28.338 μs ± 44.612 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                █▇▅▅▃▃▂▂▂▂▁▁▄▄▃▂▁▁  ▁▁▁ ▁                      ▂
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁███████████████████▇█████▇▇▇▆▆▆▅▆▅▃▃▆▃▅▅▁▃▅▃▄▅ █
  500 ns        Histogram: log(frequency) by time        81 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# new
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min  max):  366.667 ns  551.667 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      19.367 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    20.171 μs ±  16.419 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▃      █▁     ▆▁
  ▃▂▁▁▁▁▂█▃▂▂▂▂▃██▄▃▃▃▅██▆▄▃▃▄▆▅▄▃▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂ ▃
  367 ns           Histogram: frequency by time         54.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 100000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  16.100 μs  771.100 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     20.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.895 μs ±  31.561 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▇▅▄▄▂▂▃▄▃▃▂▁▁▁ ▁                                          ▂
  ███████████████████████▇▇▇▇▇█▇▇▇▇▇▇█▇▇▆▅▅▅▅▄▆▆▄▄▅▅▄▄▃▂▄▄▃▄▂▄ █
  16.1 μs       Histogram: log(frequency) by time       105 μs <

 Memory estimate: 512 bytes, allocs estimate: 24.
# new
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  300.000 ns  805.600 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      17.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    17.344 μs ±  28.719 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                  ▃▇▇▇▆▅▃▂▁      ▁▂▃▄▃▃▂▁                    ▂
  █▆▃▄▁▄▄▃▁▁▁▁▁▁▁▁▁▁▁████████████▇▇███████████▆▇▆▆▆▆▇███▇▇▇▇▅▅▅ █
  300 ns        Histogram: log(frequency) by time       49.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 1000000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  11.600 μs  236.500 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     19.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.556 μs ±  11.217 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▇▄  ▃▇█▇▅▃▂
  ▆█████▅▄███████▇▆▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁ ▃
  11.6 μs         Histogram: frequency by time         53.6 μs <

 Memory estimate: 512 bytes, allocs estimate: 24.
# new
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min  max):  375.000 ns   1.725 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):      15.550 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    16.869 μs ± 23.618 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▃▆▆▆▆▇▇▇██▇▆▅▅▄▄▂▁▁
  ▃▁▁▁▂▄▄▆▄▅▇▆█████████████████████▇▇▅▆▅▅▄▄▄▄▄▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂ ▅
  375 ns          Histogram: frequency by time         40.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

But it is still not as fast as the CPU resize! function.

Do you have any other suggestions to make it run faster (e.g., a different resize factor)? And should we benchmark under other assumptions (e.g., the resize length growing or shrinking linearly over time)? I suspect performance will be poor in some corner cases (e.g., when we repeatedly expand the GPU array by less than a factor of two).
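To illustrate why the growth factor matters for that corner case, here is a small CPU-only sketch (my own illustration, not part of the PR) counting reallocations when a buffer grows one element at a time, under an exact-fit policy versus a doubling policy:

```julia
# Hypothetical helper: count reallocations while growing a buffer one
# element at a time up to N elements.
function count_reallocs(N; double::Bool)
    cap = 1
    reallocs = 0
    for n in 2:N
        if n > cap
            # exact fit reallocates on every growth step;
            # doubling reallocates only O(log N) times
            cap = double ? max(n, 2 * cap) : n
            reallocs += 1
        end
    end
    return reallocs
end

count_reallocs(10^6, double = false)  # one reallocation per step
count_reallocs(10^6, double = true)   # about 20 (≈ log2(10^6))
```

With exact fit, the total copy cost grows quadratically in N, while doubling keeps it linear; growing repeatedly by less than the spare capacity is exactly the pattern a growth factor is meant to amortize.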

I have temporarily separated the new resize! from the old resize! for easier comparison.


github-actions bot commented Jul 30, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Click here to view the suggested changes.
diff --git a/src/array.jl b/src/array.jl
index 64a568504..d00c1a75b 100644
--- a/src/array.jl
+++ b/src/array.jl
@@ -888,41 +888,41 @@ with undefined values.
 function Base.resize!(A::CuVector{T}, n::Integer) where T
   n == length(A) && return A
 
-  cap = A.maxsize ÷ aligned_sizeof(T)
-
-  # do nothing when new length is smaller than maxsize
-  if n > cap # n > length(A)
-    
-    # if maxsize is larger than 10 MiB
-    if A.maxsize > RESIZE_THRESHOLD
-      len = max(cap + RESIZE_INCREMENT ÷ aligned_sizeof(T), n) # add at least 1 MiB
-    else 
-      len = max(n, 2 * length(A))
-    end
+    cap = A.maxsize ÷ aligned_sizeof(T)
 
-    maxsize = len * aligned_sizeof(T)
-    bufsize = if isbitstype(T)
-        maxsize
-    else
-      # type tag array past the data
-      maxsize + len
-    end
+    # do nothing when new length is smaller than maxsize
+    if n > cap # n > length(A)
 
-    new_data = context!(context(A)) do
-      mem = pool_alloc(memory_type(A), bufsize)
-      ptr = convert(CuPtr{T}, mem)
-      m = min(length(A), n)
-      if m > 0
-        GC.@preserve A unsafe_copyto!(ptr, pointer(A), m)
-      end
-      DataRef(pool_free, mem)
+        # if maxsize is larger than 10 MiB
+        if A.maxsize > RESIZE_THRESHOLD
+            len = max(cap + RESIZE_INCREMENT ÷ aligned_sizeof(T), n) # add at least 1 MiB
+        else
+            len = max(n, 2 * length(A))
+        end
+
+        maxsize = len * aligned_sizeof(T)
+        bufsize = if isbitstype(T)
+            maxsize
+        else
+            # type tag array past the data
+            maxsize + len
+        end
+
+        new_data = context!(context(A)) do
+            mem = pool_alloc(memory_type(A), bufsize)
+            ptr = convert(CuPtr{T}, mem)
+            m = min(length(A), n)
+            if m > 0
+                GC.@preserve A unsafe_copyto!(ptr, pointer(A), m)
+            end
+            DataRef(pool_free, mem)
+    end
+        unsafe_free!(A)
+        A.data = new_data
+        A.maxsize = maxsize
+        A.offset = 0
     end
-    unsafe_free!(A)
-    A.data = new_data
-    A.maxsize = maxsize
-    A.offset = 0
-  end
 
-  A.dims = (n,)
+    A.dims = (n,)
   A
 end
diff --git a/test/base/array.jl b/test/base/array.jl
index a0573c73f..b89c88cb0 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -550,68 +550,68 @@ end
 end
 
 @testset "resizing" begin
-  # 1) small arrays (<=10 MiB): should still use doubling policy
-  a = CuArray([1, 2, 3])
-
-  # reallocation (add less than half)
-  CUDA.resize!(a, 4)
-  @test length(a) == 4
-  @test Array(a)[1:3] == [1, 2, 3]
-  @test a.maxsize == max(4, 2*3) * sizeof(eltype(a))
-
-  # no reallocation 
-  CUDA.resize!(a, 5)
-  @test length(a) == 5
-  @test Array(a)[1:3] == [1, 2, 3]
-  @test a.maxsize == 6 * sizeof(eltype(a))
-
-  # reallocation (add more than half)
-  CUDA.resize!(a, 12)
-  @test length(a) == 12
-  @test Array(a)[1:3] == [1, 2, 3]
-  @test a.maxsize == max(12, 2*5) * sizeof(eltype(a))
-
-  # 2) large arrays (>10 MiB): should use 1 MiB increments
-  b = CUDA.fill(1, 2*1024^2)
-  maxsize = b.maxsize
-
-  # should bump by exactly 1 MiB
-  CUDA.resize!(b, 2*1024^2 + 1)
-  @test length(b) == 2*1024^2 + 1
-  @test b.maxsize == maxsize + CUDA.RESIZE_INCREMENT
-  @test all(Array(b)[1:2*1024^2] .== 1)
-
-  b = CUDA.fill(1, 2*1024^2)
-  maxsize = b.maxsize
-
-  # should bump greater than 1 MiB
-  new = CUDA.RESIZE_INCREMENT ÷ sizeof(eltype(b))  
-  CUDA.resize!(b, 2*1024^2 + new + 1)
-  @test length(b) == 2*1024^2 + new + 1
-  @test b.maxsize > maxsize + CUDA.RESIZE_INCREMENT
-  @test all(Array(b)[1:2*1024^2] .== 1)
-
-  b = CUDA.fill(1, 2*1024^2)
-  maxsize = b.maxsize
-
-  # no reallocation
-  CUDA.resize!(b, 2*1024^2 - 1)
-  @test length(b) == 2*1024^2 - 1
-  @test b.maxsize == maxsize
-  @test all(Array(b)[1:2*1024^2 - 1] .== 1)
-
-  # 3) corner cases
-  c = CuArray{Int}(undef, 0)
-  @test length(c) == 0
-  CUDA.resize!(c, 1)
-  @test length(c) == 1
-  @test c.maxsize == 1 * sizeof(eltype(c))
-
-  c = CuArray{Int}(undef, 1)
-  @test length(c) == 1
-  CUDA.resize!(c, 0)
-  @test length(c) == 0
-  @test c.maxsize == 1 * sizeof(eltype(c))
+    # 1) small arrays (<=10 MiB): should still use doubling policy
+    a = CuArray([1, 2, 3])
+
+    # reallocation (add less than half)
+    CUDA.resize!(a, 4)
+    @test length(a) == 4
+    @test Array(a)[1:3] == [1, 2, 3]
+    @test a.maxsize == max(4, 2 * 3) * sizeof(eltype(a))
+
+    # no reallocation
+    CUDA.resize!(a, 5)
+    @test length(a) == 5
+    @test Array(a)[1:3] == [1, 2, 3]
+    @test a.maxsize == 6 * sizeof(eltype(a))
+
+    # reallocation (add more than half)
+    CUDA.resize!(a, 12)
+    @test length(a) == 12
+    @test Array(a)[1:3] == [1, 2, 3]
+    @test a.maxsize == max(12, 2 * 5) * sizeof(eltype(a))
+
+    # 2) large arrays (>10 MiB): should use 1 MiB increments
+    b = CUDA.fill(1, 2 * 1024^2)
+    maxsize = b.maxsize
+
+    # should bump by exactly 1 MiB
+    CUDA.resize!(b, 2 * 1024^2 + 1)
+    @test length(b) == 2 * 1024^2 + 1
+    @test b.maxsize == maxsize + CUDA.RESIZE_INCREMENT
+    @test all(Array(b)[1:(2 * 1024^2)] .== 1)
+
+    b = CUDA.fill(1, 2 * 1024^2)
+    maxsize = b.maxsize
+
+    # should bump greater than 1 MiB
+    new = CUDA.RESIZE_INCREMENT ÷ sizeof(eltype(b))
+    CUDA.resize!(b, 2 * 1024^2 + new + 1)
+    @test length(b) == 2 * 1024^2 + new + 1
+    @test b.maxsize > maxsize + CUDA.RESIZE_INCREMENT
+    @test all(Array(b)[1:(2 * 1024^2)] .== 1)
+
+    b = CUDA.fill(1, 2 * 1024^2)
+    maxsize = b.maxsize
+
+    # no reallocation
+    CUDA.resize!(b, 2 * 1024^2 - 1)
+    @test length(b) == 2 * 1024^2 - 1
+    @test b.maxsize == maxsize
+    @test all(Array(b)[1:(2 * 1024^2 - 1)] .== 1)
+
+    # 3) corner cases
+    c = CuArray{Int}(undef, 0)
+    @test length(c) == 0
+    CUDA.resize!(c, 1)
+    @test length(c) == 1
+    @test c.maxsize == 1 * sizeof(eltype(c))
+
+    c = CuArray{Int}(undef, 1)
+    @test length(c) == 1
+    CUDA.resize!(c, 0)
+    @test length(c) == 0
+    @test c.maxsize == 1 * sizeof(eltype(c))
 end
 
 @testset "aliasing" begin

@huiyuxie (Contributor Author)

@maleadt Please review. Thanks!

@maleadt (Member) left a comment

Can you modify resize! instead of adding a new_resize!?

The change should also make use of the maxsize property of CuArray, which is there to allow additional data to be allocated without the dimensions of the array having to match. resize should be made aware of that, not doing anything when the needed size is smaller than maxsize.

In addition, IIUC you're using a growth factor of 2 now (ignoring resizes when shrinking by up to a half, resizing by 2 when requesting a larger array), which I think may be too aggressive for GPU arrays. For small arrays it's probably fine, but at some point (> 10MB?) we should probably use a fixed (1MB?) increment instead.

In addition,
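The hybrid policy sketched in this review can be written out roughly as follows (a minimal CPU-only sketch; the `THRESHOLD`/`INCREMENT` constants and the `grow_capacity` name are illustrative, while the PR itself introduces `RESIZE_THRESHOLD` and `RESIZE_INCREMENT`):

```julia
const THRESHOLD = 10 * 1024^2  # 10 MiB: where doubling stops (assumed value)
const INCREMENT = 1024^2       # 1 MiB: fixed growth step for large buffers

# New capacity in elements for a vector of `len` elements backed by
# `maxsize` bytes of storage, resized to `n` elements of `elsize` bytes each.
function grow_capacity(len, maxsize, n, elsize)
    cap = maxsize ÷ elsize
    n <= cap && return cap                # fits in the slack: no reallocation
    if maxsize > THRESHOLD
        max(cap + INCREMENT ÷ elsize, n)  # large buffer: add at least 1 MiB
    else
        max(n, 2 * len)                   # small buffer: double
    end
end
```

Note that shrinking always falls into the `n <= cap` branch, so only the logical dims change and the existing buffer is kept.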

@huiyuxie (Contributor Author)

huiyuxie commented Aug 5, 2025

The change should also make use of the maxsize property of CuArray, which is there to allow additional data to be allocated without the dimensions of the array having to match.

👍

which I think may be too aggressive for GPU arrays. For small arrays it's probably fine, but at some point (> 10MB?) we should probably use a fixed (1MB?) increment instead.

👍 but why do we choose 10MB and 1MB?

In addition,

Do you have any other comments?
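One way to motivate the 10 MB / 1 MB choice (my own back-of-the-envelope, not from the maintainers): doubling can strand unused memory proportional to the array itself, while a fixed step caps the waste at the step size.

```julia
# Worst-case allocated-but-unused bytes right after a reallocation
# triggered by growing the buffer by a single element (illustrative).
worst_slack_doubling(nbytes) = nbytes   # capacity jumps to ~2x the old size
worst_slack_increment(step) = step      # capacity grows by a fixed step

worst_slack_doubling(1024^3)    # doubling a 1 GiB buffer can strand ~1 GiB
worst_slack_increment(1024^2)   # a 1 MiB step strands at most ~1 MiB
```

A crossover around 10 MiB keeps doubling's amortized-O(1) behaviour where the waste is cheap, and bounds the waste once buffers are large enough that GPU memory pressure matters; the exact numbers are tunable.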

cap = A.maxsize ÷ aligned_sizeof(T)

# do nothing when new length is smaller than maxsize
if n > cap # n > length(A)
@huiyuxie (Contributor Author)

resize should be made aware of that, not doing anything when the needed size is smaller than maxsize

Is that really ok if we never shrink the GPU arrays?

@huiyuxie huiyuxie requested a review from maleadt August 5, 2025 18:53

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.71%. Comparing base (205c238) to head (81b67a4).

Files with missing lines Patch % Lines
src/array.jl 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2828      +/-   ##
==========================================
- Coverage   89.78%   89.71%   -0.08%     
==========================================
  Files         150      150              
  Lines       13229    13234       +5     
==========================================
- Hits        11878    11873       -5     
- Misses       1351     1361      +10     


@huiyuxie (Contributor Author)

huiyuxie commented Aug 6, 2025

@maleadt Can you point me to the existing CI benchmark results? I could not find them. Thanks!

maxsize
else
# type tag array past the data
maxsize + len
@huiyuxie (Contributor Author) commented Aug 6, 2025

Do we have to add test coverage for this line, given the Codecov report (https://app.codecov.io/gh/JuliaGPU/CUDA.jl/pull/2828?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaGPU)?
But I don't think the original tests covered this line either.

@huiyuxie (Contributor Author)

@maleadt Could you please review when you get time? It would be great to get this merged soon.
