Performance for stride 2 #517

@olof3

Hello,

I tried to leverage the speedups from @turbo for a case with stride-2 access of the input data. This actually degraded performance compared to @inbounds (median 110.4 μs vs. 56.8 μs in the benchmarks below), and it got even worse with @tturbo (median 209.0 μs). I realize that vectorizing non-contiguous access is much trickier, but I wasn't expecting this much of a degradation. I didn't manage to find much information on this, so perhaps it is a rare use case. Is this to be expected, or should something be done differently?

I tried to boil it down to the following MWE.

Thanks!

using LoopVectorization
using BenchmarkTools  # provides @benchmark

function test_stride2_inbounds(out, x)
    @inbounds for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_turbo(out, x)
    @turbo for k = 3:length(x)÷2
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_tturbo(out, x)
    @tturbo for k = 3:length(x)÷2
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end


function test_stride1_turbo(out, x)
    @turbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end

function test_stride1_tturbo(out, x)
    @tturbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end


x = rand(-1000:1000, 2^17)
out1, out2, out3, out4, out5 = (similar(x) for _=1:5)


println("# Threads = $(Threads.nthreads())")

println("\ntest_stride2_inbounds:")
display( @benchmark test_stride2_inbounds(out1, x) )
println("\ntest_stride2_turbo:")
display( @benchmark test_stride2_turbo(out2, x) )
println("\ntest_stride2_tturbo:")
display( @benchmark test_stride2_tturbo(out3, x) )
println("\ntest_stride1_turbo:")
display( @benchmark test_stride1_turbo(out4, x) )
println("\ntest_stride1_tturbo:")
display( @benchmark test_stride1_tturbo(out5, x) )

# Threads = 10

test_stride2_inbounds:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  53.300 μs … 867.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.665 μs ±  23.947 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃   ▁   █▄▁  ▁▁   ▃▇    ▄▇▂   ▂▁      ▄▁    ▁               ▂
  ██▇▅███▇▅███▆▇███▇▆███▅▇▆█████▅██▇█▆▆▅▅██▇▆▅▁█▆▅▄▆▅▄▆▄▄▁▄▄▁▅ █
  53.3 μs       Histogram: log(frequency) by time      74.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.900 μs …   3.360 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     110.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.626 μs ± 133.504 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █   
  ▄██▄▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▂▁▂▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▁▂ ▂
  80.9 μs          Histogram: frequency by time         1.01 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  181.400 μs …  10.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     209.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   242.770 μs ± 203.921 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅█▅▃▃▃▂▁▁▁▁                                                  ▁
  ██████████████▇▇▇█▇▇▇▇▇▆▆▆▇▅▆▅▄▆▄▆▅▆▆▄▄▃▄▅▅▄▅▅▅▆▃▅▅▅▅▅▅▅▄▄▄▅▃ █
  181 μs        Histogram: log(frequency) by time        938 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.900 μs … 956.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.754 μs ±  30.490 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅▄▂▁     ▄█▆▇▇▆▃▂▁▆▄▂▄▃▁▁                                   ▂
  ██████▇▇▆▆████████████████▇█▇█▇▇▇▆▇▅▅▅▅▃▄▂▃▃▃▄▅▄▃▄▃▄▅▃▄▃▃▃▃▂ █
  24.9 μs       Histogram: log(frequency) by time        52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.100 μs …  2.709 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.181 μs ± 51.142 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▃▄▁▁▂                                                    ▂
  ███████████▇▇▆▇▇▆▆▆▅▅▅▅▅▅▅▅▆▅▅▃▄▅▄▄▅▃▅▅▅▄▄▃▁▄▅▅▅▅▅▅▅▅▅▁▃▄▄▅ █
  12.1 μs      Histogram: log(frequency) by time       132 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
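
To make the access pattern in the stride-2 kernels above explicit: each out[k] is the sum of the five consecutive elements x[2k-5] through x[2k-1], and that window advances by two elements per iteration. A minimal plain-Julia sketch of the same computation, usable as a reference when checking the @turbo variants, could look as follows. The helper name `stride2_reference` and the small test input are assumptions for illustration and are not part of the MWE.

```julia
# Plain-Julia reference for the stride-2 kernel (hypothetical helper, not part
# of the MWE). Bounds checks are left on, so an indexing mistake throws an
# error instead of silently reading out of bounds.
function stride2_reference(out, x)
    for k in 3:length(x)÷2
        # Window of five consecutive elements; it moves by 2 when k moves by 1.
        out[k] = sum(@view x[2k-5:2k-1])
    end
    return out
end

# Small input where the sums are easy to verify by hand.
x_small = collect(1:12)
out_small = zeros(Int, length(x_small))
stride2_reference(out_small, x_small)

# k = 3 reads x[1:5], k = 4 reads x[3:7], k = 5 reads x[5:9], k = 6 reads x[7:11]
println(out_small[3:6])
```

Comparing `stride2_reference(similar(x), x)[3:length(x)÷2]` against the same range of out1, out2, and out3 would be one way to confirm that all three stride-2 variants compute the same result before comparing their timings.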
