Open
Labels
bug: Something isn't working
Description
I have noticed that calling the generic mul!(C, A, B, α, β) with ROCArrays is much slower than it should be.
The reproducer below shows that mul!(C, A, B, α, β) is badly outperformed by the equivalent broadcast C .= α * A*B + β * C, even for small matrices, and the gap widens as the matrix size increases. The reproducer uses α = 1 and β = 0, so the broadcast reduces to C .= A * B. Note that the 3-argument mul!(C, A, B) seems to behave properly.
using AMDGPU
using LinearAlgebra

dims = [i*500 for i = 1:10]
nsamples = 5

for dim in dims
    C = ROCArray(randn(Float64, dim, dim))
    A = ROCArray(randn(Float64, dim, dim))
    B = ROCArray(randn(Float64, dim, dim))

    # Warm-up
    C .= A * B
    mul!(C, A, B, 1, 0)

    # Benchmark
    tbroad = 0.0
    tmul = 0.0
    AMDGPU.synchronize()
    for _ = 1:nsamples
        tb = @timed begin
            C .= A * B
            AMDGPU.synchronize()
        end
        tm = @timed begin
            mul!(C, A, B, 1, 0)
            AMDGPU.synchronize()
        end
        tbroad += tb.time
        tmul += tm.time
    end

    println("\nTime broadcast for dim $dim: ", tbroad/nsamples)
    println("Time mul! for dim $dim: ", tmul/nsamples)
end
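For reference, the two operations timed above should produce the same result: the 5-argument mul!(C, A, B, α, β) from LinearAlgebra computes C = A*B*α + C*β in place, which reduces to C = A*B for α = 1 and β = 0. The following is a minimal sketch of that equivalence, not part of the original report. It uses plain CPU Arrays in place of ROCArrays (an assumption made so that it runs without a GPU), and the matrices and scalar values are arbitrary illustrations.

```julia
using LinearAlgebra

# Small CPU matrices; plain Arrays stand in for the ROCArrays of the reproducer.
A  = [1.0 2.0; 3.0 4.0]
B  = [0.5 0.0; 0.0 0.5]
C0 = [1.0 1.0; 1.0 1.0]

# 5-argument mul!: overwrites C with A*B*α + C*β.
α, β = 2.0, 3.0
C = copy(C0)
mul!(C, A, B, α, β)
@assert C ≈ α * (A * B) + β * C0

# With α = 1 and β = 0 (as in the reproducer), the result matches C .= A * B.
C1 = copy(C0)
mul!(C1, A, B, 1, 0)
C2 = similar(C0)
C2 .= A * B
@assert C1 ≈ C2
```

Since both forms compute the same matrix, any timing difference between them in the reproducer comes from how each call is dispatched and executed, not from the amount of arithmetic requested.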
I use ROCm 6.1. I have also tested ROCm 6.3, but the problem remains. Here is the output of AMDGPU.versioninfo():
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────
│ Available │ Name             │ Version   │ Path ⋯
├───────────┼──────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────
│ +         │ LLD              │ -         │ /opt/rocm-6.1.0/lib/llvm/bin/ld.lld ⋯
│ +         │ Device Libraries │ -         │ /capstor/scratch/cscs/abussy/dftk_beverin/.julia/artifacts/5ad5ecb46e3c334821f ⋯
│ +         │ HIP              │ 6.1.40091 │ /opt/rocm-6.1.0/lib/libamdhip64.so ⋯
│ +         │ rocBLAS          │ 4.1.0     │ /opt/rocm-6.1.0/lib/librocblas.so ⋯
│ +         │ rocSOLVER        │ 3.25.0    │ /opt/rocm-6.1.0/lib/librocsolver.so ⋯
│ +         │ rocSPARSE        │ 3.1.2     │ /opt/rocm-6.1.0/lib/librocsparse.so ⋯
│ +         │ rocRAND          │ 2.10.5    │ /opt/rocm-6.1.0/lib/librocrand.so ⋯
│ +         │ rocFFT           │ 1.0.27    │ /opt/rocm-6.1.0/lib/librocfft.so ⋯
│ +         │ MIOpen           │ 3.1.0     │ /opt/rocm-6.1.0/lib/libMIOpen.so ⋯
└───────────┴──────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────
1 column omitted
[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │ Name                │ GCN arch               │ Wavefront │ Memory     │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│ 1  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 2  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 3  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 4  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 5  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 6  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 7  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
│ 8  │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64        │ 63.984 GiB │ 64.000 KiB    │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘