5-argument mul! is slow with ROCArrays #866

@abussy

I have noticed that calling the generic mul!(C, A, B, α, β) with ROCArrays is much slower than it should be.

The reproducer below shows that mul!(C, A, B, α, β) is badly outperformed by the equivalent broadcast C .= α * A*B + β * C, even for small matrix sizes, and the gap widens as the matrices grow. Note that the 3-argument mul!(C, A, B) seems to behave properly.

using AMDGPU
using LinearAlgebra

dims = [i*500 for i = 1:10]

nsamples = 5

for dim in dims
    C = ROCArray(randn(Float64, dim, dim))
    A = ROCArray(randn(Float64, dim, dim))
    B = ROCArray(randn(Float64, dim, dim))

    # Warm-up
    C .= A * B
    mul!(C, A, B, 1, 0)

    # Benchmark
    tbroad = 0.0
    tmul = 0.0
    AMDGPU.synchronize()
    for _ = 1:nsamples
        tb = @timed begin
            C .= A * B
            AMDGPU.synchronize()
        end
        tm = @timed begin
            mul!(C, A, B, 1, 0)
            AMDGPU.synchronize()
        end
        tbroad += tb.time
        tmul += tm.time
    end
    println("\nTime broadcast for dim $dim: ", tbroad/nsamples)
    println( "Time mul! for dim $dim: ", tmul/nsamples)
end
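
For reference, LinearAlgebra documents the 5-argument mul!(C, A, B, α, β) as overwriting C with A*B*α + C*β, so with α = 1 and β = 0 (as in the reproducer) it should give the same result as C .= A * B. The following is a minimal sketch of that equivalence. It assumes plain CPU Arrays rather than ROCArrays so that it runs without a GPU, and the matrix values are arbitrary illustrative choices, not taken from the benchmark.

```julia
using LinearAlgebra

# Small CPU matrices; plain Arrays are used here (an assumption for
# illustration) so the check runs without a GPU.
A  = [1.0 2.0; 3.0 4.0]
B  = [0.0 1.0; 1.0 0.0]
C0 = [1.0 1.0; 1.0 1.0]
α, β = 2.0, 3.0

# 5-argument mul!: overwrites its first argument with A*B*α + C*β.
C_mul = copy(C0)
mul!(C_mul, A, B, α, β)

# Equivalent broadcast form; A * B allocates a temporary matrix.
C_bc = copy(C0)
C_bc .= α .* (A * B) .+ β .* C_bc

# Both paths should agree up to floating-point rounding.
println(C_mul ≈ C_bc)
```

Since both forms compute the same quantity and the broadcast version additionally allocates a temporary for A * B, one would expect the in-place mul! call to be at least as fast, which is why the timings above look like a bug.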

I use ROCm 6.1. I have also tested ROCm 6.3, but the problem remains. Here is the output of AMDGPU.versioninfo():

[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────
│ Available │ Name             │ Version   │ Path                                                                           ⋯
├───────────┼──────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────
│     +     │ LLD              │ -         │ /opt/rocm-6.1.0/lib/llvm/bin/ld.lld                                            ⋯
│     +     │ Device Libraries │ -         │ /capstor/scratch/cscs/abussy/dftk_beverin/.julia/artifacts/5ad5ecb46e3c334821f ⋯
│     +     │ HIP              │ 6.1.40091 │ /opt/rocm-6.1.0/lib/libamdhip64.so                                             ⋯
│     +     │ rocBLAS          │ 4.1.0     │ /opt/rocm-6.1.0/lib/librocblas.so                                              ⋯
│     +     │ rocSOLVER        │ 3.25.0    │ /opt/rocm-6.1.0/lib/librocsolver.so                                            ⋯
│     +     │ rocSPARSE        │ 3.1.2     │ /opt/rocm-6.1.0/lib/librocsparse.so                                            ⋯
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.1.0/lib/librocrand.so                                              ⋯
│     +     │ rocFFT           │ 1.0.27    │ /opt/rocm-6.1.0/lib/librocfft.so                                               ⋯
│     +     │ MIOpen           │ 3.1.0     │ /opt/rocm-6.1.0/lib/libMIOpen.so                                               ⋯
└───────────┴──────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────
                                                                                                             1 column omitted

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  2 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  3 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  4 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  5 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  6 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  7 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
│  8 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘

Labels: bug