Throughput Failed On Multiple GPUs #287

@westfly

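I ran the throughput.cu example on a machine with 4 GPUs and it failed. In the output below, the runs on devices 0, 1, and 2 fail in cudaMemsetAsync with cudaErrorInvalidValue, while the run on device 3 passes: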

Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [5/8] throughput_bench [Device=0]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [6/8] throughput_bench [Device=1]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [7/8] throughput_bench [Device=2]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [8/8] throughput_bench [Device=3]
Pass: Cold: 0.007061ms GPU, 0.016156ms CPU, 0.50s total GPU, 6.81s total wall, 70816x
Pass: Batch: 0.002299ms GPU, 0.50s total GPU, 0.50s to

I noticed that examples/stream.cu sets an explicit CUDA stream on the benchmark state with set_cuda_stream:

  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

so I added the same call to throughput.cu, and with that change the benchmark works fine on all four devices:

# Log



Run:  [1/4] throughput_bench [Device=0]
Pass: Cold: 0.663276ms GPU, 0.672594ms CPU, 0.51s total GPU, 0.54s total wall, 768x
Pass: Batch: 0.659212ms GPU, 0.53s total GPU, 0.53s total wall, 800x
Run:  [2/4] throughput_bench [Device=1]
Pass: Cold: 0.665058ms GPU, 0.674441ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660540ms GPU, 0.54s total GPU, 0.54s total wall, 815x
Run:  [3/4] throughput_bench [Device=2]
Pass: Cold: 0.664827ms GPU, 0.674139ms CPU, 0.51s total GPU, 0.55s total wall, 768x
Pass: Batch: 0.660413ms GPU, 0.53s total GPU, 0.53s total wall, 809x
Run:  [4/4] throughput_bench [Device=3]
Pass: Cold: 0.665416ms GPU, 0.674786ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660745ms GPU, 0.53s total GPU, 0.53s total wall, 807x
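
For reference, here is a minimal sketch of what the modified throughput.cu might look like. It is not the exact file from the repository: the `copy_kernel` below is a hypothetical stand-in for the kernel the bundled example uses, the buffer size and launch configuration are assumptions, and the sketch assumes the target is linked against NVBench's main (as the bundled examples are). The only part taken from the report above is the `set_cuda_stream` call borrowed from examples/stream.cu.

```cuda
#include <nvbench/nvbench.cuh>

#include <thrust/device_vector.h>

#include <cuda_runtime.h>

#include <cstddef>

// Hypothetical stand-in for the copy kernel used by the bundled example:
// a grid-stride loop that copies n int32 values from `in` to `out`.
__global__ void copy_kernel(const nvbench::int32_t *in,
                            nvbench::int32_t *out,
                            std::size_t n)
{
  const std::size_t start  = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (std::size_t i = start; i < n; i += stride)
  {
    out[i] = in[i];
  }
}

void throughput_bench(nvbench::state &state)
{
  // Assumed problem size: two 64 MiB buffers of int32 values.
  const std::size_t num_values = 64 * 1024 * 1024 / sizeof(nvbench::int32_t);
  thrust::device_vector<nvbench::int32_t> input(num_values);
  thrust::device_vector<nvbench::int32_t> output(num_values);

  // The workaround from this report: use the CUDA default stream as the
  // benchmark's target stream. The view is non-owning.
  cudaStream_t default_stream = 0;
  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

  // Throughput metadata so NVBench can report elements/s and bandwidth.
  state.add_element_count(num_values, "NumElements");
  state.add_global_memory_reads<nvbench::int32_t>(num_values, "DataSize");
  state.add_global_memory_writes<nvbench::int32_t>(num_values);

  state.exec([&input, &output, num_values](nvbench::launch &launch) {
    // launch.get_stream() now refers to the stream set above.
    copy_kernel<<<256, 256, 0, launch.get_stream()>>>(
      thrust::raw_pointer_cast(input.data()),
      thrust::raw_pointer_cast(output.data()),
      num_values);
  });
}
NVBENCH_BENCH(throughput_bench);
```

This only shows where the call goes. It does not explain why the default-stream workaround avoids the cudaErrorInvalidValue from l2flush.cuh on the other devices, which is still the open question here.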
