Skip to content

Commit 3b7d71e

Browse files
add vector size concept to stream benchmarks
Signed-off-by: Michal Mrozek <[email protected]>
1 parent 112d868 commit 3b7d71e

File tree

6 files changed

+44
-50
lines changed

6 files changed

+44
-50
lines changed

TESTS.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -137,7 +137,7 @@ RemoteAccessMemoryMaxSaturation|Uses stream memory write to measure max data bus
137137
SLM_DataAccessLatency|generates SLM local memory transactions inside thread group to measure latency between reads (uses Intel only private intel_get_cycle_counter() )|<ul><li>--direction write or read mode (0 or 1)</li><li>--occupancyDiv H/W load divider by 8, 4, 2, full occupancy</li><li>--size SLM Size</li></ul>|:x:|:heavy_check_mark:|
138138
SlmSwitchLatency|Enqueues 2 kernels with different SLM size. Measures switch time between these kernels.|<ul><li>--firstSlmSize Size of the shared local memory per thread group. First kernel.</li><li>--secondSlmSize Size of the shared local memory per thread group. Second kernel.</li><li>--wgs Size of the work group.</li></ul>|:heavy_check_mark:|:x:|
139139
StreamAfterTransfer|Goal of this test is to measure how stream kernels perform right after host to device transfer populating the data. Test does clean caches, then emits transfers and then follows with stream kernel and measures GPU execution time of it.|<ul><li>--size Size of the memory to stream. Must be divisible by datatype size.</li><li>--type Memory streaming type (Read or Write or Scale or Triad)</li><li>--useEvents Perform GPU-side measurements using events (0 or 1)</li></ul>|:x:|:heavy_check_mark:|
140-
StreamMemory|Streams memory inside of kernel in a fashion described by 'type'. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write memory is only read or written to.|<ul><li>--contents Buffer contents zeros/random (Zeros or Random)</li><li>--memoryPlacement Memory type used for stream (Device or Host or Shared or non-USM-mapped or non-USMmisaligned or non-USM4KBAligned or non-USM2MBAligned or non-USMmisaligned-imported or non-USM4KBAligned-imported or non-USM2MBAligned-imported)</li><li>--multiplier multiplies id used for accessing the resources to simulate partials</li><li>--size Size of the memory to stream. Must be divisible by datatype size.</li><li>--type Memory streaming type (Read or Write or Scale or Triad)</li><li>--useEvents Perform GPU-side measurements using events (0 or 1)</li></ul>|:heavy_check_mark:|:heavy_check_mark:|
140+
StreamMemory|Streams memory inside of kernel in a fashion described by 'type'. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write memory is only read or written to.|<ul><li>--contents Buffer contents zeros/random (Zeros or Random)</li><li>--memoryPlacement Memory type used for stream (Device or Host or Shared or non-USM-mapped or non-USMmisaligned or non-USM4KBAligned or non-USM2MBAligned or non-USMmisaligned-imported or non-USM4KBAligned-imported or non-USM2MBAligned-imported)</li><li>--multiplier multiplies id used for accessing the resources to simulate partials</li><li>--size Size of the memory to stream. Must be divisible by datatype size.</li><li>--type Memory streaming type (Read or Write or Scale or Triad)</li><li>--useEvents Perform GPU-side measurements using events (0 or 1)</li><li>--vectorSize size of uint vector type 1/2/4/8/16</li></ul>|:heavy_check_mark:|:heavy_check_mark:|
141141
StreamMemoryImmediate|Streams memory inside of kernel in a fashion described by 'type' using immediate command list. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write memory is only read or written to.|<ul><li>--size Size of the memory to stream. Must be divisible by datatype size.</li><li>--type Memory streaming type (Read or Write or Scale or Triad)</li><li>--useEvents Perform GPU-side measurements using events (0 or 1)</li></ul>|:heavy_check_mark:|:x:|
142142
UnmapBuffer|allocates an OpenCL buffer and measures unmap bandwidth. Unmapping operation meansmemory transfer from CPU to GPU or a no-op, depending on map flags.|<ul><li>--compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)</li><li>--contents Contents of the buffer (Zeros or Random)</li><li>--mapFlags OpenCL map flags passed during memory mapping (Read or Write or WriteInvalidate)</li><li>--size Size of the buffer</li><li>--useEvents Perform GPU-side measurements using events (0 or 1)</li></ul>|:x:|:heavy_check_mark:|
143143
UsmConcurrentCopy|allocates four unified shared memory buffers, 2 in device memory and 2 in host memory. Measures concurrent copy bandwidth between them.|<ul><li>--d2hEngine Engine used for device to host copy (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)</li><li>--h2dEngine Engine used for host to device copy (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)</li><li>--size Size of the buffer</li></ul>|:heavy_check_mark:|:x:|

source/benchmarks/memory_benchmark/definitions/stream_memory.h

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
/*
2-
* Copyright (C) 2022-2024 Intel Corporation
2+
* Copyright (C) 2022-2025 Intel Corporation
33
*
44
* SPDX-License-Identifier: MIT
55
*
@@ -20,14 +20,16 @@ struct StreamMemoryArguments : TestCaseArgumentContainer {
2020
BufferContentsArgument contents;
2121
UsmMemoryPlacementArgument memoryPlacement;
2222
PositiveIntegerArgument partialMultiplier;
23+
PositiveIntegerArgument vectorSize;
2324

2425
StreamMemoryArguments()
2526
: type(*this, "type", "Memory streaming type"),
2627
size(*this, "size", "Size of the memory to stream. Must be divisible by datatype size."),
2728
useEvents(*this, "useEvents", CommonHelpMessage::useEvents()),
2829
contents(*this, "contents", "Buffer contents zeros/random"),
2930
memoryPlacement(*this, "memoryPlacement", "Memory type used for stream"),
30-
partialMultiplier(*this, "multiplier", "multiplies id used for accessing the resources to simulate partials") {}
31+
partialMultiplier(*this, "multiplier", "multiplies id used for accessing the resources to simulate partials"),
32+
vectorSize(*this, "vectorSize", "size of uint vector type 1/2/4/8/16") {}
3133
};
3234

3335
struct StreamMemory : TestCase<StreamMemoryArguments> {

source/benchmarks/memory_benchmark/gtest/stream_memory.cpp

Lines changed: 22 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515

1616
[[maybe_unused]] static const inline RegisterTestCase<StreamMemory> registerTestCase{};
1717

18-
class StreamMemoryTest : public ::testing::TestWithParam<std::tuple<Api, StreamMemoryType, size_t, bool, BufferContents, UsmMemoryPlacement, size_t>> {
18+
class StreamMemoryTest : public ::testing::TestWithParam<std::tuple<Api, StreamMemoryType, size_t, bool, BufferContents, UsmMemoryPlacement, size_t, size_t>> {
1919
};
2020

2121
TEST_P(StreamMemoryTest, Test) {
@@ -27,6 +27,7 @@ TEST_P(StreamMemoryTest, Test) {
2727
args.contents = std::get<4>(GetParam());
2828
args.memoryPlacement = std::get<5>(GetParam());
2929
args.partialMultiplier = std::get<6>(GetParam());
30+
args.vectorSize = std::get<7>(GetParam());
3031

3132
StreamMemory test;
3233
test.run(args);
@@ -43,29 +44,30 @@ INSTANTIATE_TEST_SUITE_P(
4344
::testing::Values(false, true),
4445
::testing::Values(BufferContents::Zeros, BufferContents::Random),
4546
::testing::ValuesIn(UsmMemoryPlacementArgument::deviceAndHost),
46-
::testing::Values(1u)));
47+
::testing::Values(1u),
48+
::testing::Values(1, 2, 4)));
4749

4850
INSTANTIATE_TEST_SUITE_P(
4951
StreamMemoryTestLIMITED,
5052
StreamMemoryTest,
5153
::testing::ValuesIn([] {
52-
std::vector<std::tuple<Api, StreamMemoryType, size_t, bool, BufferContents, UsmMemoryPlacement, size_t>> testCases;
53-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 1 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u);
54-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u);
55-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u);
56-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u);
57-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u);
58-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u);
59-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u);
60-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u);
61-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u);
62-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u);
63-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u);
64-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u);
65-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u);
66-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u);
67-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u);
68-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u);
69-
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u);
54+
std::vector<std::tuple<Api, StreamMemoryType, size_t, bool, BufferContents, UsmMemoryPlacement, size_t, size_t>> testCases;
55+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 1 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u, 1u);
56+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u, 1u);
57+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u, 1u);
58+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u, 1u);
59+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Read, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u, 1u);
60+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u, 1u);
61+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u, 1u);
62+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u, 1u);
63+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Scale, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u, 1u);
64+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u, 1u);
65+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u, 1u);
66+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u, 1u);
67+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Triad, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u, 1u);
68+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Device, 1u, 1u);
69+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Random, UsmMemoryPlacement::Host, 1u, 1u);
70+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Device, 1u, 1u);
71+
testCases.emplace_back(Api::OpenCL, StreamMemoryType::Write, 512 * megaByte, true, BufferContents::Zeros, UsmMemoryPlacement::Host, 1u, 1u);
7072
return testCases;
7173
}()));

source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
/*
2-
* Copyright (C) 2022-2024 Intel Corporation
2+
* Copyright (C) 2022-2025 Intel Corporation
33
*
44
* SPDX-License-Identifier: MIT
55
*
@@ -25,6 +25,9 @@ static TestResult run(const StreamMemoryArguments &arguments, Statistics &statis
2525
if (arguments.partialMultiplier > 1u) {
2626
return TestResult::NoImplementation;
2727
}
28+
if (arguments.vectorSize > 1u) {
29+
return TestResult::NoImplementation;
30+
}
2831

2932
if (isNoopRun()) {
3033
statistics.pushUnitAndType(typeSelector.getUnit(), typeSelector.getType());

source/benchmarks/memory_benchmark/implementations/ocl/stream_memory_ocl.cpp

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
/*
2-
* Copyright (C) 2022-2024 Intel Corporation
2+
* Copyright (C) 2022-2025 Intel Corporation
33
*
44
* SPDX-License-Identifier: MIT
55
*
@@ -42,10 +42,9 @@ static TestResult run(const StreamMemoryArguments &arguments, Statistics &statis
4242
QueueProperties queueProperties = QueueProperties::create().setProfiling(true).setOoq(0);
4343
Opencl opencl(queueProperties);
4444
Timer timer;
45-
bool useDoubles = opencl.getExtensions().areDoublesSupported();
4645

47-
size_t elementSize = useDoubles ? 8u : 4u;
48-
const int64_t scalarValue = -999;
46+
size_t elementSize = arguments.vectorSize * sizeof(uint32_t);
47+
unsigned int scalarValue[16] = {9999999u};
4948
bool setScalarArgument = true;
5049
const bool printBuildInfo = true;
5150

@@ -94,7 +93,11 @@ static TestResult run(const StreamMemoryArguments &arguments, Statistics &statis
9493

9594
// Create kernel
9695
CompilerOptionsBuilder compilerOptions;
97-
compilerOptions.addDefinitionKeyValue("STREAM_TYPE", useDoubles ? "double" : "float");
96+
std::string streamType = "uint";
97+
if (arguments.vectorSize > 1) {
98+
streamType += std::to_string(arguments.vectorSize);
99+
}
100+
compilerOptions.addDefinitionKeyValue("STREAM_TYPE", streamType.c_str());
98101
const char *programName = "memory_benchmark_stream_memory.cl";
99102
cl_program program{};
100103
if (auto result = ProgramHelperOcl::buildProgramFromSourceFile(opencl.context, opencl.device, programName, compilerOptions.str().c_str(), program); result != TestResult::Success) {

source/benchmarks/memory_benchmark/kernels/memory_benchmark_stream_memory.cl

Lines changed: 5 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,34 +1,24 @@
11
/*
2-
* Copyright (C) 2022-2024 Intel Corporation
2+
* Copyright (C) 2022-2025 Intel Corporation
33
*
44
* SPDX-License-Identifier: MIT
55
*
66
*/
77

88
// #pragma OPENCL EXTENSION cl_khr_fp64 : enable
99

10-
__kernel void readWithMultiplier(const __global STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, STREAM_TYPE scalar, int multiplier) {
10+
__kernel void readWithMultiplier(const __global volatile STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, STREAM_TYPE scalar, int multiplier) {
1111
int i = get_global_id(0);
1212
if(multiplier > 1){
1313
i = i * multiplier;
1414
if(i >= get_global_size(0)) return;
1515
}
1616
STREAM_TYPE value = x[i];
17-
18-
// A trick to ensure compiler won't optimize away the read
19-
if (value == 0.37221) {
20-
*dummyOutput = value;
21-
}
2217
}
2318

24-
__kernel void read(const __global STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, STREAM_TYPE scalar) {
19+
__kernel void read(const __global volatile STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, STREAM_TYPE scalar) {
2520
const int i = get_global_id(0);
2621
STREAM_TYPE value = x[i];
27-
28-
// A trick to ensure compiler won't optimize away the read
29-
if (value == 0.37221) {
30-
*dummyOutput = value;
31-
}
3222
}
3323

3424
__kernel void writeWithMultiplier(__global STREAM_TYPE *restrict x, STREAM_TYPE scalar, int multiplier) {
@@ -107,7 +97,7 @@ __kernel void remote_triad(const __global STREAM_TYPE *restrict x, const __globa
10797
z[g_id] = x[g_id] + y[g_id];
10898
}
10999

110-
__kernel void remote_read(const __global STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, uint workItemGroupSize, const int remoteAccessFraction) {
100+
__kernel void remote_read(const __global volatile STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, uint workItemGroupSize, const int remoteAccessFraction) {
111101
int g_id = get_global_id(0);
112102
if (remoteAccessFraction != 0) {
113103
const size_t gws = get_global_size(0);
@@ -118,9 +108,6 @@ __kernel void remote_read(const __global STREAM_TYPE *restrict x, __global STREA
118108
}
119109

120110
STREAM_TYPE value = x[g_id];
121-
if (value == 37) {
122-
*dummyOutput = value;
123-
}
124111
}
125112

126113
#ifdef ELEMENT_SIZE
@@ -200,7 +187,7 @@ __kernel void full_remote_block_read_xe_cores_distributed(const __global STREAM_
200187
}
201188
#endif
202189

203-
__kernel void full_remote_scatter_read(const __global STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, const uint bufferLength, const uint iterations) {
190+
__kernel void full_remote_scatter_read(const __global volatile STREAM_TYPE *restrict x, __global STREAM_TYPE *restrict dummyOutput, const uint bufferLength, const uint iterations) {
204191
const uint gid = get_global_id(0);
205192
const size_t gws = get_global_size(0);
206193
// First half of workitems access memory starting from middle of the buffer
@@ -214,9 +201,6 @@ __kernel void full_remote_scatter_read(const __global STREAM_TYPE *restrict x, _
214201
for (uint i = 0; i < iterations; i++) {
215202
// Fold up calculated offset to prevent exceeding buffer length
216203
STREAM_TYPE value = x[startIndex + ((i * cachelineGap) & (bufferLength / 2 - 1))];
217-
if (value == 33) {
218-
*dummyOutput = value;
219-
}
220204
}
221205
}
222206

0 commit comments

Comments
 (0)