Commit dc7beeb

EJainDev and CNugteren authored
Faster tuning (CNugteren#599)
* Fixed c_ld documentation.
* Ran the generator script.
* Added multithreaded tuning.
* Added a parameter to control number of threads since more threads is not always faster.
* Update the args struct and constants.
* Fixed include order for tuning.cpp
* Prevented the user from using more threads than configurations which would lead to wasted resources.
* Fixed the typo of std::max for the number of threads instead of std::min which it should have been.
* Make it clearer to understand tuning.cpp
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
* Fixed warnings of buffer overflow with GCC.
* Added support for single threading which is the default used.
* Updated changelog and added information to tuning.md
* Update CHANGELOG
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
* Improve doc/tuning.md
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
* Improve readability of src/tuning/tuning.cpp
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
* Improved naming of variables.
* Improved const correctness in src/tuning/tuning.cpp
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
---------
Co-authored-by: Cedric Nugteren <web@cedricnugteren.nl>
1 parent 0549f3b commit dc7beeb

4 files changed: +135 -38 lines changed
CHANGELOG

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ Development version (next version)
 - Fixed compatibility with CMake 4.0
 - Applied clang-format code style formatting
 - Added tuned parameters for many devices (see doc/tuning.md)
+- Enabled parallel kernel compilation for faster kernel tuning (see doc/tuning.md)
 
 Version 1.6.3
 - Fixed a bug in the GEMMK=1 kernel (with 2D register tiling) when MWG!=NWG

doc/tuning.md

Lines changed: 2 additions & 0 deletions
@@ -216,6 +216,8 @@ The kernels `gemm` and `gemm_direct` have too many parameters to explore. Theref
 
 There are also several routine-level tuners. They tune inter-kernel parameters and should only be run after the kernels are tuned. However, they do automatically pick up kernel tuning results from the current folder if there are any. An example is the GEMM routine tuner, which determines when to use the direct or the in-direct GEMM kernel.
 
+The tuners also provide a `-threads` option that controls how many threads are used for OpenCL kernel compilation (not for executing the kernels). It defaults to 1, which runs the single-threaded code path and is the safest option. More threads can be requested via this parameter: using as many threads as CPU cores is recommended for the best performance, while going beyond that may either improve or hurt it.
+
 Here are all the tuners included in the `make alltuners` target (in the same order) with all their precision arguments:
 
 ./clblast_tuner_copy_fast -precision 32
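The thread-count advice in the paragraph added to doc/tuning.md (one thread per CPU core, never more threads than configurations) could be sketched roughly as follows. This is only an illustration: `SuggestedCompileThreads` is a hypothetical helper that does not exist in CLBlast, and the tuners themselves simply take the value passed to `-threads`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper (not part of CLBlast): suggests a value for the
// `-threads` option given the number of configurations to compile.
std::size_t SuggestedCompileThreads(std::size_t num_configurations) {
  // hardware_concurrency() may return 0 when the core count is unknown,
  // so fall back to a single thread in that case.
  const std::size_t cores =
      std::max<std::size_t>(1, std::thread::hardware_concurrency());
  // Asking for more threads than configurations would only waste resources.
  const std::size_t threads =
      std::max<std::size_t>(1, std::min(cores, num_configurations));
  assert(threads >= 1 && threads <= cores);
  return threads;
}
```

The result is always at least 1, which matches the documented default of single-threaded compilation.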

src/tuning/tuning.cpp

Lines changed: 130 additions & 38 deletions
@@ -11,9 +11,11 @@
 #include "tuning/tuning.hpp"
 
 #include <algorithm>
+#include <condition_variable>
 #include <cstdio>
 #include <random>
 #include <string>
+#include <thread>
 #include <utility>
 #include <vector>
 
@@ -89,6 +91,105 @@ void print_separator(const size_t parameters_size) {
 
 // =================================================================================================
 
+struct ThreadInfo {
+  std::mutex mtx;
+  std::string print_info;
+  Kernel kernel{nullptr};
+  bool ready = false;
+  std::condition_variable cv;
+  std::vector<size_t> global;
+  std::vector<size_t> local;
+};
+
+template <typename... Args>
+inline void addPrintInfo(std::string& str, const char* format, Args&&... args) {
+  const auto size = std::snprintf(nullptr, 0, format, std::forward<Args>(args)...);
+  const auto original_size = str.size();
+  str.resize(original_size + size);
+  std::snprintf(&str[original_size], size + 1, format, std::forward<Args>(args)...);
+}
+
+template <typename T>
+void kernelCompilationThread(std::vector<ThreadInfo>& infos, const std::vector<clblast::Configuration>& configurations,
+                             size_t id, const TunerSettings& settings, const Arguments<T>& args, const Device& device,
+                             const Context& context, const size_t num_threads) {
+#if defined(_WIN32)
+  const std::string kPrintError = "";
+  const std::string kPrintSuccess = "";
+  const std::string kPrintMessage = "";
+  const std::string kPrintEnd = "";
+#else
+  const std::string kPrintError = "\x1b[31m";
+  const std::string kPrintSuccess = "\x1b[32m";
+  const std::string kPrintMessage = "\x1b[1m";
+  const std::string kPrintEnd = "\x1b[0m";
+#endif
+  for (size_t config_id = id; config_id < configurations.size(); config_id += num_threads) {
+    auto& info = infos[config_id];
+    info.mtx.lock();
+    try {
+      auto configuration = configurations[config_id];
+      addPrintInfo(info.print_info, "| %4zu | %5zu |", config_id + 1, configurations.size());
+
+      for (const auto& parameter : settings.parameters) {
+        addPrintInfo(info.print_info, "%5zu", configuration.at(parameter.first));
+      }
+      addPrintInfo(info.print_info, " |");
+
+      // Sets the OpenCL thread configuration
+      auto global =
+          SetThreadConfiguration(configuration, settings.global_size, settings.mul_global, settings.div_global);
+      auto local = SetThreadConfiguration(configuration, settings.local_size, settings.mul_local, settings.div_local);
+
+      // Make sure that the global worksize is a multiple of the local
+      for (auto i = size_t{0}; i < global.size(); ++i) {
+        while ((global[i] / local[i]) * local[i] != global[i]) {
+          global[i]++;
+        }
+      }
+      if (local.size() > 1 && global.size() > 1) {
+        addPrintInfo(info.print_info, "%8zu%8zu |%8zu%8zu |", local[0], local[1], global[0], global[1]);
+      } else {
+        addPrintInfo(info.print_info, "%8zu%8d |%8zu%8d |", local[0], 1, global[0], 1);
+      }
+      info.global = std::move(global);
+      info.local = std::move(local);
+
+      // Sets the parameters for this configuration
+      auto kernel_source = std::string{""};
+      for (const auto& parameter : configuration) {
+        kernel_source += "#define " + parameter.first + " " + ToString(parameter.second) + "\n";
+      }
+      kernel_source += settings.sources;
+
+      // Compiles the kernel
+      const auto start_time = std::chrono::steady_clock::now();
+      auto compiler_options = std::vector<std::string>();
+      const auto program = CompileFromSource(kernel_source, args.precision, settings.kernel_name, device, context,
+                                             compiler_options, 0, true);
+      info.kernel = Kernel(program, settings.kernel_name);
+      const auto elapsed_time = std::chrono::steady_clock::now() - start_time;
+      const auto timing = std::chrono::duration<double, std::milli>(elapsed_time).count();
+      addPrintInfo(info.print_info, " %sOK%s %5.0lf ms |", kPrintSuccess.c_str(), kPrintEnd.c_str(), timing);
+    } catch (CLCudaAPIBuildError&) {
+      const auto status_code = DispatchExceptionCatchAll(true);
+      addPrintInfo(info.print_info, " %scompilation error: %5d%s |", kPrintError.c_str(),
+                   static_cast<int>(status_code), kPrintEnd.c_str());
+      addPrintInfo(info.print_info, " - | - | <-- skipping\n");
+    } catch (...) {
+      const auto status_code = DispatchExceptionCatchAll(true);
+      if (status_code != StatusCode::kUnknownError) {
+        addPrintInfo(info.print_info, " %serror code %d%s |", kPrintError.c_str(), static_cast<int>(status_code),
+                     kPrintEnd.c_str());
+      }
+      addPrintInfo(info.print_info, " <-- skipping\n");
+    }
+    info.ready = true;
+    info.mtx.unlock();
+    info.cv.notify_one();
+  }
+}
+
 template <typename T>
 void Tuner(int argc, char* argv[], const int V, GetTunerDefaultsFunc GetTunerDefaults,
            GetTunerSettingsFunc<T> GetTunerSettings, TestValidArgumentsFunc<T> TestValidArguments,
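The `addPrintInfo` helper in the hunk above relies on the two-pass `std::snprintf` idiom: a first call with a null buffer only measures the formatted length, and a second call writes into the grown string. A minimal standalone sketch of that idiom follows. `AppendFormatted` is a hypothetical name, and unlike the committed code this variant reserves one extra character for the terminator and trims it afterwards, and it skips the append if `snprintf` reports an error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical standalone variant of the committed addPrintInfo helper.
template <typename... Args>
void AppendFormatted(std::string& out, const char* format, Args&&... args) {
  assert(format != nullptr);
  // First pass: with a null buffer of size 0, snprintf only reports how many
  // characters the formatted text needs (excluding the terminator).
  const int length = std::snprintf(nullptr, 0, format, std::forward<Args>(args)...);
  if (length <= 0) {
    return;  // Nothing to append, or a formatting error occurred.
  }
  const std::size_t old_size = out.size();
  const std::size_t needed = static_cast<std::size_t>(length);
  // Reserve room for the text plus the terminator that snprintf writes.
  out.resize(old_size + needed + 1);
  // Second pass: format directly into the tail of the string.
  std::snprintf(&out[old_size], needed + 1, format, std::forward<Args>(args)...);
  // Drop the terminator again so the string holds only the formatted text.
  out.resize(old_size + needed);
}
```

Building each row into a string instead of calling `printf` directly is what lets the worker threads prepare their output without interleaving it on the console; the main loop prints each row once the corresponding configuration is ready.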
@@ -119,6 +220,7 @@ void Tuner(int argc, char* argv[], const int V, GetTunerDefaultsFunc GetTunerDef
   args.device_id =
       GetArgument(command_line_args, help, kArgDevice, ConvertArgument(std::getenv("CLBLAST_DEVICE"), size_t{0}));
   args.precision = GetArgument(command_line_args, help, kArgPrecision, Precision::kSingle);
+  args.extra_threads = GetArgument(command_line_args, help, kArgNumThreads, size_t{1}) - 1;
   for (auto& o : defaults.options) {
     if (o == kArgM) {
       args.m = GetArgument(command_line_args, help, kArgM, defaults.default_m);
@@ -289,56 +391,41 @@ void Tuner(int argc, char* argv[], const int V, GetTunerDefaultsFunc GetTunerDef
   }
   print_separator(settings.parameters.size());
 
+  // Perform the OpenCL kernel compilation in parallel
+  std::vector<ThreadInfo> thread_infos(configurations.size());
+  std::vector<std::thread> threads;
+  threads.reserve(args.extra_threads);
+  for (size_t i = 0; i < std::min(size_t{args.extra_threads}, configurations.size()); ++i) {
+    threads.push_back(std::thread(&kernelCompilationThread<T>, std::ref(thread_infos), std::cref(configurations), i,
+                                  std::cref(settings), std::cref(args), std::cref(device), std::cref(context),
+                                  args.extra_threads));
+  }
+
   // Starts the tuning process
   auto results = std::vector<TuningResult>();
   for (auto config_id = size_t{0}; config_id < configurations.size(); ++config_id) {
     try {
-      auto configuration = configurations[config_id];
-      printf("| %4zu | %5zu |", config_id + 1, configurations.size());
-      for (const auto& parameter : settings.parameters) {
-        printf("%5zu", configuration.at(parameter.first));
-      }
-      printf(" |");
-
       // Sets the input
       for (const auto id : settings.inputs) {
         device_buffers[id].Write(queue, buffer_sizes[id], source_buffers[id]);
       }
 
-      // Sets the thread configuration
-      auto global =
-          SetThreadConfiguration(configuration, settings.global_size, settings.mul_global, settings.div_global);
-      auto local = SetThreadConfiguration(configuration, settings.local_size, settings.mul_local, settings.div_local);
-
-      // Make sure that the global worksize is a multiple of the local
-      for (auto i = size_t{0}; i < global.size(); ++i) {
-        while ((global[i] / local[i]) * local[i] != global[i]) {
-          global[i]++;
+      Kernel kernel{nullptr};
+      std::vector<size_t> global;
+      std::vector<size_t> local;
+      {
+        if (args.extra_threads == 0) {
+          kernelCompilationThread<T>(thread_infos, configurations, config_id, settings, args, device, context,
+                                     configurations.size());
         }
-      }
-      if (local.size() > 1 && global.size() > 1) {
-        printf("%8zu%8zu |%8zu%8zu |", local[0], local[1], global[0], global[1]);
-      } else {
-        printf("%8zu%8d |%8zu%8d |", local[0], 1, global[0], 1);
+        std::unique_lock<std::mutex> lock(thread_infos[config_id].mtx);
+        thread_infos[config_id].cv.wait(lock, [&] { return thread_infos[config_id].ready; });
+        kernel = std::move(thread_infos[config_id].kernel);
+        global = std::move(thread_infos[config_id].global);
+        local = std::move(thread_infos[config_id].local);
+        printf("%s", thread_infos[config_id].print_info.c_str());
       }
 
-      // Sets the parameters for this configuration
-      auto kernel_source = std::string{""};
-      for (const auto& parameter : configuration) {
-        kernel_source += "#define " + parameter.first + " " + ToString(parameter.second) + "\n";
-      }
-      kernel_source += settings.sources;
-
-      // Compiles the kernel
-      const auto start_time = std::chrono::steady_clock::now();
-      auto compiler_options = std::vector<std::string>();
-      const auto program = CompileFromSource(kernel_source, args.precision, settings.kernel_name, device, context,
-                                             compiler_options, 0, true);
-      auto kernel = Kernel(program, settings.kernel_name);
-      const auto elapsed_time = std::chrono::steady_clock::now() - start_time;
-      const auto timing = std::chrono::duration<double, std::milli>(elapsed_time).count();
-      printf(" %sOK%s %5.0lf ms |", kPrintSuccess.c_str(), kPrintEnd.c_str(), timing);
-
       // Runs the kernel
       SetArguments(V, kernel, args, device_buffers);
       const auto time_ms = TimeKernel(args.num_runs, kernel, queue, device, global, local);
@@ -368,6 +455,7 @@ void Tuner(int argc, char* argv[], const int V, GetTunerDefaultsFunc GetTunerDef
       }
 
       // All was OK
+      auto& configuration = configurations[config_id];
       configuration["PRECISION"] = static_cast<size_t>(args.precision);
       results.push_back(TuningResult{settings.kernel_name, time_ms, configuration});
       printf(" %6.1lf |", settings.metric_amount / (time_ms * 1.0e6));
@@ -386,6 +474,10 @@ void Tuner(int argc, char* argv[], const int V, GetTunerDefaultsFunc GetTunerDef
     }
   }
 
+  for (auto& thread : threads) {
+    thread.join();
+  }
+
   // Completed the tuning process
   print_separator(settings.parameters.size());
   printf("\n");
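The scheduling pattern in this file's diff can be summarised as follows: each worker thread handles configurations `id`, `id + num_threads`, `id + 2 * num_threads`, and so on, while the main loop consumes results strictly in order, waiting on a per-configuration condition variable. The sketch below reproduces that pattern in isolation under simplifying assumptions: `Slot`, `Produce`, and `ConsumeInOrder` are hypothetical names, squaring an index stands in for kernel compilation, and the lock is held only while publishing the result rather than during the whole job as in the commit.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One slot per work item, playing the role of ThreadInfo in the commit.
struct Slot {
  std::mutex mtx;
  std::condition_variable cv;
  bool ready = false;
  int value = 0;
};

// Worker: processes items first, first + stride, first + 2 * stride, ...
void Produce(std::vector<Slot>& slots, std::size_t first, std::size_t stride) {
  for (std::size_t i = first; i < slots.size(); i += stride) {
    const int result = static_cast<int>(i * i);  // Stand-in for compilation.
    {
      std::lock_guard<std::mutex> lock(slots[i].mtx);
      slots[i].value = result;
      slots[i].ready = true;
    }
    slots[i].cv.notify_one();  // Wake the consumer if it waits on this slot.
  }
}

// Consumer: collects the results in index order, blocking on each slot until
// its worker has published a value.
std::vector<int> ConsumeInOrder(std::size_t num_items, std::size_t num_workers) {
  assert(num_workers > 0);  // With zero workers the consumer would wait forever.
  std::vector<Slot> slots(num_items);
  std::vector<std::thread> workers;
  workers.reserve(num_workers);
  for (std::size_t w = 0; w < num_workers; ++w) {
    workers.emplace_back(Produce, std::ref(slots), w, num_workers);
  }
  std::vector<int> results;
  results.reserve(num_items);
  for (auto& slot : slots) {
    std::unique_lock<std::mutex> lock(slot.mtx);
    slot.cv.wait(lock, [&slot] { return slot.ready; });
    results.push_back(slot.value);
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return results;
}
```

Because the consumer walks the slots in order, the output order does not depend on which worker finishes first, which is why the tuner can still print its table rows sequentially. Workers whose starting index is past the end simply exit, so requesting more workers than items is harmless in this sketch.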

src/utilities/utilities.hpp

Lines changed: 2 additions & 0 deletions
@@ -115,6 +115,7 @@ constexpr auto kArgAnnMaxTemp = "ann_max_temperature";
 constexpr auto kArgPlatform = "platform";
 constexpr auto kArgDevice = "device";
 constexpr auto kArgPrecision = "precision";
+constexpr auto kArgNumThreads = "threads";
 constexpr auto kArgHelp = "h";
 constexpr auto kArgQuiet = "q";
 constexpr auto kArgNoAbbreviations = "no_abbrv";
@@ -265,6 +266,7 @@ struct Arguments {
   // Common arguments
   size_t platform_id = 0;
   size_t device_id = 0;
+  size_t extra_threads = 1;
   Precision precision = Precision::kSingle;
   bool print_help = false;
   bool silent = false;
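The new `extra_threads` field stores the `-threads` value minus one, since the main thread runs the kernels while the extra threads compile. Because the field is a `size_t`, that subtraction would wrap around to a very large number if a user passed `-threads 0`, assuming the argument parser returns the value unchanged. One possible guard is sketched below; `ExtraThreadsFromArgument` is a hypothetical helper and is not part of this commit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the commit): converts the requested total thread
// count into the number of extra compilation threads without wrapping around.
std::size_t ExtraThreadsFromArgument(std::size_t requested_threads) {
  // Treat 0 like 1: both mean "compile on the main thread only".
  const std::size_t extra = requested_threads > 0 ? requested_threads - 1 : 0;
  assert(extra <= requested_threads);  // Would fail if the subtraction wrapped.
  return extra;
}
```

With this mapping, both `-threads 0` and `-threads 1` would select the single-threaded path that the commit uses by default.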
