
Commit de5c576

khuck authored and masterleinad committed
Adding occupancy tuning for CUDA architectures (kokkos#6788)
* Merging occupancy tuning changes from David Poliakoff.

  Note: This is a re-commit of a branch that somehow became polluted when I rebased on develop. I started over with the 5 changed files. The old Kokkos fork/branch (remote davidp: git@github.com:DavidPoliakoff/kokkos.git) was merged with current Kokkos develop and tested with ArborX to confirm that autotuning occupancy for the DBSCAN benchmark worked.

  In tests on a system with a V100, the original benchmark, iterated 600 times, took 119.064 seconds to run. During the tuning process (using simulated annealing), the runtime was 108.014 seconds. When using cached results, the runtime was 109.058 seconds. The converged occupancy value was 70.

  Here are the cached results from APEX autotuning:

      Input_1:
        name: kokkos.kernel_name
        id: 1
        info.type: string
        info.category: categorical
        info.valueQuantity: unbounded
        info.candidates: unbounded
        num_bins: 0
      Input_2:
        name: kokkos.kernel_type
        id: 2
        info.type: string
        info.category: categorical
        info.valueQuantity: set
        info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
      Output_3:
        name: ArborX::Experimental::HalfTraversal
        id: 3
        info.type: int64
        info.category: ratio
        info.valueQuantity: range
        info.candidates:
          lower: 5
          upper: 100
          step: 5
          open upper: 0
          open lower: 0
      Context_0:
        Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
        Converged: true
        Results:
          NumVars: 1
          id: 3
          value: 70

  In manual experiments, the ArborX team determined that the optimal occupancy for this example was between 40 and 90, which gave a 10% improvement over the baseline default of 100. See arborx/ArborX#815 for details.

  One deviation from the branch that David had written: the occupancy range is [5, 100] with a step size of 5. The original implementation in Kokkos used [1, 100] with a step size of 1.
* Fixing formatting check (not sure how those changes were reverted)
* Fixing problems with recursive Impl namespace, MDRange Reduce tuning, and OpenMP Reduce tuning.
  Now trying to fix Team tuning...
* Removing comments that failed format check
* Removing commented code
* Final code fixes; some formatting fixes are likely still needed.
* Expected formatting changes
* Yet another formatting fix...
* Removing default operators and copy constructors that aren't needed
* Update core/src/impl/Kokkos_Profiling.hpp

  Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
* Fixing formatting check
* Clang-format complained about a newline
* Update Kokkos_Profiling.hpp

  Minor fix to prevent incrementing the context id index when not calling `begin_context()`. Ideally, this should be refactored so that `begin_context()` increments the id and returns it. `end_context()` is the only location that decrements the context id index.
* Unify [begin|end]_parallel_* APIs
* Merge more functionality
* Update TestViewMapping_a test
* Remove Reducers_d from MSVC tests

---------

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
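The numbers in the commit message can be checked with a little arithmetic. The sketch below uses hypothetical helpers (`candidate_count`, `percent_reduction`; neither is Kokkos or APEX API) to count the occupancy candidates in each closed search range and the relative runtime reduction against the 119.064 s baseline. With [5, 100] and step 5 the tuner searches 20 candidates instead of the 100 in [1, 100] with step 1; the tuning run and the cached run reduced runtime by about 9.3% and 8.4%, respectively.

```cpp
#include <cassert>
#include <cstdint>

// Number of grid points lower, lower + step, ..., upper in a closed range
// (both ends included, matching "open upper: 0" and "open lower: 0").
std::int64_t candidate_count(std::int64_t lower, std::int64_t upper,
                             std::int64_t step) {
  assert(step > 0 && upper >= lower);
  return (upper - lower) / step + 1;
}

// Relative runtime reduction of `tuned` versus `baseline`, in percent.
double percent_reduction(double baseline, double tuned) {
  assert(baseline > 0.0);
  return (baseline - tuned) / baseline * 100.0;
}
```

For example, `percent_reduction(119.064, 108.014)` is about 9.28 and `percent_reduction(119.064, 109.058)` is about 8.40.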
1 parent ef560bf commit de5c576

8 files changed with 389 additions and 127 deletions

core/src/Kokkos_Parallel.hpp

Lines changed: 9 additions & 5 deletions
@@ -134,8 +134,10 @@ inline void parallel_for(const std::string& str, const ExecPolicy& policy,
                          const FunctorType& functor) {
   uint64_t kpID = 0;
 
-  ExecPolicy inner_policy = policy;
-  Kokkos::Tools::Impl::begin_parallel_for(inner_policy, functor, str, kpID);
+  /** Request a tuned policy from the tools subsystem */
+  const auto& response =
+      Kokkos::Tools::Impl::begin_parallel_for(policy, functor, str, kpID);
+  const auto& inner_policy = response.policy;
 
   auto closure =
       Kokkos::Impl::construct_with_shared_allocation_tracking_disabled<
@@ -348,9 +350,11 @@ template <class ExecutionPolicy, class FunctorType,
              std::enable_if_t<is_execution_policy<ExecutionPolicy>::value>>
 inline void parallel_scan(const std::string& str, const ExecutionPolicy& policy,
                           const FunctorType& functor) {
-  uint64_t kpID = 0;
-  ExecutionPolicy inner_policy = policy;
-  Kokkos::Tools::Impl::begin_parallel_scan(inner_policy, functor, str, kpID);
+  uint64_t kpID = 0;
+  /** Request a tuned policy from the tools subsystem */
+  const auto& response =
+      Kokkos::Tools::Impl::begin_parallel_scan(policy, functor, str, kpID);
+  const auto& inner_policy = response.policy;
 
   auto closure =
       Kokkos::Impl::construct_with_shared_allocation_tracking_disabled<
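Instead of copying the policy and letting the tool hook mutate the copy, `parallel_for` and `parallel_scan` now receive a response object whose `policy` member is the (possibly tuned) policy to run. The following is a minimal standalone sketch of that pattern; `SketchPolicy`, `SketchToolResponse`, and `sketch_begin_parallel_for` are hypothetical stand-ins, not the Kokkos types. It assumes the hook returns the response by value. In that case, binding the returned temporary to `const auto&` extends its lifetime to the enclosing scope, so `inner_policy` remains a valid reference.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins, not Kokkos types: a policy with an occupancy knob
// and a response wrapper shaped like the ToolResponse added in this commit.
struct SketchPolicy {
  int begin = 0;
  int end = 0;
  int desired_occupancy = 100;  // percent
};

template <typename PolicyType>
struct SketchToolResponse {
  PolicyType policy;  // the policy the kernel should actually run with
};

// Returns the response by value and never modifies the caller's policy.
// `tool_loaded` stands in for "a tuning tool is attached".
template <typename PolicyType>
SketchToolResponse<PolicyType> sketch_begin_parallel_for(
    const PolicyType& policy, const std::string& /*label*/,
    std::uint64_t& kpID, bool tool_loaded) {
  PolicyType tuned(policy);
  if (tool_loaded) tuned.desired_occupancy = 70;  // a value a tuner might pick
  kpID = 1;  // a real profiling tool would assign the kernel id
  return {tuned};
}

inline int run_sketch(const SketchPolicy& policy, bool tool_loaded) {
  std::uint64_t kpID = 0;
  // The returned temporary is bound to a const reference, which extends its
  // lifetime to the end of this scope, so `inner_policy` stays valid.
  const auto& response =
      sketch_begin_parallel_for(policy, "label", kpID, tool_loaded);
  const auto& inner_policy = response.policy;
  return inner_policy.desired_occupancy;
}
```

A side benefit of this shape is that the user's `policy` can stay `const` throughout, because all tuning happens on the copy held by the response.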

core/src/Kokkos_Parallel_Reduce.hpp

Lines changed: 10 additions & 9 deletions
@@ -1498,27 +1498,28 @@ struct ParallelReduceAdaptor {
     using PassedReducerType = typename return_value_adapter::reducer_type;
     uint64_t kpID = 0;
 
-    PolicyType inner_policy = policy;
-    Kokkos::Tools::Impl::begin_parallel_reduce<PassedReducerType>(
-        inner_policy, functor, label, kpID);
-
     using ReducerSelector =
         Kokkos::Impl::if_c<std::is_same<InvalidType, PassedReducerType>::value,
                            FunctorType, PassedReducerType>;
     using Analysis = FunctorAnalysis<FunctorPatternInterface::REDUCE,
                                      PolicyType, typename ReducerSelector::type,
                                      typename return_value_adapter::value_type>;
-
     using CombinedFunctorReducerType =
         CombinedFunctorReducer<FunctorType, typename Analysis::Reducer>;
+
+    CombinedFunctorReducerType functor_reducer(
+        functor, typename Analysis::Reducer(
+                     ReducerSelector::select(functor, return_value)));
+    const auto& response = Kokkos::Tools::Impl::begin_parallel_reduce<
+        typename return_value_adapter::reducer_type>(policy, functor_reducer,
+                                                     label, kpID);
+    const auto& inner_policy = response.policy;
+
     auto closure = construct_with_shared_allocation_tracking_disabled<
         Impl::ParallelReduce<CombinedFunctorReducerType, PolicyType,
                              typename Impl::FunctorPolicyExecutionSpace<
                                  FunctorType, PolicyType>::execution_space>>(
-        CombinedFunctorReducerType(
-            functor, typename Analysis::Reducer(
-                         ReducerSelector::select(functor, return_value))),
-        inner_policy,
+        functor_reducer, inner_policy,
         return_value_adapter::return_value(return_value, functor));
     closure.execute();
 
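In the reordered reduce path, `functor_reducer` is built once, before the tool hook runs, and the same object is then passed both to `begin_parallel_reduce` and to the closure. What goes into it depends on `ReducerSelector`, which picks the user functor when no reducer was passed (`InvalidType`) and the passed reducer otherwise. The type selection can be sketched with the standard `std::conditional_t`, which plays the same role as `Kokkos::Impl::if_c` here. `if_c` also provides the `select()` used in the diff to pick the matching object at runtime, which this sketch does not model. `SumFunctor`, `MinReducer`, and `identity()` are hypothetical.

```cpp
#include <type_traits>

// Hypothetical stand-ins, not Kokkos types. `InvalidType` marks "no reducer
// was passed to parallel_reduce".
struct InvalidType {};
struct SumFunctor {
  constexpr int identity() const { return 0; }
};
struct MinReducer {
  constexpr int identity() const { return 2147483647; }
};

// Same type selection as
//   Kokkos::Impl::if_c<std::is_same<InvalidType, PassedReducerType>::value,
//                      FunctorType, PassedReducerType>
// in the diff, written with the standard std::conditional_t.
template <class FunctorType, class PassedReducerType>
using ReducerSelectorT =
    std::conditional_t<std::is_same<InvalidType, PassedReducerType>::value,
                       FunctorType, PassedReducerType>;

// The identity comes from the functor when no reducer was passed, and from
// the reducer otherwise.
template <class FunctorType, class PassedReducerType>
constexpr int reduction_identity() {
  return ReducerSelectorT<FunctorType, PassedReducerType>{}.identity();
}
```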

core/src/Kokkos_Tuners.hpp

Lines changed: 115 additions & 4 deletions
@@ -52,6 +52,8 @@ VariableValue make_variable_value(size_t, int64_t);
 VariableValue make_variable_value(size_t, double);
 SetOrRange make_candidate_range(double lower, double upper, double step,
                                 bool openLower, bool openUpper);
+SetOrRange make_candidate_range(int64_t lower, int64_t upper, int64_t step,
+                                bool openLower, bool openUpper);
 size_t get_new_context_id();
 void begin_context(size_t context_id);
 void end_context(size_t context_id);
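The new `int64_t` overload of `make_candidate_range` is what lets `SingleDimensionalRangeTuner<int64_t>` (the class template added later in this file) declare an integer range. It forwards its bounds as `static_cast<Bound>(...)`, and with only the `double` overload an `int64_t` call would still compile but silently convert to a floating-point range. The sketch below illustrates the overload resolution; `RangeSketch`, `make_range_sketch`, and `declare_range` are hypothetical names, not Kokkos API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for SetOrRange that records which overload built it.
struct RangeSketch {
  bool is_integer;
  double lower, upper, step;
};

RangeSketch make_range_sketch(double lower, double upper, double step) {
  return {false, lower, upper, step};
}

RangeSketch make_range_sketch(std::int64_t lower, std::int64_t upper,
                              std::int64_t step) {
  return {true, static_cast<double>(lower), static_cast<double>(upper),
          static_cast<double>(step)};
}

// Mirrors the tuner's pattern: cast to Bound first, so overload resolution
// sees an exact match for either Bound = double or Bound = std::int64_t.
template <class Bound>
RangeSketch declare_range(Bound lower, Bound upper, Bound step) {
  assert(upper >= lower);
  return make_range_sketch(static_cast<Bound>(lower), static_cast<Bound>(upper),
                           static_cast<Bound>(step));
}
```

With both overloads present, a call with plain `int` arguments such as `make_range_sketch(5, 100, 5)` would be ambiguous, because `int` to `std::int64_t` and `int` to `double` are standard conversions of the same rank. Converting to `Bound` first avoids that ambiguity.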
@@ -420,10 +422,11 @@ class TeamSizeTuner : public ExtendableTunerMixin<TeamSizeTuner> {
   template <typename ViableConfigurationCalculator, typename Functor,
             typename TagType, typename... Properties>
   TeamSizeTuner(const std::string& name,
-                Kokkos::TeamPolicy<Properties...>& policy,
+                const Kokkos::TeamPolicy<Properties...>& policy_in,
                 const Functor& functor, const TagType& tag,
                 ViableConfigurationCalculator calc) {
-    using PolicyType = Kokkos::TeamPolicy<Properties...>;
+    using PolicyType = Kokkos::TeamPolicy<Properties...>;
+    PolicyType policy(policy_in);
     auto initial_vector_length = policy.impl_vector_length();
     if (initial_vector_length < 1) {
       policy.impl_set_vector_length(1);
@@ -505,7 +508,8 @@ class TeamSizeTuner : public ExtendableTunerMixin<TeamSizeTuner> {
   }
 
   template <typename... Properties>
-  void tune(Kokkos::TeamPolicy<Properties...>& policy) {
+  auto tune(const Kokkos::TeamPolicy<Properties...>& policy_in) {
+    Kokkos::TeamPolicy<Properties...> policy(policy_in);
     if (Kokkos::Tools::Experimental::have_tuning_tool()) {
       auto configuration = tuner.begin();
       auto team_size = std::get<1>(configuration);
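`TeamSizeTuner::tune` (and its constructor) now take the policy by `const&` and work on a local copy. `tune` returns that copy, so callers must use the returned policy rather than rely on in-place mutation, and temporaries can now be passed in. The shape of that change is sketched below with hypothetical types (`TeamPolicySketch`, `TeamTunerSketch`). The fixed `next` configuration stands in for the tool's answer, and the (vector length, team size) tuple order is an assumption modeled on the `std::get<1>(configuration)` team-size lookup in the diff.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical policy with the two knobs TeamSizeTuner adjusts.
struct TeamPolicySketch {
  int team_size = 0;
  int vector_length = 1;
};

// Hypothetical tuner; `next` stands in for the configuration a tuning tool
// would return. Assumed order: (vector_length, team_size).
struct TeamTunerSketch {
  bool have_tool;
  std::tuple<int, int> next;

  // Takes the caller's policy by const reference, tunes a local copy, and
  // returns it, mirroring the new TeamSizeTuner::tune signature.
  TeamPolicySketch tune(const TeamPolicySketch& policy_in) const {
    TeamPolicySketch policy(policy_in);
    if (have_tool) {
      policy.team_size = std::get<1>(next);
      policy.vector_length = std::get<0>(next);
    }
    return policy;
  }
};
```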
@@ -515,6 +519,111 @@ class TeamSizeTuner : public ExtendableTunerMixin<TeamSizeTuner> {
         policy.impl_set_vector_length(vector_length);
       }
     }
+    return policy;
+  }
+  void end() {
+    if (Kokkos::Tools::Experimental::have_tuning_tool()) {
+      tuner.end();
+    }
+  }
+
+  TunerType get_tuner() const { return tuner; }
+};
+namespace Impl {
+template <class T>
+struct tuning_type_for;
+
+template <>
+struct tuning_type_for<double> {
+  static constexpr Kokkos::Tools::Experimental::ValueType value =
+      Kokkos::Tools::Experimental::ValueType::kokkos_value_double;
+  static double get(
+      const Kokkos::Tools::Experimental::VariableValue& value_struct) {
+    return value_struct.value.double_value;
+  }
+};
+template <>
+struct tuning_type_for<int64_t> {
+  static constexpr Kokkos::Tools::Experimental::ValueType value =
+      Kokkos::Tools::Experimental::ValueType::kokkos_value_int64;
+  static int64_t get(
+      const Kokkos::Tools::Experimental::VariableValue& value_struct) {
+    return value_struct.value.int_value;
+  }
+};
+} // namespace Impl
+template <class Bound>
+class SingleDimensionalRangeTuner {
+  size_t id;
+  size_t context;
+  using tuning_util = Impl::tuning_type_for<Bound>;
+
+  Bound default_value;
+
+ public:
+  SingleDimensionalRangeTuner() = default;
+  SingleDimensionalRangeTuner(
+      const std::string& name,
+      Kokkos::Tools::Experimental::StatisticalCategory category,
+      Bound default_val, Bound lower, Bound upper, Bound step = (Bound)0) {
+    default_value = default_val;
+    Kokkos::Tools::Experimental::VariableInfo info;
+    info.category = category;
+    info.candidates = make_candidate_range(
+        static_cast<Bound>(lower), static_cast<Bound>(upper),
+        static_cast<Bound>(step), false, false);
+    info.valueQuantity =
+        Kokkos::Tools::Experimental::CandidateValueType::kokkos_value_range;
+    info.type = tuning_util::value;
+    id = Kokkos::Tools::Experimental::declare_output_type(name, info);
+  }
+
+  Bound begin() {
+    context = Kokkos::Tools::Experimental::get_new_context_id();
+    Kokkos::Tools::Experimental::begin_context(context);
+    auto tuned_value =
+        Kokkos::Tools::Experimental::make_variable_value(id, default_value);
+    Kokkos::Tools::Experimental::request_output_values(context, 1,
+                                                       &tuned_value);
+    return tuning_util::get(tuned_value);
+  }
+
+  void end() { Kokkos::Tools::Experimental::end_context(context); }
+
+  template <typename Functor>
+  void with_tuned_value(Functor& func) {
+    func(begin());
+    end();
+  }
+};
+
+class RangePolicyOccupancyTuner {
+ private:
+  using TunerType = SingleDimensionalRangeTuner<int64_t>;
+  TunerType tuner;
+
+ public:
+  RangePolicyOccupancyTuner() = default;
+  template <typename ViableConfigurationCalculator, typename Functor,
+            typename TagType, typename... Properties>
+  RangePolicyOccupancyTuner(const std::string& name,
+                            const Kokkos::RangePolicy<Properties...>&,
+                            const Functor&, const TagType&,
+                            ViableConfigurationCalculator)
+      : tuner(TunerType(name,
+                        Kokkos::Tools::Experimental::StatisticalCategory::
+                            kokkos_value_ratio,
+                        100, 5, 100, 5)) {}
+
+  template <typename... Properties>
+  auto tune(const Kokkos::RangePolicy<Properties...>& policy_in) {
+    Kokkos::RangePolicy<Properties...> policy(policy_in);
+    if (Kokkos::Tools::Experimental::have_tuning_tool()) {
+      auto occupancy = tuner.begin();
+      policy.impl_set_desired_occupancy(
+          Kokkos::Experimental::DesiredOccupancy{static_cast<int>(occupancy)});
+    }
+    return policy;
   }
   void end() {
     if (Kokkos::Tools::Experimental::have_tuning_tool()) {
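`tuning_type_for<Bound>` maps the bound type to its `ValueType` enumerator and to the member of `VariableValue::value` to read back (`double_value` or `int_value`). `SingleDimensionalRangeTuner` wraps one tuning round: `begin()` opens a fresh context, offers the default value, lets the tool overwrite it via `request_output_values`, and returns it; `end()` closes the context. `RangePolicyOccupancyTuner` builds one with `(100, 5, 100, 5)`, that is, default 100, range [5, 100], step 5, in the ratio category, and turns the result into a `DesiredOccupancy`. The sketch below models that flow with hypothetical types (`ValueSketch`, `tuning_type_for_sketch`, `RangeTunerSketch`), and a fixed `answer` stands in for the tool. One deliberate difference: `with_tuned_value` in the diff takes `Functor&`, which cannot bind a temporary lambda, so the sketch uses a forwarding reference instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for VariableValue: the active member depends on the
// declared type of the variable.
struct ValueSketch {
  union {
    double double_value;
    std::int64_t int_value;
  } value;
};

// Trait in the spirit of Impl::tuning_type_for: picks the member to use.
template <class T>
struct tuning_type_for_sketch;
template <>
struct tuning_type_for_sketch<double> {
  static double get(const ValueSketch& v) { return v.value.double_value; }
  static void set(ValueSketch& v, double x) { v.value.double_value = x; }
};
template <>
struct tuning_type_for_sketch<std::int64_t> {
  static std::int64_t get(const ValueSketch& v) { return v.value.int_value; }
  static void set(ValueSketch& v, std::int64_t x) { v.value.int_value = x; }
};

// Hypothetical single-dimension tuner. `answer` replaces the value a real
// tool would choose from [lower, upper].
template <class Bound>
class RangeTunerSketch {
  using util = tuning_type_for_sketch<Bound>;
  Bound default_value, answer;
  int open_contexts = 0;

 public:
  RangeTunerSketch(Bound default_val, Bound tool_answer)
      : default_value(default_val), answer(tool_answer) {}

  Bound begin() {
    ++open_contexts;              // like begin_context(context)
    ValueSketch v{};
    util::set(v, default_value);  // like make_variable_value(id, default)
    util::set(v, answer);         // the "tool" overwrites the request
    return util::get(v);
  }
  void end() { --open_contexts; }  // like end_context(context)
  int contexts() const { return open_contexts; }

  // Forwarding reference: accepts both named and temporary callables.
  template <typename Functor>
  void with_tuned_value(Functor&& func) {
    func(begin());
    end();
  }
};
```

For instance, `RangeTunerSketch<std::int64_t>(100, 70)` hands 70 to the callable and leaves no context open afterwards.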
@@ -578,11 +687,13 @@ struct MDRangeTuner : public ExtendableTunerMixin<MDRangeTuner<MDRangeRank>> {
     policy.impl_change_tile_size({std::get<Indices>(tuple)...});
   }
   template <typename... Properties>
-  void tune(Kokkos::MDRangePolicy<Properties...>& policy) {
+  auto tune(const Kokkos::MDRangePolicy<Properties...>& policy_in) {
+    Kokkos::MDRangePolicy<Properties...> policy(policy_in);
     if (Kokkos::Tools::Experimental::have_tuning_tool()) {
       auto configuration = tuner.begin();
       set_policy_tile(policy, configuration, std::make_index_sequence<rank>{});
     }
+    return policy;
   }
   void end() {
     if (Kokkos::Tools::Experimental::have_tuning_tool()) {
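`MDRangeTuner::tune` follows the same copy-and-return shape. `set_policy_tile` turns the tuner's tuple-valued configuration into a tile-size list by expanding `std::get<Indices>(tuple)...` into a braced initializer over `std::make_index_sequence<rank>`. A standalone sketch of that idiom follows; `MDPolicySketch`, `set_tiles_sketch`, and `tune_sketch` are hypothetical and only model the tile sizes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical policy that stores one tile extent per dimension.
template <std::size_t Rank>
struct MDPolicySketch {
  std::array<int, Rank> tiles{};
  void change_tile_size(const std::array<int, Rank>& t) { tiles = t; }
};

// Expands the tuple into a braced list, the same idiom as
// `{std::get<Indices>(tuple)...}` in set_policy_tile.
template <std::size_t Rank, typename Tuple, std::size_t... Indices>
void set_tiles_sketch(MDPolicySketch<Rank>& policy, const Tuple& tuple,
                      std::index_sequence<Indices...>) {
  policy.change_tile_size({std::get<Indices>(tuple)...});
}

// Copy-and-return, as in the new MDRangeTuner::tune.
template <std::size_t Rank, typename Tuple>
MDPolicySketch<Rank> tune_sketch(const MDPolicySketch<Rank>& policy_in,
                                 const Tuple& configuration) {
  MDPolicySketch<Rank> policy(policy_in);
  set_tiles_sketch(policy, configuration, std::make_index_sequence<Rank>{});
  return policy;
}
```

The tuple elements here must convert to `int` without narrowing, because they initialize a `std::array<int, Rank>` through list-initialization.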

core/src/impl/Kokkos_Profiling.hpp

Lines changed: 21 additions & 0 deletions
@@ -17,6 +17,15 @@
 #ifndef KOKKOS_IMPL_KOKKOS_PROFILING_HPP
 #define KOKKOS_IMPL_KOKKOS_PROFILING_HPP
 
+#ifndef KOKKOS_IMPL_PUBLIC_INCLUDE
+#define KOKKOS_IMPL_PUBLIC_INCLUDE
+#define KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
+#endif
+
+#include <Kokkos_Core_fwd.hpp>
+#include <Kokkos_ExecPolicy.hpp>
+#include <Kokkos_Macros.hpp>
+#include <Kokkos_Tuners.hpp>
 #include <impl/Kokkos_Profiling_Interface.hpp>
 #include <memory>
 #include <iosfwd>
@@ -64,6 +73,11 @@ void parse_command_line_arguments(int& narg, char* arg[],
 Kokkos::Tools::Impl::InitializationStatus parse_environment_variables(
     InitArguments& arguments);
 
+template <typename PolicyType, typename Functor>
+struct ToolResponse {
+  PolicyType policy;
+};
+
 } // namespace Impl
 
 bool profileLibraryLoaded();
@@ -260,6 +274,8 @@ size_t get_new_context_id();
 size_t get_current_context_id();
 } // namespace Experimental
 
+namespace Impl {} // namespace Impl
+
 } // namespace Tools
 namespace Profiling {
 
@@ -375,4 +391,9 @@ size_t get_new_variable_id();
 
 } // namespace Kokkos
 
+#ifdef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
+#undef KOKKOS_IMPL_PUBLIC_INCLUDE
+#undef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
+#endif
+
 #endif
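The first and last hunks of `Kokkos_Profiling.hpp` form a bracket around the header. If `KOKKOS_IMPL_PUBLIC_INCLUDE` is not already defined, the header defines it and also records that it did so (`..._NOTDEFINED_PROFILING`). At the end, it undefines both only in that case. This is presumably so that headers such as `Kokkos_ExecPolicy.hpp`, which are meant to be reached through the public Kokkos headers, accept being included from here, while a macro defined by an outer public include is left untouched. The following single-file simulation of the bracket uses hypothetical `MYLIB_*` macro names.

```cpp
// Single-file simulation of the include bracket; MYLIB_* names are
// hypothetical stand-ins for the KOKKOS_IMPL_* macros in the diff.

// --- top of the internal header ---
#ifndef MYLIB_IMPL_PUBLIC_INCLUDE
#define MYLIB_IMPL_PUBLIC_INCLUDE
#define MYLIB_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif

// Inside the bracket, headers included here would see the macro defined.
#ifdef MYLIB_IMPL_PUBLIC_INCLUDE
constexpr bool defined_inside_header = true;
#else
constexpr bool defined_inside_header = false;
#endif

// --- bottom of the internal header: undo only what this header defined ---
#ifdef MYLIB_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#undef MYLIB_IMPL_PUBLIC_INCLUDE
#undef MYLIB_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif

// After the bracket, the macro is gone again because nothing outside this
// simulated header had defined it.
#ifdef MYLIB_IMPL_PUBLIC_INCLUDE
constexpr bool defined_after_header = true;
#else
constexpr bool defined_after_header = false;
#endif
```

If an outer header had defined `MYLIB_IMPL_PUBLIC_INCLUDE` before this point, the `NOTDEFINED` marker would never be set, and the macro would survive the bottom block unchanged.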
