
Commit 2479cf8

shiltian authored and memfrob committed
[OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language
With this patch (plus some previously landed patches), `deviceRTLs` is treated as a regular OpenMP program containing only `declare target` regions. Ideally, `deviceRTLs` can now be written directly in OpenMP: no CUDA, no HIP. (AMD is still working on getting this to work; for now, AMDGCN is still compiled the original way.) Some target-specific functions are still required, but they are no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics.

The changes in this patch:

1. For NVPTX, `DEVICE` is defined empty so that the common parts still work with AMDGCN. Once AMDGCN is also available, we will remove `DEVICE` entirely, and probably some other macros as well.
2. Shared variables are implemented with an OpenMP allocator, defined in `allocator.h`. This feature is not yet available on AMDGCN either, so the two macros are redefined appropriately there.
3. The CUDA header `cuda.h` is dropped from the source code. To handle code differences across CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports is used, just as we currently do for CUDA compilation.
4. Correspondingly, the compiler driver is updated to support a CUDA version encoded in the name of the bitcode library. The bitcode library for NVPTX is now named `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, e.g. `libomptarget-nvptx-cuda_80-sm_20.bc`.

With this change, several follow-ups are expected in the near future:

1. CUDA will be dropped entirely when compiling OpenMP. At that point, we will also build bitcode libraries for all supported SM versions, multiplied by all supported CUDA versions.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` once the OpenMP 5.1 feature is fully supported. For now, the IR generated for it is wrong.
3. Target-specific parts will be wrapped into `declare variant` with an `isa` selector, if that works properly. No target-specific macros will be needed anymore.
4. (Maybe more...)

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D94745
1 parent adb2e59 commit 2479cf8

File tree

12 files changed: +207 −119 lines


clang/lib/Driver/ToolChains/Cuda.cpp

Lines changed: 20 additions & 22 deletions
@@ -712,33 +712,30 @@ void CudaToolChain::addClangTargetOptions(
   CC1Args.push_back("-mlink-builtin-bitcode");
   CC1Args.push_back(DriverArgs.MakeArgString(LibDeviceFile));
 
+  std::string CudaVersionStr;
+
   // New CUDA versions often introduce new instructions that are only supported
   // by new PTX version, so we need to raise PTX level to enable them in NVPTX
   // back-end.
   const char *PtxFeature = nullptr;
   switch (CudaInstallation.version()) {
-  case CudaVersion::CUDA_110:
-    PtxFeature = "+ptx70";
-    break;
-  case CudaVersion::CUDA_102:
-    PtxFeature = "+ptx65";
-    break;
-  case CudaVersion::CUDA_101:
-    PtxFeature = "+ptx64";
-    break;
-  case CudaVersion::CUDA_100:
-    PtxFeature = "+ptx63";
-    break;
-  case CudaVersion::CUDA_92:
-    PtxFeature = "+ptx61";
-    break;
-  case CudaVersion::CUDA_91:
-    PtxFeature = "+ptx61";
-    break;
-  case CudaVersion::CUDA_90:
-    PtxFeature = "+ptx60";
+#define CASE_CUDA_VERSION(CUDA_VER, PTX_VER)                                   \
+  case CudaVersion::CUDA_##CUDA_VER:                                           \
+    CudaVersionStr = #CUDA_VER;                                                \
+    PtxFeature = "+ptx" #PTX_VER;                                              \
     break;
+    CASE_CUDA_VERSION(110, 70);
+    CASE_CUDA_VERSION(102, 65);
+    CASE_CUDA_VERSION(101, 64);
+    CASE_CUDA_VERSION(100, 63);
+    CASE_CUDA_VERSION(92, 61);
+    CASE_CUDA_VERSION(91, 61);
+    CASE_CUDA_VERSION(90, 60);
+#undef CASE_CUDA_VERSION
   default:
+    // If unknown CUDA version, we take it as CUDA 8.0. Same assumption is also
+    // made in libomptarget/deviceRTLs.
+    CudaVersionStr = "80";
     PtxFeature = "+ptx42";
   }
   CC1Args.append({"-target-feature", PtxFeature});
@@ -784,8 +781,9 @@ void CudaToolChain::addClangTargetOptions(
   } else {
     bool FoundBCLibrary = false;
 
-    std::string LibOmpTargetName =
-        "libomptarget-nvptx-" + GpuArch.str() + ".bc";
+    std::string LibOmpTargetName = "libomptarget-nvptx-cuda_" +
+                                   CudaVersionStr + "-" + GpuArch.str() +
+                                   ".bc";
 
     for (StringRef LibraryPath : LibraryPaths) {
       SmallString<128> LibOmpTargetFile(LibraryPath);

clang/test/Driver/openmp-offload-gpu.c

Lines changed: 2 additions & 2 deletions
@@ -164,7 +164,7 @@
 // RUN:   -fopenmp-relocatable-target -save-temps -no-canonical-prefixes %s 2>&1 \
 // RUN:   | FileCheck -check-prefix=CHK-BCLIB-USER %s
 
-// CHK-BCLIB: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-sm_20.bc
+// CHK-BCLIB: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-cuda_80-sm_20.bc
 // CHK-BCLIB-USER: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-test.bc
 // CHK-BCLIB-NOT: {{error:|warning:}}
 
@@ -177,7 +177,7 @@
 // RUN:   -fopenmp-relocatable-target -save-temps -no-canonical-prefixes %s 2>&1 \
 // RUN:   | FileCheck -check-prefix=CHK-BCLIB-WARN %s
 
-// CHK-BCLIB-WARN: No library 'libomptarget-nvptx-sm_20.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.
+// CHK-BCLIB-WARN: No library 'libomptarget-nvptx-cuda_80-sm_20.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.
 
 /// ###########################################################################

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h

Lines changed: 2 additions & 1 deletion
@@ -26,7 +26,8 @@
 #define DEVICE __attribute__((device))
 #define INLINE inline DEVICE
 #define NOINLINE __attribute__((noinline)) DEVICE
-#define SHARED __attribute__((shared))
+#define SHARED(NAME) __attribute__((shared)) NAME
+#define EXTERN_SHARED(NAME) __attribute__((shared)) NAME
 #define ALIGN(N) __attribute__((aligned(N)))
 
 ////////////////////////////////////////////////////////////////////////////////
openmp/libomptarget/deviceRTLs/common/allocator.h

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
+//===--------- allocator.h - OpenMP target memory allocator ------- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Macros for allocating variables in different address spaces.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef OMPTARGET_ALLOCATOR_H
+#define OMPTARGET_ALLOCATOR_H
+
+#if _OPENMP
+// Follows the pattern in interface.h
+// Clang sema checks this type carefully, needs to closely match that from omp.h
+typedef enum omp_allocator_handle_t {
+  omp_null_allocator = 0,
+  omp_default_mem_alloc = 1,
+  omp_large_cap_mem_alloc = 2,
+  omp_const_mem_alloc = 3,
+  omp_high_bw_mem_alloc = 4,
+  omp_low_lat_mem_alloc = 5,
+  omp_cgroup_mem_alloc = 6,
+  omp_pteam_mem_alloc = 7,
+  omp_thread_mem_alloc = 8,
+  KMP_ALLOCATOR_MAX_HANDLE = ~(0U)
+} omp_allocator_handle_t;
+
+#define __PRAGMA(STR) _Pragma(#STR)
+#define OMP_PRAGMA(STR) __PRAGMA(omp STR)
+
+#define SHARED(NAME)                                                           \
+  NAME [[clang::loader_uninitialized]];                                        \
+  OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))
+
+#define EXTERN_SHARED(NAME)                                                    \
+  NAME;                                                                        \
+  OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))
+#endif
+
+#endif // OMPTARGET_ALLOCATOR_H

openmp/libomptarget/deviceRTLs/common/omptarget.h

Lines changed: 28 additions & 22 deletions
@@ -14,11 +14,12 @@
 #ifndef OMPTARGET_H
 #define OMPTARGET_H
 
-#include "target_impl.h"
-#include "common/debug.h" // debug
-#include "interface.h" // interfaces with omp, compiler, and user
+#include "common/allocator.h"
+#include "common/debug.h" // debug
 #include "common/state-queue.h"
 #include "common/support.h"
+#include "interface.h" // interfaces with omp, compiler, and user
+#include "target_impl.h"
 
 #define OMPTARGET_NVPTX_VERSION 1.1
 
@@ -71,8 +72,8 @@ class omptarget_nvptx_SharedArgs {
   uint32_t nArgs;
 };
 
-extern DEVICE SHARED omptarget_nvptx_SharedArgs
-    omptarget_nvptx_globalArgs;
+extern DEVICE
+    omptarget_nvptx_SharedArgs EXTERN_SHARED(omptarget_nvptx_globalArgs);
 
 // Worker slot type which is initialized with the default worker slot
 // size of 4*32 bytes.
@@ -94,7 +95,7 @@ struct DataSharingStateTy {
   __kmpc_impl_lanemask_t ActiveThreads[DS_Max_Warp_Number];
 };
 
-extern DEVICE SHARED DataSharingStateTy DataSharingState;
+extern DEVICE DataSharingStateTy EXTERN_SHARED(DataSharingState);
 
 ////////////////////////////////////////////////////////////////////////////////
 // task ICV and (implicit & explicit) task state
@@ -273,9 +274,9 @@ class omptarget_nvptx_ThreadPrivateContext {
 /// Memory manager for statically allocated memory.
 class omptarget_nvptx_SimpleMemoryManager {
 private:
-  ALIGN(128) struct MemDataTy {
+  struct MemDataTy {
     volatile unsigned keys[OMP_STATE_COUNT];
-  } MemData[MAX_SM];
+  } MemData[MAX_SM] ALIGN(128);
 
   INLINE static uint32_t hash(unsigned key) {
     return key & (OMP_STATE_COUNT - 1);
@@ -294,27 +295,32 @@ class omptarget_nvptx_SimpleMemoryManager {
 
 extern DEVICE omptarget_nvptx_SimpleMemoryManager
     omptarget_nvptx_simpleMemoryManager;
-extern DEVICE SHARED uint32_t usedMemIdx;
-extern DEVICE SHARED uint32_t usedSlotIdx;
-extern DEVICE SHARED uint8_t
-    parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
-extern DEVICE SHARED uint16_t threadLimit;
-extern DEVICE SHARED uint16_t threadsInTeam;
-extern DEVICE SHARED uint16_t nThreads;
-extern DEVICE SHARED
-    omptarget_nvptx_ThreadPrivateContext *omptarget_nvptx_threadPrivateContext;
-
-extern DEVICE SHARED uint32_t execution_param;
-extern DEVICE SHARED void *ReductionScratchpadPtr;
+extern DEVICE uint32_t EXTERN_SHARED(usedMemIdx);
+extern DEVICE uint32_t EXTERN_SHARED(usedSlotIdx);
+#if _OPENMP
+extern DEVICE uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
+#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)
+#else
+extern DEVICE
+    uint8_t EXTERN_SHARED(parallelLevel)[MAX_THREADS_PER_TEAM / WARPSIZE];
+#endif
+extern DEVICE uint16_t EXTERN_SHARED(threadLimit);
+extern DEVICE uint16_t EXTERN_SHARED(threadsInTeam);
+extern DEVICE uint16_t EXTERN_SHARED(nThreads);
+extern DEVICE omptarget_nvptx_ThreadPrivateContext *
+    EXTERN_SHARED(omptarget_nvptx_threadPrivateContext);
+
+extern DEVICE uint32_t EXTERN_SHARED(execution_param);
+extern DEVICE void *EXTERN_SHARED(ReductionScratchpadPtr);
 
 ////////////////////////////////////////////////////////////////////////////////
 // work function (outlined parallel/simd functions) and arguments.
 // needed for L1 parallelism only.
 ////////////////////////////////////////////////////////////////////////////////
 
 typedef void *omptarget_nvptx_WorkFn;
-extern volatile DEVICE SHARED omptarget_nvptx_WorkFn
-    omptarget_nvptx_workFn;
+extern volatile DEVICE
+    omptarget_nvptx_WorkFn EXTERN_SHARED(omptarget_nvptx_workFn);
 
 ////////////////////////////////////////////////////////////////////////////////
 // get private data structures

openmp/libomptarget/deviceRTLs/common/src/omp_data.cu

Lines changed: 17 additions & 16 deletions
@@ -11,8 +11,9 @@
 //===----------------------------------------------------------------------===//
 #pragma omp declare target
 
-#include "common/omptarget.h"
+#include "common/allocator.h"
 #include "common/device_environment.h"
+#include "common/omptarget.h"
 
 ////////////////////////////////////////////////////////////////////////////////
 // global device environment
@@ -28,44 +29,44 @@ DEVICE
 omptarget_nvptx_Queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>
     omptarget_nvptx_device_State[MAX_SM];
 
-DEVICE omptarget_nvptx_SimpleMemoryManager
-    omptarget_nvptx_simpleMemoryManager;
-DEVICE SHARED uint32_t usedMemIdx;
-DEVICE SHARED uint32_t usedSlotIdx;
+DEVICE omptarget_nvptx_SimpleMemoryManager omptarget_nvptx_simpleMemoryManager;
+DEVICE uint32_t SHARED(usedMemIdx);
+DEVICE uint32_t SHARED(usedSlotIdx);
 
-DEVICE SHARED uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
-DEVICE SHARED uint16_t threadLimit;
-DEVICE SHARED uint16_t threadsInTeam;
-DEVICE SHARED uint16_t nThreads;
+DEVICE uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
+#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)
+DEVICE uint16_t SHARED(threadLimit);
+DEVICE uint16_t SHARED(threadsInTeam);
+DEVICE uint16_t SHARED(nThreads);
 // Pointer to this team's OpenMP state object
-DEVICE SHARED
-    omptarget_nvptx_ThreadPrivateContext *omptarget_nvptx_threadPrivateContext;
+DEVICE omptarget_nvptx_ThreadPrivateContext *
+    SHARED(omptarget_nvptx_threadPrivateContext);
 
 ////////////////////////////////////////////////////////////////////////////////
 // The team master sets the outlined parallel function in this variable to
 // communicate with the workers. Since it is in shared memory, there is one
 // copy of these variables for each kernel, instance, and team.
 ////////////////////////////////////////////////////////////////////////////////
-volatile DEVICE SHARED omptarget_nvptx_WorkFn omptarget_nvptx_workFn;
+volatile DEVICE omptarget_nvptx_WorkFn SHARED(omptarget_nvptx_workFn);
 
 ////////////////////////////////////////////////////////////////////////////////
 // OpenMP kernel execution parameters
 ////////////////////////////////////////////////////////////////////////////////
-DEVICE SHARED uint32_t execution_param;
+DEVICE uint32_t SHARED(execution_param);
 
 ////////////////////////////////////////////////////////////////////////////////
 // Data sharing state
 ////////////////////////////////////////////////////////////////////////////////
-DEVICE SHARED DataSharingStateTy DataSharingState;
+DEVICE DataSharingStateTy SHARED(DataSharingState);
 
 ////////////////////////////////////////////////////////////////////////////////
 // Scratchpad for teams reduction.
 ////////////////////////////////////////////////////////////////////////////////
-DEVICE SHARED void *ReductionScratchpadPtr;
+DEVICE void *SHARED(ReductionScratchpadPtr);
 
 ////////////////////////////////////////////////////////////////////////////////
 // Data sharing related variables.
 ////////////////////////////////////////////////////////////////////////////////
-DEVICE SHARED omptarget_nvptx_SharedArgs omptarget_nvptx_globalArgs;
+DEVICE omptarget_nvptx_SharedArgs SHARED(omptarget_nvptx_globalArgs);
 
 #pragma omp end declare target

openmp/libomptarget/deviceRTLs/common/src/reduction.cu

Lines changed: 2 additions & 2 deletions
@@ -208,8 +208,8 @@ EXTERN int32_t __kmpc_nvptx_teams_reduce_nowait_v2(
                           : /*Master thread only*/ 1;
   uint32_t TeamId = GetBlockIdInKernel();
   uint32_t NumTeams = GetNumberOfBlocksInKernel();
-  static SHARED unsigned Bound;
-  static SHARED unsigned ChunkTeamCount;
+  static unsigned SHARED(Bound);
+  static unsigned SHARED(ChunkTeamCount);
 
   // Block progress for teams greater than the current upper
   // limit. We always only allow a number of teams less or equal
