@jhuber6
Contributor

@jhuber6 jhuber6 commented Mar 18, 2025

Summary:
The --offload-arch option is very complicated, but roughly behaves as
the -march option for several compilations at once. This creates
problems when we try to combine multiple separate architectures into
one, as happens with SYCL, OpenMP, and HIP w/ SPIR-V.

The existing solution used by OpenMP is the -Xopenmp-target option,
which lets you select which --offload-arch options go to which
toolchain. This patch permits -Xarch_ to be used in the same way.

There are concerns about whether this falls under the -Xarch_
umbrella, because it changes the driver behaviour, but I think this is the
easiest way to handle the problem. The existing approach seems to be
prefixing things and adding more magic handling to --offload-arch.
For example, SYCL uses nvidia_gpu_sm_89 instead of just -Xarch_nvptx64 --offload-arch=sm_89.

The only reason this is more complicated than just doing -Xarch_sm_89 -march=... is that we need to know to create multiple jobs, one for each
architecture.
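
As a rough illustration of the routing described above, here is a toy Python model (not clang's driver code). It assumes that `-Xarch_<name>` matches a toolchain when `<name>` equals the first component of the target triple, and it ignores unprefixed `--offload-arch=` values entirely. The function names `route_offload_archs` and `device_jobs` are hypothetical; the flags and triples come from the test added in this patch.

```python
def route_offload_archs(argv, toolchains):
    """Toy model: map each toolchain triple to the --offload-arch values
    passed to it via a preceding -Xarch_<name> argument."""
    routed = {triple: [] for triple in toolchains}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("-Xarch_") and i + 1 < len(argv):
            name = arg[len("-Xarch_"):]
            forwarded = argv[i + 1]
            if forwarded.startswith("--offload-arch="):
                values = forwarded.split("=", 1)[1].split(",")
                for triple in toolchains:
                    # Assumption: match on the arch component of the triple.
                    if triple.split("-")[0] == name:
                        routed[triple].extend(values)
            i += 2  # skip the -Xarch_ flag and the argument it guards
            continue
        i += 1  # everything else is ignored in this sketch
    return routed


def device_jobs(routed):
    """One (triple, arch) device job per routed architecture."""
    return [(triple, arch) for triple, archs in routed.items() for arch in archs]


if __name__ == "__main__":
    argv = [
        "-fopenmp=libomp",
        "-fopenmp-targets=nvptx64-nvidia-cuda,amdgcn-amd-amdhsa",
        "-Xarch_nvptx64", "--offload-arch=sm_52,sm_60",
        "-Xarch_amdgcn", "--offload-arch=gfx90a,gfx1030",
    ]
    toolchains = ["nvptx64-nvidia-cuda", "amdgcn-amd-amdhsa"]
    for job in device_jobs(route_offload_archs(argv, toolchains)):
        print(job)
```

In this model the two prefixed options yield four device jobs, two per toolchain, which is the reason a plain `-Xarch_... -march=...` would not be enough on its own.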

@llvmbot llvmbot added the clang (Clang issues not falling into any other category) and clang:driver ('clang' and 'clang++' user-facing binaries. Not 'clang-cl') labels Mar 18, 2025
@llvmbot
Member

llvmbot commented Mar 18, 2025

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
The --offload-arch option is very complicated, but roughly behaves as
the -march option for several compilations at once. This creates
problems when we try to combine multiple separate architectures into
one, as happens with SYCL, OpenMP, and HIP w/ SPIR-V.

The existing solution used by OpenMP is the -Xopenmp-target option,
this lets you select which --offload-arch options go to which
toolchain. This patch permits -Xarch_ to be used in the same way.

There are concerns about whether or not this falls into the -Xarch_
umbrella because it changes the driver behavior, but I think this is the
easiest way to handle this problem. The existing solution seems to be
prefixing things and adding more magic handling into --offload-arch.
Like SPIRV is doing nvidia_gpu_sm_89 instead of just -Xarch_nvptx64 --offload-arch=sm_89.

The only reason this is more complicated than just doing -Xarch_sm_89 -march=... is because we need to know to create multiple jobs for each
architecture.


Full diff: https://github.com/llvm/llvm-project/pull/131884.diff

2 Files Affected:

  • (modified) clang/include/clang/Driver/Options.td (+1-2)
  • (modified) clang/test/Driver/offload-Xarch.c (+4)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 66ae8f1c7f064..05fc6aaa266b5 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1129,13 +1129,12 @@ def fno_convergent_functions : Flag<["-"], "fno-convergent-functions">,
 // Common offloading options
 let Group = offload_Group in {
 def offload_arch_EQ : Joined<["--"], "offload-arch=">,
-  Visibility<[ClangOption, FlangOption]>, Flags<[NoXarchOption]>,
+  Visibility<[ClangOption, FlangOption]>,
   HelpText<"Specify an offloading device architecture for CUDA, HIP, or OpenMP. (e.g. sm_35). "
            "If 'native' is used the compiler will detect locally installed architectures. "
            "For HIP offloading, the device architecture can be followed by target ID features "
            "delimited by a colon (e.g. gfx908:xnack+:sramecc-). May be specified more than once.">;
 def no_offload_arch_EQ : Joined<["--"], "no-offload-arch=">,
-  Flags<[NoXarchOption]>,
   Visibility<[ClangOption, FlangOption]>,
   HelpText<"Remove CUDA/HIP offloading device architecture (e.g. sm_35, gfx906) from the list of devices to compile for. "
            "'all' resets the list to its default value.">;
diff --git a/clang/test/Driver/offload-Xarch.c b/clang/test/Driver/offload-Xarch.c
index 8856dac198465..8106dcfcd1354 100644
--- a/clang/test/Driver/offload-Xarch.c
+++ b/clang/test/Driver/offload-Xarch.c
@@ -14,6 +14,10 @@
 // RUN:   --target=x86_64-unknown-linux-gnu -Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_52,sm_60 -nogpuinc \
 // RUN:   -Xopenmp-target=amdgcn-amd-amdhsa --offload-arch=gfx90a,gfx1030 -ccc-print-bindings -### %s 2>&1 \
 // RUN: | FileCheck -check-prefix=OPENMP %s
+// RUN: %clang -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda,amdgcn-amd-amdhsa -nogpulib \
+// RUN:   --target=x86_64-unknown-linux-gnu -Xarch_nvptx64 --offload-arch=sm_52,sm_60 -nogpuinc \
+// RUN:   -Xarch_amdgcn --offload-arch=gfx90a,gfx1030 -ccc-print-bindings -### %s 2>&1 \
+// RUN: | FileCheck -check-prefix=OPENMP %s
 
 // OPENMP: # "x86_64-unknown-linux-gnu" - "clang", inputs: ["[[INPUT:.+]]"], output: "[[HOST_BC:.+]]"
 // OPENMP: # "amdgcn-amd-amdhsa" - "clang", inputs: ["[[INPUT]]", "[[HOST_BC]]"], output: "[[GFX1030_BC:.+]]"

@llvmbot
Member

llvmbot commented Mar 18, 2025

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)


@bader
Contributor

bader commented Mar 18, 2025

Like SPIRV is doing nvidia_gpu_sm_89 instead of just -Xarch_nvptx64 --offload-arch=sm_89.

Like SYCL?

Contributor

@bader bader left a comment

@jhuber6, thank you for helping with the common offload infrastructure!

It seems that if I want to target an NVIDIA RTX 4080, I have to provide at least four flags:

  1. offloading mode (e.g. -fopenmp)
  2. offloading target (e.g. -fopenmp-targets=nvptx64)
  3. offloading architecture using technically two flags: -Xarch_ and --offload-arch= (e.g. -Xarch_nvptx64 --offload-arch=sm_89)

As a user, I wish to have a simpler command-line interface when I don't need to configure the device toolchain and just want to specify the exact device to tune for. At the same time, I agree that we need this interface for configuring device toolchains.
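
To make the flag count concrete, here is a small Python sketch that assembles the command line from the list above. The helper name `openmp_offload_argv` and the source file `example.c` are hypothetical; the flag spellings are taken from this comment as written and are not checked against the driver here.

```python
def openmp_offload_argv(source, target, arch):
    """Assemble the four offloading flags listed above for a single device."""
    return [
        "clang",
        "-fopenmp",                    # 1. offloading mode
        f"-fopenmp-targets={target}",  # 2. offloading target
        f"-Xarch_{target}",            # 3a. scope the next option to that target
        f"--offload-arch={arch}",      # 3b. device architecture
        source,
    ]


if __name__ == "__main__":
    # sm_89 is the architecture used for the RTX 4080 example above.
    print(" ".join(openmp_offload_argv("example.c", "nvptx64", "sm_89")))
```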

Tagging @mdtoguchi, @Naghasan for awareness.

@jhuber6
Contributor Author

jhuber6 commented Mar 18, 2025

Thanks, @Artem-B had the initial hangups, so I'll defer to him for the final +1. I'd prefer this solution to continuously prefixing things in offload-arch however.

As a user, I wish to have a simpler command-line interface when I don't need to configure the device toolchain and just want to specify the exact device to tune for. At the same time, I agree that we need this interface for configuring device toolchains.

Yeah, I think things necessarily start getting complicated when you combine many different architectures into one clang job. We could theoretically just keep putting things in --offload-arch but soon the complexity gets pretty similar.

@jhuber6 jhuber6 merged commit 561dcb2 into llvm:main Mar 21, 2025
14 checks passed
@llvm-ci
Collaborator

llvm-ci commented Mar 21, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime running on omp-vega20-0 while building clang at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/18117

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/gpupgo/pgo2.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo2.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo2.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate
# note: command had no output on stdout or stderr
# RUN: at line 2
env LLVM_PROFILE_FILE=pgo2.c.llvm.profraw      /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo2.c.tmp 2>&1
# executed command: env LLVM_PROFILE_FILE=pgo2.c.llvm.profraw /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo2.c.tmp
# note: command had no output on stdout or stderr
# RUN: at line 4
llvm-profdata show --all-functions --counts      pgo2.c.llvm.profraw | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c      --check-prefix="LLVM-HOST"
# executed command: llvm-profdata show --all-functions --counts pgo2.c.llvm.profraw
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c --check-prefix=LLVM-HOST
# note: command had no output on stdout or stderr
# RUN: at line 7
llvm-profdata show --all-functions --counts      amdgcn-amd-amdhsa.pgo2.c.llvm.profraw      | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c --check-prefix="LLVM-DEVICE"
# executed command: llvm-profdata show --all-functions --counts amdgcn-amd-amdhsa.pgo2.c.llvm.profraw
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c --check-prefix=LLVM-DEVICE
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c:81:17: error: LLVM-DEVICE: expected string not found in input
# | // LLVM-DEVICE: Block counts: [10, 2, 1]
# |                 ^
# | <stdin>:4:13: note: scanning from here
# |  Counters: 3
# |             ^
# | <stdin>:5:2: note: possible intended match here
# |  Block counts: [10, 3, 1]
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo2.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             1: Counters: 
# |             2:  __omp_offloading_802_b3a8121_main_l61: 
# |             3:  Hash: 0x07735b6a1ad4d6e5 
# |             4:  Counters: 3 
# | check:81'0                 X error: no match found
# |             5:  Block counts: [10, 3, 1] 
# | check:81'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~
# | check:81'1      ?                         possible intended match
...

6 participants