[polly][CUDA][clang] How to only enable polly flags for host code compilation when using clang++ for CUDA compilation? #143014

@BwL1289

Description

We are compiling CUDA libraries (e.g. cutlass, cudnn_frontend, nccl, etc.) from source, using clang++ for both host and device code.

With polly enabled (-mllvm=-polly -mllvm=-polly-vectorizer=stripmine) for both host and device code, compilation can take hours, with the time going to the link steps.

Without polly, compilation drops to a few minutes.

Since polly is only relevant for host code anyway, how can I enable the polly flags only for host code compilation and disable them for device code compilation?

Some relevant context:

  • We build with CMake; these are our default CUDAFLAGS. With them, the elapsed compilation time for cutlass was ~4 hours.
export CUDAFLAGS="
-O3 \
-flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility \
-mllvm=-polly -mllvm=-polly-vectorizer=stripmine \
-pipe -Qunused-arguments -fident -fcolor-diagnostics \
-Wno-cuda-compat
"
  • Removing polly, like the following, drops cutlass compilation time to ~9 minutes:
export CUDAFLAGS="
-O3 \
-flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility \
-pipe -Qunused-arguments -fident -fcolor-diagnostics \
-Wno-cuda-compat
"
  • We've also tried various other experiments, such as removing just --lto-whole-program-visibility. All of them yielded the same result: cutlass compilation time was ~4 hours, with 99% of the time spent on linking steps.

Let me know if there's other context I can provide.
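
One possible direction, offered as an untested sketch rather than a confirmed fix: clang's driver has -Xarch_host <arg> and -Xarch_device <arg>, which forward a single argument to only the host or only the device side of a CUDA compilation. The variant below assumes that the joined -mllvm=<arg> spelling is accepted when forwarded through -Xarch_host, which has not been verified here. POLLY_HOST_FLAGS is a hypothetical helper variable, not part of the original setup. This only scopes the flags at compile time, so it may not change what happens during the ThinLTO link steps where the time is being spent.

```sh
# Untested sketch: scope the polly flags to host compilation only.
# Assumption: -Xarch_host accepts the joined -mllvm=<arg> form.
# POLLY_HOST_FLAGS is a hypothetical helper name used for illustration.
POLLY_HOST_FLAGS="-Xarch_host -mllvm=-polly -Xarch_host -mllvm=-polly-vectorizer=stripmine"

# Same flags as the original CUDAFLAGS, with the polly options
# replaced by their host-only counterparts.
export CUDAFLAGS="
-O3 \
-flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility \
${POLLY_HOST_FLAGS} \
-pipe -Qunused-arguments -fident -fcolor-diagnostics \
-Wno-cuda-compat
"
```

To check whether the flags reach only the host side, one option is to compile a single .cu file with -### added, which prints the sub-commands the driver would run, and confirm that -polly appears in the host cc1 invocation but not in the device one. Note that -Qunused-arguments suppresses warnings about unused driver arguments, so it may be worth dropping it temporarily while testing.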

Labels: cuda, polly, question
