Closed
Labels
cuda, polly, question (A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!)
Description
We are compiling CUDA libraries (e.g. cutlass, cudnn_frontend, nccl) from source using clang++ for both host and device code.
With Polly enabled (-mllvm=-polly -mllvm=-polly-vectorizer=stripmine) for both host and device code, compilation can take hours, almost all of it at link time.
Without Polly, compilation drops to a few minutes.
Since Polly is only relevant for host code anyway, how can I enable the Polly flags only for host code compilation and disable them for device code compilation?
Some relevant context:
- Using CMake, here are our default CUDAFLAGS. With these, the elapsed compilation time for cutlass was ~4 hours.
export CUDAFLAGS="
-O3 \
-flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility \
-mllvm=-polly -mllvm=-polly-vectorizer=stripmine \
-pipe -Qunused-arguments -fident -fcolor-diagnostics \
-Wno-cuda-compat
"
- Removing Polly, as in the following, drops the cutlass compilation time to ~9 minutes:
export CUDAFLAGS="
-O3 \
-flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility \
-pipe -Qunused-arguments -fident -fcolor-diagnostics \
-Wno-cuda-compat
"
- We've also tried various experiments, such as removing just --lto-whole-program-visibility. All other experiments yielded the same result: cutlass compilation time was ~4 hours, with 99% of the time spent on linking steps.
Let me know if there's other context I can provide.
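For reference, here is a rough sketch of the kind of split being asked about. It assumes clang's -Xarch_host option (documented as passing its argument to the CUDA/HIP host compilation only) can be used to wrap each Polly flag. The COMMON_FLAGS and POLLY_HOST_FLAGS variable names are only illustrative. This has not been verified: whether -Xarch_host accepts the joined -mllvm=<arg> form may depend on the clang version, and it is unclear whether it has any effect on the LTO link step, where most of the time is reported to go.

```shell
# Flags shared by host and device compilation
# (the CUDAFLAGS listed above, minus the Polly options).
COMMON_FLAGS="-O3 -flto=thin -ffat-lto-objects -Wl,--lto-whole-program-visibility -pipe -Qunused-arguments -fident -fcolor-diagnostics -Wno-cuda-compat"

# Assumption: each -Xarch_host forwards exactly one following argument
# to the host compilation only, so every Polly flag gets its own prefix.
# Untested; acceptance of the joined -mllvm=<arg> form may vary by clang version.
POLLY_HOST_FLAGS="-Xarch_host -mllvm=-polly -Xarch_host -mllvm=-polly-vectorizer=stripmine"

export CUDAFLAGS="${COMMON_FLAGS} ${POLLY_HOST_FLAGS}"

# Print the result so it can be checked before handing it to CMake.
echo "${CUDAFLAGS}"
```

If the driver rejects this form, the fallback would be to keep Polly out of CUDAFLAGS entirely and apply it only to non-CUDA host sources.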