Description
Type of Bug
Build time
Component
Thrust
Describe the bug
The compile time for thrust::tabulate can be very slow when given a large unary-op functor. Small functors with only a few lines of code and very large functors with many lines of code both compile quickly and scale well; there seems to be a middle range of sizes where the compile time and the output object-file size both scale poorly.
I ran several simple tests with the nvcc compiler on simple-to-complex and small-to-large functors, and compared them to compiling the same logic using thrust::transform with cuda::counting_iterator input parameters. This allows comparing equivalent functionality with identical functors.
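For reference, the paired calls being timed look roughly like this. This is a minimal sketch, not the benchmark source: the functor body is a placeholder (the script generates bodies of varying size), and names like `unary_op` and `fill` are illustrative only.

```cuda
#include <thrust/device_vector.h>
#include <thrust/tabulate.h>
#include <thrust/transform.h>
#include <cuda/iterator>  // cuda::counting_iterator

// Placeholder op; the benchmark generates functor bodies of varying size.
struct unary_op {
  __device__ float operator()(int i) const
  {
    return sqrtf(static_cast<float>(i)) / 3.0f;
  }
};

void fill(thrust::device_vector<float>& out)
{
  // tabulate: out[i] = op(i)
  thrust::tabulate(out.begin(), out.end(), unary_op{});

  // equivalent transform over a counting iterator
  auto idx = cuda::counting_iterator<int>{0};
  thrust::transform(idx, idx + out.size(), out.begin(), unary_op{});
}
```

Both forms compute the same result with the same functor, which is what makes the compile-time comparison apples-to-apples.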
Compile-time benchmark: tabulate vs transform [-O3] [arch=sm_86, 3 runs]
Level tabulate (s) transform (s) tabulate obj (KB) transform obj (KB) diff (s) ratio
-----------------------------------------------------------------------------------------------------
L1_trivial 6.56 6.50 1160.2 1158.6 -0.06 0.991x
L2_sqrt_div 6.58 6.56 1160.2 1158.7 -0.02 0.996x
L3_chained_math 6.76 6.57 1188.2 1166.7 -0.19 0.972x
L4_branchy_8 6.69 6.60 1168.2 1158.7 -0.09 0.987x
L5_branchy_32 6.83 6.62 1196.2 1162.7 -0.22 0.968x
L6_large_50stmts 8.39 6.97 1496.3 1230.7 -1.42 0.831x
L7_large_100stmts 11.27 7.46 2100.3 1306.7 -3.81 0.662x
L8_large_150stmts 14.76 8.03 2492.3 1382.7 -6.73 0.544x
L9_large_200stmts 16.59 7.95 2776.3 1370.7 -8.65 0.479x
L10_large_250stmts 20.45 8.45 3152.3 1422.7 -12.01 0.413x
L11_large_300stmts 7.36 7.32 1304.3 1298.7 -0.04 0.995x
L12_large_350stmts 7.56 7.45 1324.3 1322.7 -0.10 0.986x
L13_large_400stmts 7.76 7.70 1348.3 1346.7 -0.06 0.993x
L14_large_450stmts 7.94 7.87 1372.3 1370.7 -0.07 0.992x
L15_large_500stmts 8.14 8.09 1396.3 1394.7 -0.05 0.994x
L16_fat_struct_10 6.85 6.60 1204.3 1166.7 -0.25 0.963x
L17_fat_struct_50 8.28 6.85 1364.3 1202.7 -1.43 0.827x
L1-L5 and L16-L17 exercise different styles of functors with relatively small amounts of code. These did not exhibit the problem and are included only as a baseline reference. Note that the transform column scales quite linearly and shows impressive compile times overall.
The 50-statement through 500-statement rows show the issue with tabulate: it takes over 2x the time and produces a roughly 2x larger object file, peaking around 250 statements and then dropping back down at 300 statements.
Further investigation of this range shows a relatively sharp drop at 260 statements, then times increase linearly up to 295 before dropping back to match transform around 300.
This issue was found while building libcudf and was worked around by changing some tabulate calls to the equivalent transform in this PR: rapidsai/cudf#21793
The PR description indicates a significant improvement in build time for those source files with larger unary ops.
How to Reproduce
This Python script was used to compare compiling a simple tabulate call and its equivalent transform:
bench_compile.py
The script was generated by Claude, and it included some guesses as to why the compile times differ which I'm very skeptical about; I can provide those guesses if wanted.
The nvcc compiler used is 13.2, though we've seen the issue in libcudf with 12.8 as well.
The script hardcodes sm_86 to help minimize runtime a bit.
The script takes 2 parameters:
- `--O3` - adds the `-O3` compile optimization parameter (this appeared to have no effect)
- `--probe-cliff` - runs through the 250-300 statement tests with a step of 5
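For illustration, the invocations would look like this (assuming the attached bench_compile.py is in the current directory):

```shell
# Assumes the attached bench_compile.py is in the current directory.
python bench_compile.py                # default run over all levels
python bench_compile.py --O3           # add -O3 (appeared to have no effect)
python bench_compile.py --probe-cliff  # probe 250-300 statements in steps of 5
```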
Expected behavior
I was not able to satisfactorily trace down the difference in code paths between tabulate and transform; both eventually call CUB's DeviceTransform.
I'm happy that transform is so well tuned, but it seems tabulate could be improved, if only by calling transform directly with cuda::counting_iterator.
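As a sketch of that suggestion (hypothetical, not the actual Thrust source), tabulate could forward to transform over a counting iterator so both algorithms share one code path:

```cuda
#include <cstddef>
#include <thrust/transform.h>
#include <cuda/iterator>  // cuda::counting_iterator

// Hypothetical wrapper, not the actual Thrust implementation: express
// tabulate(first, last, op) as a transform over cuda::counting_iterator,
// so it takes the same (well-tuned) path as thrust::transform.
template <typename Iterator, typename UnaryOp>
void tabulate_via_transform(Iterator first, Iterator last, UnaryOp op)
{
  auto idx = cuda::counting_iterator<std::ptrdiff_t>{0};
  thrust::transform(idx, idx + (last - first), first, op);
}
```

If the compile-time cliff is specific to tabulate's current code path, a wrapper along these lines would presumably inherit transform's near-linear compile-time scaling.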
Reproduction link
No response
Operating System
No response
nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T400 4GB On | 00000000:01:00.0 On | N/A |
| 38% 38C P5 N/A / 31W | 1040MiB / 4096MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:C1:00.0 Off | Off |
| 30% 36C P8 29W / 300W | 15MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3701 G /usr/lib/xorg/Xorg 325MiB |
| 0 N/A N/A 3884 G /usr/bin/gnome-shell 251MiB |
| 0 N/A N/A 3993 G ...bin/snapd-desktop-integration 71MiB |
| 0 N/A N/A 4491 G ...exec/xdg-desktop-portal-gnome 2MiB |
| 0 N/A N/A 654492 G ...rack-uuid=3190708988185955192 95MiB |
| 0 N/A N/A 686406 G /usr/share/code/code 32MiB |
| 0 N/A N/A 707117 G .../7967/usr/lib/firefox/firefox 180MiB |
| 1 N/A N/A 3701 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
NVCC version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0
Though we've seen this issue with 12.9 and 12.8 (I think) as well.