
[BUG]: Compile time for thrust::tabulate with large unary op functor #8122

@davidwendt

Description

Type of Bug

Build time

Component

Thrust

Describe the bug

The compile time for thrust::tabulate can be very slow when given a large unary op functor. Simple functors with only a few lines of code and very large functors with many lines of code both scale and compile equally well; there seems to be a range of sizes in between where the compile time and the output object file size both scale poorly.

I ran several simple tests using the nvcc compiler on functors ranging from simple to complex and small to large, and compared them against compiling the same code logic using thrust::transform with cuda::counting_iterator input parameters. This allows comparing equivalent functions that use the same functors.

Compile-time benchmark: tabulate vs transform [-O3]  [arch=sm_86, 3 runs]
Level                  tabulate (s) transform (s)       tab KB       tfm KB     diff (s)        ratio
-----------------------------------------------------------------------------------------------------
L1_trivial                     6.56         6.50       1160.2       1158.6        -0.06        0.991x
L2_sqrt_div                    6.58         6.56       1160.2       1158.7        -0.02        0.996x
L3_chained_math                6.76         6.57       1188.2       1166.7        -0.19        0.972x
L4_branchy_8                   6.69         6.60       1168.2       1158.7        -0.09        0.987x
L5_branchy_32                  6.83         6.62       1196.2       1162.7        -0.22        0.968x
L6_large_50stmts               8.39         6.97       1496.3       1230.7        -1.42        0.831x
L7_large_100stmts             11.27         7.46       2100.3       1306.7        -3.81        0.662x
L8_large_150stmts             14.76         8.03       2492.3       1382.7        -6.73        0.544x
L9_large_200stmts             16.59         7.95       2776.3       1370.7        -8.65        0.479x
L10_large_250stmts            20.45         8.45       3152.3       1422.7       -12.01        0.413x
L11_large_300stmts             7.36         7.32       1304.3       1298.7        -0.04        0.995x
L12_large_350stmts             7.56         7.45       1324.3       1322.7        -0.10        0.986x
L13_large_400stmts             7.76         7.70       1348.3       1346.7        -0.06        0.993x
L14_large_450stmts             7.94         7.87       1372.3       1370.7        -0.07        0.992x
L15_large_500stmts             8.14         8.09       1396.3       1394.7        -0.05        0.994x
L16_fat_struct_10              6.85         6.60       1204.3       1166.7        -0.25        0.963x
L17_fat_struct_50              8.28         6.85       1364.3       1202.7        -1.43        0.827x

Levels L1-L5 and L16-L17 are different styles of functors with relatively small amounts of code. These did not become a factor, so they are included here only as a baseline reference. Note that the transform column is quite linear and shows impressive compile times overall.

The range between 50 statements and 500 statements shows the issue with tabulate: it takes more than twice the time and produces an object file more than twice the size, peaking around 250 statements and then dropping back down at 300 statements.

Further investigation of this range shows a relatively sharp drop at 260 statements, then a linear increase up to 295, then a drop back to parity with transform around 300.

[Image: compile-time plot of the 250-300 statement probe]

This issue was found when building libcudf and was solved by changing some tabulate calls to the equivalent transform in this PR: rapidsai/cudf#21793
The PR description indicates a significant change in build time for those source files with larger unary ops.

How to Reproduce

This Python script was used to compare compiling a simple tabulate against its equivalent transform:
bench_compile.py
The script was generated by Claude, and it included some guesses about why the compile times differ which I'm very skeptical about. I can provide those guesses if you want.

The nvcc compiler used is 13.2, though we've seen the issue in libcudf with 12.8 as well.
The script hardcodes sm_86 to help minimize runtime a bit.
The script takes 2 parameters:

  • --O3 - adds the -O3 compiler optimization flag (this appeared to have no effect)
  • --probe-cliff - runs through the 250-300 statement tests with a step of 5

Expected behavior

I was not able to satisfactorily trace the difference in code paths between tabulate and transform; both eventually call CUB DeviceTransform.
I'm happy that transform is so well tuned, but it seems tabulate could be improved, if only by calling transform directly with cuda::counting_iterators.

Reproduction link

No response

Operating System

No response

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T400 4GB                On  |   00000000:01:00.0  On |                  N/A |
| 38%   38C    P5            N/A  /   31W |    1040MiB /   4096MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               On  |   00000000:C1:00.0 Off |                  Off |
| 30%   36C    P8             29W /  300W |      15MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3701      G   /usr/lib/xorg/Xorg                      325MiB |
|    0   N/A  N/A            3884      G   /usr/bin/gnome-shell                    251MiB |
|    0   N/A  N/A            3993      G   ...bin/snapd-desktop-integration         71MiB |
|    0   N/A  N/A            4491      G   ...exec/xdg-desktop-portal-gnome          2MiB |
|    0   N/A  N/A          654492      G   ...rack-uuid=3190708988185955192         95MiB |
|    0   N/A  N/A          686406      G   /usr/share/code/code                     32MiB |
|    0   N/A  N/A          707117      G   .../7967/usr/lib/firefox/firefox        180MiB |
|    1   N/A  N/A            3701      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0

We've seen this issue with 12.9 and 12.8 (I think) as well.
