Description
Type of Bug
Build time
Component
Thrust
Describe the bug
The compile time for thrust::tabulate can be very slow when given a large unary-op functor. Small functors with only a few lines of code and very large functors with many lines of code both compile quickly and scale well; there seems to be a middle range of sizes where the compile time and the output object-file size both scale poorly.
I ran several simple tests with the nvcc compiler on simple-to-complex and small-to-large functors, and compared them to compiling the same logic using thrust::transform with cuda::counting_iterator input parameters. This allows comparing equivalent functionality with identical functors.
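For reference, the paired calls being timed look roughly like this. This is a minimal sketch, not the benchmark source: the functor body is a placeholder (the script generates bodies of varying size), and names like `unary_op` and `fill` are illustrative only.

```cuda
#include <thrust/device_vector.h>
#include <thrust/tabulate.h>
#include <thrust/transform.h>
#include <cuda/iterator>  // cuda::counting_iterator

// Placeholder op; the benchmark generates functor bodies of varying size.
struct unary_op {
  __device__ float operator()(int i) const
  {
    return sqrtf(static_cast<float>(i)) / 3.0f;
  }
};

void fill(thrust::device_vector<float>& out)
{
  // tabulate: out[i] = op(i)
  thrust::tabulate(out.begin(), out.end(), unary_op{});

  // equivalent transform over a counting iterator
  auto idx = cuda::counting_iterator<int>{0};
  thrust::transform(idx, idx + out.size(), out.begin(), unary_op{});
}
```

Both forms compute the same result with the same functor, which is what makes the compile-time comparison apples-to-apples.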
Compile-time benchmark: tabulate vs transform [-O3] [arch=sm_86, 3 runs]
Level tabulate (s) transform (s) tabulate obj (KB) transform obj (KB) diff (s) ratio
-----------------------------------------------------------------------------------------------------
L1_trivial 6.56 6.50 1160.2 1158.6 -0.06 0.991x
L2_sqrt_div 6.58 6.56 1160.2 1158.7 -0.02 0.996x
L3_chained_math 6.76 6.57 1188.2 1166.7 -0.19 0.972x
L4_branchy_8 6.69 6.60 1168.2 1158.7 -0.09 0.987x
L5_branchy_32 6.83 6.62 1196.2 1162.7 -0.22 0.968x
L6_large_50stmts 8.39 6.97 1496.3 1230.7 -1.42 0.831x
L7_large_100stmts 11.27 7.46 2100.3 1306.7 -3.81 0.662x
L8_large_150stmts 14.76 8.03 2492.3 1382.7 -6.73 0.544x
L9_large_200stmts 16.59 7.95 2776.3 1370.7 -8.65 0.479x
L10_large_250stmts 20.45 8.45 3152.3 1422.7 -12.01 0.413x
L11_large_300stmts 7.36 7.32 1304.3 1298.7 -0.04 0.995x
L12_large_350stmts 7.56 7.45 1324.3 1322.7 -0.10 0.986x
L13_large_400stmts 7.76 7.70 1348.3 1346.7 -0.06 0.993x
L14_large_450stmts 7.94 7.87 1372.3 1370.7 -0.07 0.992x
L15_large_500stmts 8.14 8.09 1396.3 1394.7 -0.05 0.994x
L16_fat_struct_10 6.85 6.60 1204.3 1166.7 -0.25 0.963x
L17_fat_struct_50 8.28 6.85 1364.3 1202.7 -1.43 0.827x
L1-L5 and L16-L17 exercise different styles of functors with relatively small amounts of code. These did not exhibit the problem and are included only as a baseline reference. Note that the transform column scales quite linearly and shows impressive compile times overall.
The 50-statement through 500-statement rows show the issue with tabulate: it takes over 2x the time and produces a roughly 2x larger object file, peaking around 250 statements and then dropping back down at 300 statements.
Further investigation of this range shows a relatively sharp drop at 260 statements, then times increase linearly up to 295 before dropping back to match transform around 300.
This issue was found while building libcudf and was worked around by changing some tabulate calls to the equivalent transform in this PR: rapidsai/cudf#21793
The PR description indicates a significant improvement in build time for those source files with larger unary ops.
How to Reproduce
This Python script was used to compare compiling a simple tabulate call and its equivalent transform:
bench_compile.py
The script was generated by Claude, and it included some guesses as to why the compile times differ which I'm very skeptical about; I can provide those guesses if wanted.
The nvcc compiler used is 13.2, though we've seen the issue in libcudf with 12.8 as well.
The script hardcodes sm_86 to help minimize runtime a bit.
The script takes 2 parameters:
- `--O3` - adds the `-O3` compile optimization parameter (this appeared to have no effect)
- `--probe-cliff` - runs through the 250-300 statement tests with a step of 5
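For illustration, the invocations would look like this (assuming the attached bench_compile.py is in the current directory):

```shell
# Assumes the attached bench_compile.py is in the current directory.
python bench_compile.py                # default run over all levels
python bench_compile.py --O3           # add -O3 (appeared to have no effect)
python bench_compile.py --probe-cliff  # probe 250-300 statements in steps of 5
```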
Expected behavior
I was not able to satisfactorily trace down the difference in code paths between tabulate and transform; both eventually call CUB's DeviceTransform.
I'm happy that transform is so well tuned, but it seems tabulate could be improved, if only by calling transform directly with cuda::counting_iterator.
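As a sketch of that suggestion (hypothetical, not the actual Thrust source), tabulate could forward to transform over a counting iterator so both algorithms share one code path:

```cuda
#include <cstddef>
#include <thrust/transform.h>
#include <cuda/iterator>  // cuda::counting_iterator

// Hypothetical wrapper, not the actual Thrust implementation: express
// tabulate(first, last, op) as a transform over cuda::counting_iterator,
// so it takes the same (well-tuned) path as thrust::transform.
template <typename Iterator, typename UnaryOp>
void tabulate_via_transform(Iterator first, Iterator last, UnaryOp op)
{
  auto idx = cuda::counting_iterator<std::ptrdiff_t>{0};
  thrust::transform(idx, idx + (last - first), first, op);
}
```

If the compile-time cliff is specific to tabulate's current code path, a wrapper along these lines would presumably inherit transform's near-linear compile-time scaling.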
Reproduction link
No response
Operating System
No response
nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T400 4GB On | 00000000:01:00.0 On | N/A |
| 38% 38C P5 N/A / 31W | 1040MiB / 4096MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:C1:00.0 Off | Off |
| 30% 36C P8 29W / 300W | 15MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3701 G /usr/lib/xorg/Xorg 325MiB |
| 0 N/A N/A 3884 G /usr/bin/gnome-shell 251MiB |
| 0 N/A N/A 3993 G ...bin/snapd-desktop-integration 71MiB |
| 0 N/A N/A 4491 G ...exec/xdg-desktop-portal-gnome 2MiB |
| 0 N/A N/A 654492 G ...rack-uuid=3190708988185955192 95MiB |
| 0 N/A N/A 686406 G /usr/share/code/code 32MiB |
| 0 N/A N/A 707117 G .../7967/usr/lib/firefox/firefox 180MiB |
| 1 N/A N/A 3701 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
NVCC version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0
Though we've seen this issue with 12.9 and 12.8 (I think) as well.