CUDA 13.0 and 13.1 added, respectively, two JIT options for controlling the number of threads used during compilation:
- CU_JIT_SPLIT_COMPILE - for "running compiler optimizations"
- CU_JIT_BINARY_LOADER_THREAD_COUNT - for "device code compilation"
The distinction between the two is not entirely clear. Regardless, we should support both.
One problem: the PTX compilation library does not have a string option corresponding to CU_JIT_BINARY_LOADER_THREAD_COUNT.
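
To make the gap concrete, here is a rough sketch of how the two options might be translated into string options for the PTX compilation library. Everything in it is hypothetical: the `thread_count_option` enum and `to_ptx_compiler_option` function are illustrative names, not existing API, and the `--split-compile=N` spelling assumes the library accepts the same option as ptxas, which would need to be verified.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for the two driver JIT options; real code would
// use the corresponding CUjit_option enumerators from <cuda.h>.
enum class thread_count_option { split_compile, binary_loader_thread_count };

// Returns the string option to pass to the PTX compilation library,
// or std::nullopt when no string form is known for the option.
std::optional<std::string> to_ptx_compiler_option(
    thread_count_option opt, unsigned num_threads)
{
    switch (opt) {
    case thread_count_option::split_compile:
        // Assumption: same spelling as the ptxas --split-compile option.
        return "--split-compile=" + std::to_string(num_threads);
    case thread_count_option::binary_loader_thread_count:
        // No known string equivalent; the caller must decide whether to
        // ignore the setting, warn, or fail.
        return std::nullopt;
    }
    return std::nullopt;
}
```

With this shape, a caller would append the returned string to the option list when one is present, and the open question for CU_JIT_BINARY_LOADER_THREAD_COUNT reduces to choosing a policy for the `std::nullopt` case.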