[Bug]: Tensor parallelism on A40 results in CUDA illegal instruction error #9086
Description
System Info
Nvidia A40 GPUs, AMD Epyc 7352 CPU, NGC 25.09 PyTorch image (CUDA 13.0, NCCL 2.27.7, PyTorch 2.9.0a0), TensorRT-LLM 1.2.0rc2.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Using the official quickstart_advanced.py script for the LLM API (PyTorch workflow) is enough to trigger the issue. In this case, we are using a Qwen3 1.7B model as an example:
python quickstart_advanced.py --tp_size 2 --pp_size 1 --max_batch_size 64 --use_cuda_graph --model_dir ./Qwen3-1.7B
Expected behavior
The model should load correctly and inference should proceed as normal.
Actual behavior
During initialization (specifically, during CUDA graph capture) a fatal `CUDA error: an illegal instruction was encountered` is raised. If CUDA graph capture is disabled, the model loads correctly and the error instead occurs on the first sufficiently large batch of requests. In general, the error only arises when `--max_batch_size` is large enough (>= 32 in our experiments).
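Since the traceback below bottoms out in a plain bf16 `F.linear` call (a `cublasGemmEx` GEMM), a standalone sketch along these lines may help isolate the problem from the TensorRT-LLM runtime; the shapes are illustrative guesses, not the exact ones used by the model:

```python
# Hypothetical minimal repro sketch: the crash happens inside F.linear on
# bf16 tensors (cublasGemmEx with CUDA_R_16BF). Running this on an A40 with
# a large batch may surface the same failure; on CPU it runs fine.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative sizes only; a large batch dimension mirrors --max_batch_size >= 32.
batch, hidden, out_features = 64, 2048, 4096

x = torch.randn(batch, hidden, dtype=torch.bfloat16, device=device)
w = torch.randn(out_features, hidden, dtype=torch.bfloat16, device=device)

y = F.linear(x, w)  # bf16 GEMM; the illegal-instruction error is raised here on A40
print(y.shape)
```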
Looking at the traceback (omitting some lines for brevity) reveals that the error happens during execution of a linear module:
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/attention.py", line 548, in forward
qkv = self.qkv_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 2041, in forward
output = self.apply_linear(input, self.bias, lora_params, layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 1994, in apply_linear
output = self.quant_method.apply(self, input, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 312, in apply
output = F.linear(input, module.weight, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Exception ignored in: <function PyTorchModelEngine.__del__ at 0x7fee8dfdde40>
Traceback (most recent call last):
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 976, in __del__
release_gc()
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_utils.py", line 728, in release_gc
torch.cuda.empty_cache()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
torch._C._cuda_emptyCache()
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Additional notes
The issue only occurs on A40 (compute capability 8.6), not on H100 or GB200. Furthermore, it is strictly tied to the use of tensor parallelism. The issue appears to be present only in TensorRT-LLM 1.2.0rc2; 1.2.0rc0.post1 does not suffer from it.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.