[Bug]: Tensor parallelism on A40 results in CUDA illegal instruction error #9086
Description
System Info
Nvidia A40 GPUs, AMD Epyc 7352 CPU, NGC 25.09 PyTorch image (CUDA 13.0, NCCL 2.27.7, PyTorch 2.9.0a0), TensorRT-LLM 1.2.0rc2.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Using the official quickstart_advanced.py script for the LLM API (PyTorch workflow) is enough to trigger the issue. In this case, we are using a Qwen3 1.7B model as an example:
python quickstart_advanced.py --tp_size 2 --pp_size 1 --max_batch_size 64 --use_cuda_graph --model_dir ./Qwen3-1.7B
Expected behavior
The model should load correctly and inference should proceed as normal.
Actual behavior
During initialization (specifically, during CUDA graph capture) a fatal `CUDA error: an illegal instruction was encountered` is raised. If CUDA graph capture is disabled, the model loads correctly and the error instead occurs on the first sufficiently large batch of requests. In general, the error only arises when `--max_batch_size` is large enough (>= 32 in our experiments).
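Since the traceback below bottoms out in a plain bf16 `F.linear` call (a `cublasGemmEx` GEMM), a standalone sketch along these lines may help isolate the problem from the TensorRT-LLM runtime; the shapes are illustrative guesses, not the exact ones used by the model:

```python
# Hypothetical minimal repro sketch: the crash happens inside F.linear on
# bf16 tensors (cublasGemmEx with CUDA_R_16BF). Running this on an A40 with
# a large batch may surface the same failure; on CPU it runs fine.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative sizes only; a large batch dimension mirrors --max_batch_size >= 32.
batch, hidden, out_features = 64, 2048, 4096

x = torch.randn(batch, hidden, dtype=torch.bfloat16, device=device)
w = torch.randn(out_features, hidden, dtype=torch.bfloat16, device=device)

y = F.linear(x, w)  # bf16 GEMM; the illegal-instruction error is raised here on A40
print(y.shape)
```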
Looking at the traceback (omitting some lines for brevity) reveals that the error happens during execution of a linear module:
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/attention.py", line 548, in forward
qkv = self.qkv_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 2041, in forward
output = self.apply_linear(input, self.bias, lora_params, layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 1994, in apply_linear
output = self.quant_method.apply(self, input, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/modules/linear.py", line 312, in apply
output = F.linear(input, module.weight, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Exception ignored in: <function PyTorchModelEngine.__del__ at 0x7fee8dfdde40>
Traceback (most recent call last):
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 976, in __del__
release_gc()
File "/virtualenv/lib/python3.12/site-packages/tensorrt_llm/_utils.py", line 728, in release_gc
torch.cuda.empty_cache()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
torch._C._cuda_emptyCache()
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Additional notes
The issue only occurs on A40 (compute capability 8.6), not on H100 or GB200. Furthermore, it is strictly tied to the use of tensor parallelism. The issue appears to be present only in TensorRT-LLM 1.2.0rc2; 1.2.0rc0.post1 does not suffer from it.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.