Description
System Info
NVIDIA B200
Ubuntu 24.04
NVIDIA Driver 580.65.06
TensorRT-LLM version: https://github.com/NVIDIA/TensorRT-LLM/tree/fba351a211021e345ef0e76a9439a81af0e7c785 (commit from Oct 6, 2025)
Checkpoint: https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B-FP8
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
trtllm-serve command
CUDA_VISIBLE_DEVICES=0 trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 1
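The error output below itself suggests `CUDA_LAUNCH_BLOCKING=1` for debugging; a hedged rerun of the same reproduction command with synchronous kernel launches (useful for pinpointing the faulting kernel) would look like:

```shell
# Same reproduction command, with synchronous kernel launches so the Python
# stack trace points at the kernel that actually faults (per the hint in the
# error output). Note that TORCH_USE_CUDA_DSA mentioned in the error is a
# build-time flag, not an environment variable, so it is not set here.
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 \
  trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 1
```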
Expected behavior
The FP8 model should be served successfully through trtllm-serve.
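The loader treats this checkpoint as pre-quantized because its config.json carries a `quantization_config` field (see the "Found quantization_config field ... pre-quantized checkpoint is used" log line below). A minimal illustrative sketch of that kind of check, assuming the standard HF `quant_method` field with value `"fp8"` for this checkpoint (not TensorRT-LLM's actual loader code):

```python
# Illustrative sketch only: recognize a pre-quantized FP8 checkpoint from the
# HF config.json. The "quant_method": "fp8" value is an assumption for this
# checkpoint; other quantizers (AWQ, GPTQ, ...) use different values.

def is_prequantized_fp8(config: dict) -> bool:
    """Return True if an HF model config declares an FP8 quantization_config."""
    qcfg = config.get("quantization_config")
    if not isinstance(qcfg, dict):
        return False
    return str(qcfg.get("quant_method", "")).lower() == "fp8"

# Minimal stand-in for the relevant slice of the checkpoint's config.json.
example_config = {
    "model_type": "exaone4",
    "quantization_config": {"quant_method": "fp8"},
}
print(is_prequantized_fp8(example_config))  # prints True
```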
Actual behavior
Error message
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
warnings.warn(
[10/09/2025-08:03:24] [TRT-LLM] [I] Using LLM with PyTorch backend
[10/09/2025-08:03:24] [TRT-LLM] [I] Set nccl_plugin to None.
[10/09/2025-08:03:24] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[10/09/2025-08:03:24] [TRT-LLM] [I] Found quantization_config field in EXAONE-4.0-32B-FP8/config.json, pre-quantized checkpoint is used.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type exaone4 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
EXAONE-4.0-32B-FP8
rank 0 using MpiPoolSession to spawn MPI processes
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
Multiple distributions found for package optimum. Picked distribution: optimum
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[10/09/2025-08:03:36] [TRT-LLM] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
EXAONE-4.0-32B-FP8
[10/09/2025-08:03:37] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[10/09/2025-08:03:37] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
`torch_dtype` is deprecated! Use `dtype` instead!
[10/09/2025-08:03:37] [TRT-LLM] [I] Use 30.79 GB for model weights.
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching 30.79GB checkpoint files.
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00006-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00002-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00004-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00003-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00007-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00001-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00005-of-00007.safetensors to memory...
[10/09/2025-08:03:41] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00007-of-00007.safetensors.
[10/09/2025-08:03:43] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00004-of-00007.safetensors.
[10/09/2025-08:03:43] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00003-of-00007.safetensors.
[10/09/2025-08:03:44] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00002-of-00007.safetensors.
[10/09/2025-08:03:44] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00006-of-00007.safetensors.
[10/09/2025-08:03:45] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00001-of-00007.safetensors.
[10/09/2025-08:03:45] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00005-of-00007.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 7/7 [00:00<00:00, 53.37it/s]
Loading weights: 100%|██████████| 1353/1353 [00:05<00:00, 245.54it/s]
Model init total -- 13.90s
[10/09/2025-08:03:51] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 131072
[10/09/2025-08:03:51] [TRT-LLM] [I] Using Sampler: TorchSampler
[10/09/2025-08:03:51] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8999999761581421 and 8224 with free memory 36.22554016113281 of total memory 44.587799072265625, respectively). The smaller value will be used.
[10/09/2025-08:03:51] [TRT-LLM] [W] Attention window size 131073 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted max_attention_window_vec to [8224]
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted window size 131073 to 8224 in blocks_per_window
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=8224], tokens per block=32, primary blocks=257, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.01 GiB for max tokens in paged KV cache (8224).
[10/09/2025-08:03:51] [TRT-LLM] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[10/09/2025-08:03:51] [TRT-LLM] [I] cache_transceiver is disabled
[10/09/2025-08:03:51] [TRT-LLM] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 118055936 bytes
[10/09/2025-08:04:05] [TRT-LLM] [I] [Autotuner] Autotuning process ends
[10/09/2025-08:04:05] [TRT-LLM] [E] Failed to initialize executor on rank 0: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-08:04:05] [TRT-LLM] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 371, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 69, in __init__
self.setup_engine()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 187, in setup_engine
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 157, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 605, in create_py_executor
py_executor = create_py_executor_instance(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 740, in create_py_executor_instance
return PyExecutor(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 249, in __init__
self.model_engine.warmup(self.resource_manager)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 432, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 676, in warmup
self.forward(batch,
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 73, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2203, in forward
outputs = self._forward_step(inputs, gather_ids,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2266, in _forward_step
logits = self.model_forward(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2250, in model_forward
return self.model.forward(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 538, in forward
output = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 231, in forward
hidden_states = decoder_layer(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 177, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/gated_mlp.py", line 148, in forward
h1 = self.gate_up_proj(x)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1993, in forward
output = self.apply_linear(input, self.bias, lora_params, layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1946, in apply_linear
output = self.quant_method.apply(self, input, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 629, in apply
output = torch.ops.trtllm.fp8_swap_ab_gemm(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1208, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 344, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 51, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 377, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_ops/torch_custom_ops.py", line 971, in fp8_swap_ab_gemm
_, best_tactic = tuner.choose_one(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 603, in choose_one
tensors = self._prepare_input_tensors(p, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 929, in _prepare_input_tensors
tensor = self._create_tensor_like(inputs[i], p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 921, in _create_tensor_like
return torch.zeros(shapes, dtype=dtype, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-08:04:05] [TRT-LLM] [I] get signal from executor worker
[10/09/2025-08:04:05] [TRT-LLM] [E] Executor worker initialization error: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 371, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 69, in __init__
self.setup_engine()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 187, in setup_engine
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 157, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 605, in create_py_executor
py_executor = create_py_executor_instance(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 740, in create_py_executor_instance
return PyExecutor(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 249, in __init__
self.model_engine.warmup(self.resource_manager)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 432, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 676, in warmup
self.forward(batch,
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 73, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2203, in forward
outputs = self._forward_step(inputs, gather_ids,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2266, in _forward_step
logits = self.model_forward(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2250, in model_forward
return self.model.forward(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 538, in forward
output = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 231, in forward
hidden_states = decoder_layer(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 177, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/gated_mlp.py", line 148, in forward
h1 = self.gate_up_proj(x)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1993, in forward
output = self.apply_linear(input, self.bias, lora_params, layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1946, in apply_linear
output = self.quant_method.apply(self, input, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 629, in apply
output = torch.ops.trtllm.fp8_swap_ab_gemm(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1208, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 344, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 51, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 377, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_ops/torch_custom_ops.py", line 971, in fp8_swap_ab_gemm
_, best_tactic = tuner.choose_one(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 603, in choose_one
tensors = self._prepare_input_tensors(p, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 929, in _prepare_input_tensors
tensor = self._create_tensor_like(inputs[i], p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 921, in _create_tensor_like
return torch.zeros(shapes, dtype=dtype, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/trtllm-serve", line 7, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception ignored in: <function PyTorchModelEngine.__del__ at 0x7d01c6f61260>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 841, in __del__
release_gc()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_utils.py", line 724, in release_gc
torch.cuda.empty_cache()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
torch._C._cuda_emptyCache()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 384, in serve
launch_server(host, port, llm_args, metadata_server_cfg, server_role)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
llm = PyTorchLLM(**llm_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1098, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 983, in __init__
super().__init__(model,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 228, in __init__
self._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1042, in _build_model
self._executor = self._executor_cls.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/executor.py", line 510, in create
return GenerationExecutorProxy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 107, in __init__
self._start_executor_workers(worker_kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 341, in _start_executor_workers
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaStreamDestroy(stream): an illegal memory access was encountered (/src/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaStream.h:128)
1 0x7d02ce65ba4b void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 139
2 0x7d02ce6dc2f2 std::_Sp_counted_ptr_inplace<tensorrt_llm::runtime::CudaStream, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 98
3 0x7d02d2aa7cc4 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 44
4 0x7d02ac8c9646 std::_Sp_counted_ptr_inplace<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheTransferManager, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 118
5 0x7d02d2aa7cc4 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 44
6 0x7d02ac8b0084 tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::~WindowBlockManager() + 372
7 0x7d02ac8c66fa tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::~KVCacheManager() + 282
8 0x7d02d2b7ffb3 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x118fb3) [0x7d02d2b7ffb3]
9 0x575ece /usr/bin/python() [0x575ece]
10 0x575c1c /usr/bin/python() [0x575c1c]
11 0x59ecc5 /usr/bin/python() [0x59ecc5]
12 0x575ece /usr/bin/python() [0x575ece]
13 0x575c1c /usr/bin/python() [0x575c1c]
14 0x59ecc5 /usr/bin/python() [0x59ecc5]
15 0x575ece /usr/bin/python() [0x575ece]
16 0x575c1c /usr/bin/python() [0x575c1c]
17 0x59ecc5 /usr/bin/python() [0x59ecc5]
18 0x5799c2 /usr/bin/python() [0x5799c2]
19 0x59eae9 /usr/bin/python() [0x59eae9]
20 0x558e61 /usr/bin/python() [0x558e61]
21 0x610215 /usr/bin/python() [0x610215]
22 0x610225 /usr/bin/python() [0x610225]
23 0x610225 /usr/bin/python() [0x610225]
24 0x610225 /usr/bin/python() [0x610225]
25 0x610225 /usr/bin/python() [0x610225]
26 0x5529d1 /usr/bin/python() [0x5529d1]
27 0x61cc7d /usr/bin/python() [0x61cc7d]
28 0x61bf41 /usr/bin/python() [0x61bf41]
29 0x5fbc63 /usr/bin/python() [0x5fbc63]
30 0x5e149f _PyEval_EvalFrameDefault + 46159
31 0x549d57 /usr/bin/python() [0x549d57]
32 0x54b5e3 PyObject_CallMethodObjArgs + 227
33 0x5fd3d5 PyImport_ImportModuleLevelObject + 917
34 0x5dbc9a _PyEval_EvalFrameDefault + 23626
35 0x5d500b PyEval_EvalCode + 347
36 0x5d2dfc /usr/bin/python() [0x5d2dfc]
37 0x581bcd /usr/bin/python() [0x581bcd]
38 0x5dad16 _PyEval_EvalFrameDefault + 19654
39 0x549d57 /usr/bin/python() [0x549d57]
40 0x54b5e3 PyObject_CallMethodObjArgs + 227
41 0x5fd3d5 PyImport_ImportModuleLevelObject + 917
42 0x5dbc9a _PyEval_EvalFrameDefault + 23626
43 0x5d500b PyEval_EvalCode + 347
44 0x5d2dfc /usr/bin/python() [0x5d2dfc]
45 0x581bcd /usr/bin/python() [0x581bcd]
46 0x5dad16 _PyEval_EvalFrameDefault + 19654
47 0x549d57 /usr/bin/python() [0x549d57]
48 0x54b5e3 PyObject_CallMethodObjArgs + 227
49 0x5fd3d5 PyImport_ImportModuleLevelObject + 917
50 0x5dbc9a _PyEval_EvalFrameDefault + 23626
51 0x5d500b PyEval_EvalCode + 347
52 0x5d2dfc /usr/bin/python() [0x5d2dfc]
53 0x581bcd /usr/bin/python() [0x581bcd]
54 0x5dad16 _PyEval_EvalFrameDefault + 19654
55 0x549d57 /usr/bin/python() [0x549d57]
56 0x54b5e3 PyObject_CallMethodObjArgs + 227
57 0x5fd3d5 PyImport_ImportModuleLevelObject + 917
58 0x5dbc9a _PyEval_EvalFrameDefault + 23626
59 0x6e1ad3 /usr/bin/python() [0x6e1ad3]
60 0x6b10ee Py_FinalizeEx + 78
raise RuntimeError(
61 0x6bcb01 Py_RunMain + 641
62 0x6bc71d Py_BytesMain + 45
63 0x7d05d73311ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7d05d73311ca]
64 0x7d05d733128b __libc_start_main + 139
65 0x6575a5 _start + 37
[:04705] *** Process received signal ***
[:04705] Signal: Aborted (6)
[:04705] Signal code: (-6)
[:04705] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7d05d734c330]
[:04705] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d05d73a5b2c]
RuntimeError: Executor worker returned error
[:04705] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d05d734c27e]
[:04705] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d05d732f8ff]
[:04705] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7d05c0134ff5]
[:04705] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7d05c014a0da]
[:04705] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7d05c01348e6]
[:04705] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7d05c01498ba]
[:04705] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7d05cc051b06]
[:04705] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7d05cc0521f1]
[:04705] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x44)[0x7d05c014a384]
[:04705] [11] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZN12tensorrt_llm6common5checkI9cudaErrorEEvT_PKcS5_i+0xc7)[0x7d02ce65ba87]
[:04705] [12] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZNSt23_Sp_counted_ptr_inplaceIN12tensorrt_llm7runtime10CudaStreamESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x62)[0x7d02ce6dc2f2]
[:04705] [13] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x2c)[0x7d02d2aa7cc4]
[:04705] [14] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZNSt23_Sp_counted_ptr_inplaceIN12tensorrt_llm13batch_manager16kv_cache_manager22KVCacheTransferManagerESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x76)[0x7d02ac8c9646]
[:04705] [15] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x2c)[0x7d02d2aa7cc4]
[:04705] [16] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManagerD1Ev+0x174)[0x7d02ac8b0084]
[:04705] [17] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManagerD1Ev+0x11a)[0x7d02ac8c66fa]
[:04705] [18] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x118fb3)[0x7d02d2b7ffb3]
[:04705] [19] /usr/bin/python[0x575ece]
[:04705] [20] /usr/bin/python[0x575c1c]
[:04705] [21] /usr/bin/python[0x59ecc5]
[:04705] [22] /usr/bin/python[0x575ece]
[:04705] [23] /usr/bin/python[0x575c1c]
[:04705] [24] /usr/bin/python[0x59ecc5]
[:04705] [25] /usr/bin/python[0x575ece]
[:04705] [26] /usr/bin/python[0x575c1c]
[:04705] [27] /usr/bin/python[0x59ecc5]
[:04705] [28] /usr/bin/python[0x5799c2]
[:04705] [29] /usr/bin/python[0x59eae9]
[:04705] *** End of error message ***
Additional notes
--tp_size 2 works, but it generates outputs that differ from the bf16 model and causes errors during sampling.
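For reference, the working-but-divergent tensor-parallel run mentioned above would be launched along these lines (the two-GPU device selection is an assumption; the original report only pins device 0):

```shell
# Workaround run reported above: TP=2 initializes successfully on B200, but
# generations then diverge from the bf16 baseline and sampling errors occur.
# The CUDA_VISIBLE_DEVICES value here is an assumption, not from the report.
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 2
```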
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
Low PrecisionLower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ).Lower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ).Pytorch<NV>Pytorch backend related issues<NV>Pytorch backend related issuesbugSomething isn't workingSomething isn't working