
[Bug]: EXAONE 4.0 FP8 trtllm-serve failure #8221

@lkm2835

Description

System Info

NVIDIA B200
Ubuntu 24.04
NVIDIA Driver 580.65.06
TensorRT-LLM version: https://github.com/NVIDIA/TensorRT-LLM/tree/fba351a211021e345ef0e76a9439a81af0e7c785 (commit from Oct 6, 2025)
Checkpoint: https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B-FP8

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

trtllm-serve command

CUDA_VISIBLE_DEVICES=0 trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 1
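Since the crash below is an asynchronous CUDA error, a rerun with synchronous kernel launches can help localize the failing kernel (a sketch, following the hint printed in the error message itself; the checkpoint path is the same local directory as above):

```shell
# Sketch: CUDA_LAUNCH_BLOCKING=1 forces synchronous launches, so the
# illegal-memory-access error surfaces at the true call site.
export CUDA_LAUNCH_BLOCKING=1
CUDA_VISIBLE_DEVICES=0 trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 1
```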

Expected behavior

The FP8 model should be served successfully through trtllm-serve.

Actual behavior

Error message

[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[10/09/2025-08:03:24] [TRT-LLM] [I] Using LLM with PyTorch backend
[10/09/2025-08:03:24] [TRT-LLM] [I] Set nccl_plugin to None.
[10/09/2025-08:03:24] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[10/09/2025-08:03:24] [TRT-LLM] [I] Found quantization_config field in EXAONE-4.0-32B-FP8/config.json, pre-quantized checkpoint is used.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type exaone4 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
EXAONE-4.0-32B-FP8
rank 0 using MpiPoolSession to spawn MPI processes
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[10/09/2025-08:03:24] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
Multiple distributions found for package optimum. Picked distribution: optimum
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[10/09/2025-08:03:36] [TRT-LLM] [I] ATTENTION RUNTIME FEATURES:  AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192, chunked_prefill_buffer_batch_size=4)
EXAONE-4.0-32B-FP8
[10/09/2025-08:03:37] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[10/09/2025-08:03:37] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
`torch_dtype` is deprecated! Use `dtype` instead!
[10/09/2025-08:03:37] [TRT-LLM] [I] Use 30.79 GB for model weights.
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching 30.79GB checkpoint files.
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00006-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00002-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00004-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00003-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00007-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00001-of-00007.safetensors to memory...
[10/09/2025-08:03:37] [TRT-LLM] [I] Prefetching EXAONE-4.0-32B-FP8/model-00005-of-00007.safetensors to memory...
[10/09/2025-08:03:41] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00007-of-00007.safetensors.
[10/09/2025-08:03:43] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00004-of-00007.safetensors.
[10/09/2025-08:03:43] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00003-of-00007.safetensors.
[10/09/2025-08:03:44] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00002-of-00007.safetensors.
[10/09/2025-08:03:44] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00006-of-00007.safetensors.
[10/09/2025-08:03:45] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00001-of-00007.safetensors.
[10/09/2025-08:03:45] [TRT-LLM] [I] Finished prefetching EXAONE-4.0-32B-FP8/model-00005-of-00007.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 7/7 [00:00<00:00, 53.37it/s]
Loading weights: 100%|██████████| 1353/1353 [00:05<00:00, 245.54it/s]
Model init total -- 13.90s
[10/09/2025-08:03:51] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 131072
[10/09/2025-08:03:51] [TRT-LLM] [I] Using Sampler: TorchSampler
[10/09/2025-08:03:51] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8999999761581421 and 8224 with free memory 36.22554016113281 of total memory 44.587799072265625, respectively). The smaller value will be used.
[10/09/2025-08:03:51] [TRT-LLM] [W] Attention window size 131073 exceeds upper bound 8224 for available blocks. Reducing to 8224.
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted max_attention_window_vec to [8224]
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted window size 131073 to 8224 in blocks_per_window
[10/09/2025-08:03:51] [TRT-LLM] [W] Adjusted max_seq_len to 8224
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 4097 [window size=8224], tokens per block=32, primary blocks=257, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.01 GiB for max tokens in paged KV cache (8224).
[10/09/2025-08:03:51] [TRT-LLM] [I] max_seq_len=8224, max_num_requests=2048, max_num_tokens=8192, max_batch_size=2048
[10/09/2025-08:03:51] [TRT-LLM] [I] cache_transceiver is disabled
[10/09/2025-08:03:51] [TRT-LLM] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 118055936 bytes
[10/09/2025-08:04:05] [TRT-LLM] [I] [Autotuner] Autotuning process ends
[10/09/2025-08:04:05] [TRT-LLM] [E] Failed to initialize executor on rank 0: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[10/09/2025-08:04:05] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 371, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 69, in __init__
    self.setup_engine()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 187, in setup_engine
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 157, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 605, in create_py_executor
    py_executor = create_py_executor_instance(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 740, in create_py_executor_instance
    return PyExecutor(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 249, in __init__
    self.model_engine.warmup(self.resource_manager)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 432, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 676, in warmup
    self.forward(batch,
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 73, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2203, in forward
    outputs = self._forward_step(inputs, gather_ids,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2266, in _forward_step
    logits = self.model_forward(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2250, in model_forward
    return self.model.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 538, in forward
    output = self.model(
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 231, in forward
    hidden_states = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_exaone4.py", line 177, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/gated_mlp.py", line 148, in forward
    h1 = self.gate_up_proj(x)
         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1993, in forward
    output = self.apply_linear(input, self.bias, lora_params, layer_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 1946, in apply_linear
    output = self.quant_method.apply(self, input, bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 629, in apply
    output = torch.ops.trtllm.fp8_swap_ab_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1208, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 344, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 377, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_ops/torch_custom_ops.py", line 971, in fp8_swap_ab_gemm
    _, best_tactic = tuner.choose_one(
                     ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 603, in choose_one
    tensors = self._prepare_input_tensors(p, inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 929, in _prepare_input_tensors
    tensor = self._create_tensor_like(inputs[i], p)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/autotuner.py", line 921, in _create_tensor_like
    return torch.zeros(shapes, dtype=dtype, device=device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[10/09/2025-08:04:05] [TRT-LLM] [I] get signal from executor worker
[10/09/2025-08:04:05] [TRT-LLM] [E] Executor worker initialization error: Traceback (most recent call last):
  (traceback identical to the one above, ending in the same torch.AcceleratorError)

torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 384, in serve
    launch_server(host, port, llm_args, metadata_server_cfg, server_role)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
    llm = PyTorchLLM(**llm_args)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1098, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 983, in __init__
    super().__init__(model,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 228, in __init__
    self._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1042, in _build_model
    self._executor = self._executor_cls.create(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/executor.py", line 510, in create
    return GenerationExecutorProxy(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 107, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 341, in _start_executor_workers
    raise RuntimeError(
RuntimeError: Executor worker returned error

Exception ignored in: <function PyTorchModelEngine.__del__ at 0x7d01c6f61260>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 841, in __del__
    release_gc()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_utils.py", line 724, in release_gc
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
    torch._C._cuda_emptyCache()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaStreamDestroy(stream): an illegal memory access was encountered (/src/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaStream.h:128)
1       0x7d02ce65ba4b void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 139
2       0x7d02ce6dc2f2 std::_Sp_counted_ptr_inplace<tensorrt_llm::runtime::CudaStream, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 98
3       0x7d02d2aa7cc4 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 44
4       0x7d02ac8c9646 std::_Sp_counted_ptr_inplace<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheTransferManager, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 118
5       0x7d02d2aa7cc4 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 44
6       0x7d02ac8b0084 tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::~WindowBlockManager() + 372
7       0x7d02ac8c66fa tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::~KVCacheManager() + 282
8       0x7d02d2b7ffb3 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x118fb3) [0x7d02d2b7ffb3]
9             0x575ece /usr/bin/python() [0x575ece]
10            0x575c1c /usr/bin/python() [0x575c1c]
11            0x59ecc5 /usr/bin/python() [0x59ecc5]
12            0x575ece /usr/bin/python() [0x575ece]
13            0x575c1c /usr/bin/python() [0x575c1c]
14            0x59ecc5 /usr/bin/python() [0x59ecc5]
15            0x575ece /usr/bin/python() [0x575ece]
16            0x575c1c /usr/bin/python() [0x575c1c]
17            0x59ecc5 /usr/bin/python() [0x59ecc5]
18            0x5799c2 /usr/bin/python() [0x5799c2]
19            0x59eae9 /usr/bin/python() [0x59eae9]
20            0x558e61 /usr/bin/python() [0x558e61]
21            0x610215 /usr/bin/python() [0x610215]
22            0x610225 /usr/bin/python() [0x610225]
23            0x610225 /usr/bin/python() [0x610225]
24            0x610225 /usr/bin/python() [0x610225]
25            0x610225 /usr/bin/python() [0x610225]
26            0x5529d1 /usr/bin/python() [0x5529d1]
27            0x61cc7d /usr/bin/python() [0x61cc7d]
28            0x61bf41 /usr/bin/python() [0x61bf41]
29            0x5fbc63 /usr/bin/python() [0x5fbc63]
30            0x5e149f _PyEval_EvalFrameDefault + 46159
31            0x549d57 /usr/bin/python() [0x549d57]
32            0x54b5e3 PyObject_CallMethodObjArgs + 227
33            0x5fd3d5 PyImport_ImportModuleLevelObject + 917
34            0x5dbc9a _PyEval_EvalFrameDefault + 23626
35            0x5d500b PyEval_EvalCode + 347
36            0x5d2dfc /usr/bin/python() [0x5d2dfc]
37            0x581bcd /usr/bin/python() [0x581bcd]
38            0x5dad16 _PyEval_EvalFrameDefault + 19654
39            0x549d57 /usr/bin/python() [0x549d57]
40            0x54b5e3 PyObject_CallMethodObjArgs + 227
41            0x5fd3d5 PyImport_ImportModuleLevelObject + 917
42            0x5dbc9a _PyEval_EvalFrameDefault + 23626
43            0x5d500b PyEval_EvalCode + 347
44            0x5d2dfc /usr/bin/python() [0x5d2dfc]
45            0x581bcd /usr/bin/python() [0x581bcd]
46            0x5dad16 _PyEval_EvalFrameDefault + 19654
47            0x549d57 /usr/bin/python() [0x549d57]
48            0x54b5e3 PyObject_CallMethodObjArgs + 227
49            0x5fd3d5 PyImport_ImportModuleLevelObject + 917
50            0x5dbc9a _PyEval_EvalFrameDefault + 23626
51            0x5d500b PyEval_EvalCode + 347
52            0x5d2dfc /usr/bin/python() [0x5d2dfc]
53            0x581bcd /usr/bin/python() [0x581bcd]
54            0x5dad16 _PyEval_EvalFrameDefault + 19654
55            0x549d57 /usr/bin/python() [0x549d57]
56            0x54b5e3 PyObject_CallMethodObjArgs + 227
57            0x5fd3d5 PyImport_ImportModuleLevelObject + 917
58            0x5dbc9a _PyEval_EvalFrameDefault + 23626
59            0x6e1ad3 /usr/bin/python() [0x6e1ad3]
60            0x6b10ee Py_FinalizeEx + 78
61            0x6bcb01 Py_RunMain + 641
62            0x6bc71d Py_BytesMain + 45
63      0x7d05d73311ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7d05d73311ca]
64      0x7d05d733128b __libc_start_main + 139
65            0x6575a5 _start + 37
[:04705] *** Process received signal ***
[:04705] Signal: Aborted (6)
[:04705] Signal code:  (-6)
[:04705] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7d05d734c330]
[:04705] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d05d73a5b2c]
[:04705] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d05d734c27e]
[:04705] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d05d732f8ff]
[:04705] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7d05c0134ff5]
[:04705] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7d05c014a0da]
[:04705] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7d05c01348e6]
[:04705] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7d05c01498ba]
[:04705] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7d05cc051b06]
[:04705] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7d05cc0521f1]
[:04705] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x44)[0x7d05c014a384]
[:04705] [11] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZN12tensorrt_llm6common5checkI9cudaErrorEEvT_PKcS5_i+0xc7)[0x7d02ce65ba87]
[:04705] [12] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZNSt23_Sp_counted_ptr_inplaceIN12tensorrt_llm7runtime10CudaStreamESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x62)[0x7d02ce6dc2f2]
[:04705] [13] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x2c)[0x7d02d2aa7cc4]
[:04705] [14] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZNSt23_Sp_counted_ptr_inplaceIN12tensorrt_llm13batch_manager16kv_cache_manager22KVCacheTransferManagerESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x76)[0x7d02ac8c9646]
[:04705] [15] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x2c)[0x7d02d2aa7cc4]
[:04705] [16] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManagerD1Ev+0x174)[0x7d02ac8b0084]
[:04705] [17] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManagerD1Ev+0x11a)[0x7d02ac8c66fa]
[:04705] [18] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x118fb3)[0x7d02d2b7ffb3]
[:04705] [19] /usr/bin/python[0x575ece]
[:04705] [20] /usr/bin/python[0x575c1c]
[:04705] [21] /usr/bin/python[0x59ecc5]
[:04705] [22] /usr/bin/python[0x575ece]
[:04705] [23] /usr/bin/python[0x575c1c]
[:04705] [24] /usr/bin/python[0x59ecc5]
[:04705] [25] /usr/bin/python[0x575ece]
[:04705] [26] /usr/bin/python[0x575c1c]
[:04705] [27] /usr/bin/python[0x59ecc5]
[:04705] [28] /usr/bin/python[0x5799c2]
[:04705] [29] /usr/bin/python[0x59eae9]
[:04705] *** End of error message ***

Additional notes

Running with --tp_size 2 starts successfully; however, it generates different outputs from the BF16 model and causes errors during sampling.
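For comparison, this is the two-GPU invocation referred to above (a sketch; it assumes two visible devices on the same node):

```shell
# Sketch: the tensor-parallel-2 run that starts successfully, though its
# outputs differ from BF16 and sampling errors were observed.
CUDA_VISIBLE_DEVICES=0,1 trtllm-serve EXAONE-4.0-32B-FP8 --backend pytorch --tp_size 2
```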

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

Low Precision, Pytorch, bug
