[Bug]: Segmentation fault on inference of batch of images #8651

@pythonjavaerlang

Description

System Info

CPU: x86_64
RAM: 144 GB
GPU: Nvidia L4, 24 GB
Libraries:
TensorRT-LLM version: 1.2.0rc0
Docker container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0.post1
Nvidia driver: 535.261.03-1

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 535.261.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:05:00.0 Off | 0 |
| N/A 69C P0 34W / 72W | 1MiB / 23034MiB | 3% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Python 3.12.3

pip3 show tensorrt_llm tensorrt torch
Name: tensorrt_llm
Version: 1.2.0rc0
Name: tensorrt
Version: 10.11.0.33
Name: torch
Version: 2.7.1

Log:

/code/trtllm_env/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/code/trtllm_env/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[10/24/2025-10:10:16] [TRT-LLM] [I] Loaded max_seq_len from engine config: 3072
[10/24/2025-10:10:16] [TRT-LLM] [I] Input token length: 147
[10/24/2025-10:10:16] [TRT-LLM] [I] Max sequence length: 3072
[10/24/2025-10:10:16] [TRT-LLM] [I] Safety margin: 50
[10/24/2025-10:10:16] [TRT-LLM] [I] Calculated max_new_tokens: 2875
[10/24/2025-10:10:16] [TRT-LLM] [I] Total if generated: 147 + 2875 = 3022 (limit: 3072)
[10/24/2025-10:10:16] [TRT-LLM] [I] Processing batch of 2 images
[10/24/2025-10:10:16] [TRT-LLM] [I] Input tokens per image: ~73
[10/24/2025-10:10:16] [TRT-LLM] [I] Max new tokens: 450
[TensorRT-LLM][INFO] Engine version 1.2.0rc0 found in the config file, assuming engine(s) built by new builder API.
[10/24/2025-10:10:18] [TRT-LLM] [I] Loading engine from /code/tensorrt_llm/tmp/trt_engines/Qwen2-VL-7B-Instruct/fp16/1-gpu/vision/model.engine
[10/24/2025-10:10:19] [TRT-LLM] [I] Creating session from engine /code/tensorrt_llm/tmp/trt_engines/Qwen2-VL-7B-Instruct/fp16/1-gpu/vision/model.engine
[10/24/2025-10:10:19] [TRT] [I] Loaded engine size: 1303 MiB
[10/24/2025-10:10:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +237, now: CPU 0, GPU 1530 (MiB)
[10/24/2025-10:10:20] [TRT-LLM] [I] Running LLM with C++ runner
[TensorRT-LLM][INFO] Engine version 1.2.0rc0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[10/24/2025-10:10:20] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[10/24/2025-10:10:20] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2_vl
[10/24/2025-10:10:20] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[10/24/2025-10:10:20] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[10/24/2025-10:10:20] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[10/24/2025-10:10:20] [TRT-LLM] [I] Set dtype to float16.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set gemm_plugin to float16.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set identity_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set nccl_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set lora_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set dora_plugin to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set smooth_quant_plugins to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set moe_plugin to auto.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set context_fmha to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set paged_kv_cache to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set remove_input_padding to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set norm_quant_fusion to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set reduce_fusion to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set user_buffer to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set tokens_per_block to 128.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set multiple_profiles to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set paged_state to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set streamingllm to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set manage_weights to False.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set use_fused_mlp to True.
[10/24/2025-10:10:20] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[TensorRT-LLM][INFO] Engine version 1.2.0rc0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 4096 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] WARNING The logger passed into createInferRuntime differs from one already registered for an existing builder, runtime, or refitter. So the current new logger is ignored, and TensorRT will use the existing one which is returned by nvinfer1::getLogger() instead.
[TensorRT-LLM][INFO] Loaded engine size: 14549 MiB
[TensorRT-LLM][INFO] Engine load time 15395 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1000.03 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 16071 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 66.54 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.78 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 21.96 GiB, available: 4.96 GiB
[TensorRT-LLM][INFO] Blocks per window size:
[TensorRT-LLM][INFO] [windowSize=4096] {.primaryBlocks=617, .secondayBlocks=0}
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 32 [window size=4096], tokens per block=128, primary blocks=617, secondary blocks=0
[TensorRT-LLM][INFO] Number of tokens per block: 128.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.22 GiB for max tokens in paged KV cache (78976).
[TensorRT-LLM][INFO] CacheTransceiver is disabled.
[10/24/2025-10:10:36] [TRT-LLM] [I] Load engine takes: 16.353443145751953 sec
Inside preprocess: image_grid_thw shape: torch.Size([2, 3])
image_grid_thw[0]: tensor([ 1, 36, 36], device='cuda:0')
stop_words_list before: None
stop_words_list after: [None, None]
[xentime:31732:0:31936] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5d)
==== backtrace (tid:  31936) ====
 0  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f09577e4774]
 1  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x3796a) [0x7f09577e496a]
 2  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x37ba8) [0x7f09577e4ba8]
 3  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager19PromptTuningBuffers28initializeChunkPtableBuffersERKNS_7runtime13BufferManagerERKNS2_11ModelConfigEiRKSt10shared_ptrINS0_10LlmRequestEE+0x14) [0x7f06bbfb7c54]
 4  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching30remapInputTokensForPromptTableERKSt10shared_ptrINS0_10LlmRequestEEbii+0x1d0) [0x7f06bbfdde10]
 5  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching28prefetchNextPromptTableChunkERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EEbi+0x14d) [0x7f06bbfde08d]
 6  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching11executeStepERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EES9_i+0x814) [0x7f06bbfdf154]
 7  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12executeBatchERKNS0_17ScheduledRequestsE+0xde) [0x7f06bbfdf73e]
 8  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKNSt7__cxx114listISt10shared_ptrINS0_10LlmRequestEESaIS6_EEE+0x847) [0x7f06bbfee5d7]
 9  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl12forwardAsyncERNSt7__cxx114listISt10shared_ptrINS_13batch_manager10LlmRequestEESaIS8_EEE+0x1bc) [0x7f06bc14171c]
10  /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x60e) [0x7f06bc148fde]
11  /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f09cae6edb4]
12  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f0a320b3aa4]
13  /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44) [0x7f0a32140a64]
=================================
[xentime:31732] *** Process received signal ***
[xentime:31732] Signal: Segmentation fault (11)
[xentime:31732] Signal code:  (-6)
[xentime:31732] Failing at address: 0x7bf4
[xentime:31732] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f0a3205c330]
[xentime:31732] [ 1] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager19PromptTuningBuffers28initializeChunkPtableBuffersERKNS_7runtime13BufferManagerERKNS2_11ModelConfigEiRKSt10shared_ptrINS0_10LlmRequestEE+0x14)[0x7f06bbfb7c54]
[xentime:31732] [ 2] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching30remapInputTokensForPromptTableERKSt10shared_ptrINS0_10LlmRequestEEbii+0x1d0)[0x7f06bbfdde10]
[xentime:31732] [ 3] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching28prefetchNextPromptTableChunkERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EEbi+0x14d)[0x7f06bbfde08d]
[xentime:31732] [ 4] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching11executeStepERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EES9_i+0x814)[0x7f06bbfdf154]
[xentime:31732] [ 5] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12executeBatchERKNS0_17ScheduledRequestsE+0xde)[0x7f06bbfdf73e]
[xentime:31732] [ 6] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKNSt7__cxx114listISt10shared_ptrINS0_10LlmRequestEESaIS6_EEE+0x847)[0x7f06bbfee5d7]
[xentime:31732] [ 7] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl12forwardAsyncERNSt7__cxx114listISt10shared_ptrINS_13batch_manager10LlmRequestEESaIS8_EEE+0x1bc)[0x7f06bc14171c]
[xentime:31732] [ 8] /code/trtllm_env/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x60e)[0x7f06bc148fde]
[xentime:31732] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4)[0x7f09cae6edb4]
[xentime:31732] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f0a320b3aa4]
[xentime:31732] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f0a32140a64]
[xentime:31732] *** End of error message ***
Segmentation fault

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code:

from argparse import Namespace

from PIL import Image
from tensorrt_llm import logger

AI_ENGINE_DIR = "/code/tensorrt_llm/tmp/trt_engines/Qwen2-VL-7B-Instruct/fp16/1-gpu"
AI_HF_MODEL_DIR = "/code/tensorrt_llm/Qwen2-VL-7B-Instruct"

AI_MODEL_ARGS = {
    # Required paths
    "engine_dir": AI_ENGINE_DIR,
    "hf_model_dir": AI_HF_MODEL_DIR,

    # Engine names
    "visual_engine_name": "model.engine",
    "audio_engine_name": "model.engine",

    # BATCH PROCESSING - Key change
    "batch_size": 2,  # Process 2 images at once (optimal for L4)

    # Generation parameters
    "max_new_tokens": 450,
    "num_beams": 1,
    "top_k": 5,
    "top_p": 0.95,
    "temperature": 0.7,
    "repetition_penalty": 1.0,

    # Input parameters
    "input_text": None,
    "image_path": None,
    "video_path": None,
    "video_num_frames": None,
    "audio_path": None,
    "path_sep": ",",
    "prompt_sep": ",",
    # Session and memory configuration
    "session": "cpp_llm_only",
    "kv_cache_free_gpu_memory_fraction": 0.85,  # Increased for batching
    "cross_kv_cache_fraction": 0.6,
    "multi_block_mode": True,

    # Feature flags
    "mm_embedding_offloading": None,
    "enable_context_fmha_fp32_acc": False,
    "enable_chunked_context": True,  # Helps with batching

    # Profiling and debugging
    "run_profiling": False,
    "profiling_iterations": 20,
    "check_accuracy": False,
    "debug_mode": False,

    # LoRA
    "lora_task_uids": None,

    # Logging
    "log_level": "info",

    # Token calculation parameters
    "safety_margin": 50,
    "min_new_tokens": 400,
    "force_max_seq_len": None,
}
AI_DEFAULT_PROMPT = (
            "Analyze this image and return JSON with: "
            "title (50-60 chars), description (150-300 chars), long_description (brief, 100-150 words), "
            "keywords (10 terms), primary_subject, secondary_subjects (max 3), "
            "colors (top 4), mood, category, subcategories (max 2), "
            "composition, lighting, setting, alt_text (125 chars), seo_title, "
            "meta_description (150-160 chars), "
            "target_keywords {primary: 3 terms, secondary: 3 terms, long_tail: 2 phrases}, "
            "use_cases (3 max), similar_concepts (3 max). "
            "Output valid JSON only."
)

def test_batch_processing():
    """Test batch processing with multiple images."""
    from tensorrt_llm.runtime import MultimodalModelRunner
    from transformers import AutoTokenizer, Qwen2VLProcessor
    import time

    logger.set_level("info")

    # Multiple images to process
    image_paths = [
        "/code/light/ai/P4280401.webp",
        "/code/light/10.jpg"
    ]

    # Load images
    images = []
    for image_path in image_paths:
        image = Image.open(image_path).convert("RGB")
        images.append(image.resize((504, 504)))

    # Same prompt for all (or different prompts per image)
    prompts = [AI_DEFAULT_PROMPT] * len(images)

    # Process inputs
    processor = Qwen2VLProcessor.from_pretrained(AI_MODEL_ARGS["hf_model_dir"])
    tokenizer = AutoTokenizer.from_pretrained(AI_MODEL_ARGS["hf_model_dir"])
    inputs = processor(text=prompts, images=images, return_tensors="pt")

    # Calculate max_new_tokens (get_max_seq_len and calculate_max_new_tokens
    # are user-defined helpers, not shown here)
    max_seq_len = get_max_seq_len(AI_MODEL_ARGS["engine_dir"])
    max_new_tokens = min(
        calculate_max_new_tokens(
            inputs.input_ids, max_seq_len,
            AI_MODEL_ARGS["safety_margin"],
            AI_MODEL_ARGS["min_new_tokens"]
        ),
        450
    )
    logger.info(f"Processing batch of {len(images)} images")
    logger.info(f"Input tokens per image: ~{len(inputs.input_ids[0]) // len(images)}")
    logger.info(f"Max new tokens: {max_new_tokens}")

    # Initialize model
    model = MultimodalModelRunner(Namespace(**AI_MODEL_ARGS))

    # Load visual data for all images
    visual_data = images

    # Run batch inference
    start_time = time.time()
    input_text, output_text = model.run(prompts, visual_data, None, max_new_tokens)
    total_time = time.time() - start_time

    logger.info(f"\n=== Batch Processing Results ===")
    logger.info(f"Total time: {total_time:.2f}s")
    logger.info(f"Time per image: {total_time / len(images):.2f}s")
    logger.info(f"Throughput: {len(images) / total_time:.2f} images/sec")

    # Process each result
    for i, result in enumerate(output_text):
        print(f"\n--- Image {i+1} ({image_paths[i]}) ---")
        print(result[0])
        print("...")
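For completeness, `get_max_seq_len` and `calculate_max_new_tokens` above are user-defined helpers, not TensorRT-LLM APIs. A minimal sketch consistent with the log output (3072 − 147 − 50 = 2875) might look like the following; the engine `config.json` layout assumed here is an assumption, not a documented contract:

```python
import json
import os


def get_max_seq_len(engine_dir: str) -> int:
    """Read max_seq_len from the engine's config.json (assumed layout)."""
    with open(os.path.join(engine_dir, "config.json")) as f:
        config = json.load(f)
    return config["build_config"]["max_seq_len"]


def calculate_max_new_tokens(input_ids, max_seq_len: int,
                             safety_margin: int, min_new_tokens: int) -> int:
    """Budget generation length: whatever remains of the sequence after
    the prompt and a safety margin, but at least min_new_tokens."""
    input_len = input_ids.shape[-1]
    remaining = max_seq_len - input_len - safety_margin
    return max(remaining, min_new_tokens)
```

With the values from the log (input length 147, max_seq_len 3072, safety margin 50) this reproduces the reported 2875, which the script then clamps to 450 via `min(..., 450)`.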

Expected behavior

It should return text for each image, as expected from the Qwen2-VL model.

actual behavior

Segmentation fault

additional notes

The TensorRT-LLM code looks extremely unstable; in many cases even lint checks appear not to have been performed.
What is the recommended way of performing inference?
Are we not supposed to use TensorRT-LLM in production yet?
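The backtrace lands in `PromptTuningBuffers::initializeChunkPtableBuffers`, reached via `prefetchNextPromptTableChunk`, i.e. the prompt-table chunking path that is only exercised when chunked context is enabled. As an untested workaround (an assumption on my part, not a confirmed fix), one could try re-running with chunked context disabled, and with `batch_size` 1 as a second fallback:

```python
def make_workaround_args(model_args: dict) -> dict:
    """Return a copy of the runner args with chunked context disabled,
    to sidestep the chunked prompt-table code path seen in the backtrace."""
    return {**model_args, "enable_chunked_context": False}
```

Usage: `model = MultimodalModelRunner(Namespace(**make_workaround_args(AI_MODEL_ARGS)))`.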

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

    Labels

    Inference runtime<NV>, Multimodal, bug, stale, waiting for feedback
