One-Click Containerized Workflow for Multi-Modal Model Deployment (ONNX→TRT + Triton Server) #8498

@jxiaof

Description

Background

Production deployment of multi-modal models (e.g. Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper) is increasingly important in both research and industry. The recommended toolchain (ONNX→TensorRT→Triton Server) provides strong optimization, but the current deployment workflow involves many manual steps and environment configurations, causing friction, errors, and slow iteration. The lack of an integrated, streamlined container ecosystem for multi-modal models is a notable limitation.

Current Workarounds & Pain Points

Users currently rely on community scripts (e.g. onnx2trt), custom Docker images, and manual trial-and-error with conversion tools (trtexec, the TensorRT Python API, shell scripts) and config tuning. Pain points (the sketch after this list shows what this manual path typically looks like):

  • Operator compatibility errors during ONNX→TRT conversion (unsupported ops such as OneHot and IsNaN)
  • Dynamic shape/profile configuration complexity
  • Triton model repository configuration for multi-input/multi-output models
  • CUDA/cuDNN/TensorRT version mismatches and environment setup issues
  • Lack of integrated validation tools for the quantization workflow
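
For concreteness, the manual path behind these pain points often looks roughly like the sketch below; the tensor name (input_ids) and the shape values are placeholders, not taken from any specific model:

# Build a TensorRT engine from ONNX, spelling out dynamic-shape profiles by hand
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --fp16 \
        --minShapes=input_ids:1x8 \
        --optShapes=input_ids:4x128 \
        --maxShapes=input_ids:8x512

# Sanity-check the reduced-precision engine against ONNX Runtime
polygraphy run model.onnx --trt --onnxrt --fp16 --atol 1e-3 --rtol 1e-3

For a multi-modal pipeline, this has to be repeated (and kept in sync) for every sub-model, which is where much of the trial-and-error comes from.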

Problem Statement

The need for a production-grade container ecosystem for multi-modal models is urgent. Current practices require:

  • Model export to ONNX format
  • Manual ONNX→TensorRT conversion and optimization
  • Manual setup and configuration of the Triton Server model repository (see the repository sketch after this list)
  • Environment preparation and troubleshooting across CUDA, drivers, and libraries
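
As a concrete illustration of the third step, a hand-written Triton model repository for a single converted engine typically looks like the sketch below; the model name (flux_unet) and the tensor names, types, and shapes are hypothetical placeholders:

models/
└── flux_unet/
    ├── 1/
    │   └── model.plan
    └── config.pbtxt

# config.pbtxt
name: "flux_unet"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "latents"
    data_type: TYPE_FP16
    dims: [ 4, 64, 64 ]
  }
]
output [
  {
    name: "noise_pred"
    data_type: TYPE_FP16
    dims: [ 4, 64, 64 ]
  }
]

Every input/output block is written by hand today; auto-detecting this schema from the model is what item 4 of the feature request below asks for.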

This piecemeal approach slows research, impedes industry adoption, and frequently produces errors.

Feature Request

Can TensorRT and Triton Server officially support a one-click, containerized workflow for multi-modal model deployment? The proposed solution includes:

  • Provide a standard Docker image or CLI tool that enables users to:
    1. Specify the source model (Hugging Face Hub ID, PyTorch checkpoint, etc.), a config file, and input/output profiles
    2. Automatically convert to ONNX, then to TRT engine
    3. Launch Triton Server in production-ready mode with optimized configs for multi-modal models
    4. Support dynamic shape profiles and auto-detection of model schemas (multi-input/multi-output)
    5. Optionally enable FP16/INT8 quantization and report accuracy/compatibility
  • Built-in troubleshooting log outputs for operator compatibility and deployment errors
  • Single command/Compose file to automate conversion and deployment, reducing room for human error (see the Compose sketch after this list)
  • Documentation and integration with common model repositories and cloud runtimes
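
A minimal sketch of what that single Compose file could look like, pairing the hypothetical converter image from the usage example below with the real Triton Server image from NGC (pin a release tag that matches your driver/CUDA stack):

services:
  convert:
    image: trt-multimodal:latest          # hypothetical converter image
    volumes:
      - ./models:/models
    command: ["--auto-deploy", "--model-config", "/models/my_infer_config.yaml"]
  triton:
    image: nvcr.io/nvidia/tritonserver:24.08-py3
    depends_on:
      convert:
        condition: service_completed_successfully
    command: ["tritonserver", "--model-repository=/models"]
    volumes:
      - ./models:/models
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Prometheus metrics
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]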

Environment

  • Target Hardware: NVIDIA H100/H200, RTX 5090, B200, B300
  • CUDA Version: 12.x (typical)
  • Typical Models: Hugging Face Transformers (CV, NLP, speech-to-text), Diffusion Models, Vision Transformers, Foundation Models
  • Deployment Scale: Research prototyping through enterprise production

Benefits

  • Greatly lowers friction for AI engineers and teams deploying modern multi-modal models
  • Standardizes best practices for optimization and deployment
  • Improves reproducibility and troubleshooting
  • Accelerates model-to-production pipeline, especially as model/training practices evolve

Example Usage

trt-multimodal-deploy \
  --model black-forest-labs/FLUX.1-dev \
  --revision 357b94aee9d50ec88e5e6dd9550fd7f957cb1baa \
  --enable-triton \
  --config my_infer_config.yaml \
  --auto-profiles

Or via a one-shot docker run:

docker run --rm --gpus all \
  -v "./models:/models" \
  trt-multimodal:latest \
  --auto-deploy --model-config /models/my_infer_config.yaml
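
However the server is launched, the standard Triton HTTP endpoints (KServe v2 protocol) can verify the deployment:

# Readiness probe
curl -s localhost:8000/v2/health/ready

# Model metadata (flux_unet is the hypothetical model name from the repository sketch above)
curl -s localhost:8000/v2/models/flux_unet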

Community Impact

A robust, one-click containerized ecosystem will make multi-modal model deployment via ONNX→TensorRT→Triton Server both efficient and accessible, helping drive the next generation of AI research and industrial innovation.


If you know of an existing integrated official solution for this in TensorRT/Triton Server, please provide a link. If not, please consider adding it to the roadmap.

Related Reference

Production workflow: model → onnx → trt → triton server
Models: Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper
