Description
Feature Request: One-Click Containerized Workflow for Multi-Modal Model Deployment (ONNX→TRT + Triton Server)
Background
Production deployment of multi-modal models (e.g. Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper) is increasingly important in both research and industry. The recommended toolchain (ONNX→TensorRT→Triton Server) provides strong optimization, but the current deployment workflow requires many manual steps and environment configurations, causing friction, errors, and slow iteration. The lack of an integrated, streamlined container ecosystem for multi-modal models is a notable limitation.
Current Workarounds & Pain Points
Users currently rely on community scripts (e.g. onnx2trt), custom Docker images, and manual trial-and-error with conversion tools (trtexec, the TensorRT Python API, shell scripts) and config tuning. Pain points:
- Operator compatibility errors during ONNX→TRT conversion (unsupported ops such as OneHot, IsNaN)
- Dynamic shape/profile configuration complexity (a typical manual trtexec invocation is sketched below)
- Triton model repository config for multi-input/multi-output
- CUDA/cuDNN/TensorRT version mismatch and environment setup issues
- Lack of integrated validation tools for quantization workflow
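For context, a minimal sketch of what the dynamic-shape part of a manual conversion looks like today with trtexec (the flags are standard trtexec options; the tensor name pixel_values and the shape ranges are placeholders for whatever the exported model actually declares):

# Build an FP16 engine with one explicit optimization profile (names/shapes are placeholders)
trtexec \
  --onnx=model.onnx \
  --saveEngine=model.plan \
  --fp16 \
  --minShapes=pixel_values:1x3x224x224 \
  --optShapes=pixel_values:8x3x224x224 \
  --maxShapes=pixel_values:32x3x224x224

Every dynamic input needs its own min/opt/max triple, and a wrong choice typically only surfaces as an error at engine build or inference time.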
Problem Statement
The need for a production-grade container ecosystem for multi-modal models is urgent. Current practices require:
- Model export to ONNX format
- Manual ONNX→TensorRT conversion and optimization
- Manual setup and config of Triton Server model repository
- Environment preparation and troubleshooting across CUDA, drivers, libraries
This piecemeal approach, sketched below for a single model, slows research, impedes industry adoption, and frequently produces errors.
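As one example of the manual Triton step alone, a versioned model repository has to be laid out by hand and a config.pbtxt written per model before the server container can start. A rough sketch, assuming a TensorRT engine built as above (the model name, tensor names, dims, and Triton image tag are placeholders to adjust for your model and environment):

# Stage a versioned model repository for one TensorRT engine
mkdir -p model_repository/my_vision_model/1
cp model.plan model_repository/my_vision_model/1/model.plan

# Minimal hand-written config.pbtxt (names/dims are placeholders)
cat > model_repository/my_vision_model/config.pbtxt <<'EOF'
name: "my_vision_model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "image_embeds"
    data_type: TYPE_FP16
    dims: [ 768 ]
  }
]
EOF

# Launch Triton against the repository (pick the image tag matching your CUDA/TensorRT stack)
docker run --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models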
Feature Request
Can TensorRT and Triton Server officially support a one-click, containerized workflow for multi-modal model deployment? The proposed solution includes:
- Provide a standard Docker image or CLI tool that enables users to:
  - Specify the original model (Hugging Face, PyTorch, etc.), a config file, and input/output profiles
  - Automatically convert to ONNX, then to a TensorRT engine
  - Launch Triton Server in production-ready mode with optimized configs for multi-modal models
- Support dynamic shape profiles and auto-detection of model schemas (multi-input/multi-output)
- Optionally enable FP16/INT8 quantization and report accuracy/compatibility
- Built-in troubleshooting log outputs for operator compatibility and deployment errors
- Single command/compose file to automate conversion and deployment, reducing room for human error (a rough sketch of the steps such a wrapper would have to chain appears after this list)
- Documentation and integration with common model repositories and cloud runtimes
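To make the "single command" idea concrete: what such a wrapper would have to automate is essentially the chain of manual steps above. A rough shell sketch, assuming Hugging Face Optimum for the ONNX export step (openai/whisper-small is used purely as an illustration, and the exported ONNX file names depend on the exporter):

#!/usr/bin/env bash
set -euo pipefail

# 1. Export the model to ONNX (here via Hugging Face Optimum)
optimum-cli export onnx --model openai/whisper-small whisper_onnx/

# 2. Build a TensorRT engine from one of the exported graphs
trtexec --onnx=whisper_onnx/encoder_model.onnx --saveEngine=encoder.plan --fp16

# 3. Stage a Triton model repository and start the server (config.pbtxt as in the sketch above)
mkdir -p model_repository/whisper_encoder/1
cp encoder.plan model_repository/whisper_encoder/1/model.plan
docker run --rm --gpus=all -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models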
Environment
- Target Hardware: NVIDIA H100/H200, RTX 5090, B200, B300
- CUDA Version: 12.x (typical)
- Typical Models: HuggingFace Transformers (CV, NLP, S2T), Diffusion Models, Vision Transformers, Foundation Models
- Deployment Scale: Research prototyping through enterprise production
Benefits
- Greatly lowers friction for AI engineers and teams deploying modern multi-modal models
- Standardizes best practices for optimization and deployment
- Improves reproducibility and troubleshooting
- Accelerates model-to-production pipeline, especially as model/training practices evolve
Example Usage
trt-multimodal-deploy \
  --model black-forest-labs/FLUX.1-dev \
  --revision 357b94aee9d50ec88e5e6dd9550fd7f957cb1baa \
  --enable-triton \
  --config my_infer_config.yaml \
  --auto-profiles
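The contents of my_infer_config.yaml would be up to whoever designs the tool; a purely hypothetical sketch of the kind of fields it could carry (none of these keys exist today):

# Hypothetical config schema - every key below is illustrative, not an existing spec
cat > my_infer_config.yaml <<'EOF'
model: black-forest-labs/FLUX.1-dev
precision: fp16                  # or int8, with a pointer to calibration data
dynamic_shapes:
  pixel_values:
    min: [1, 3, 512, 512]
    opt: [4, 3, 1024, 1024]
    max: [8, 3, 1024, 1024]
triton:
  max_batch_size: 8
  instance_count: 1
EOF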
Or Docker Compose/one-shot CLI:
docker run --rm \
  -v "./models:/models" \
  trt-multimodal:latest \
  --auto-deploy --model-config
Community Impact
A robust, one-click containerized ecosystem will make multi-modal model deployment via ONNX→TensorRT→Triton Server both efficient and accessible, helping drive the next generation of AI research and industrial innovation.
If you know of an existing integrated official solution for this in TensorRT/Triton Server, please provide a link. If not, please consider adding it to the roadmap.
Related References
Production workflow: model → onnx → trt → triton server
Models: Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper