Description
Feature Request: One-Click Containerized Workflow for Multi-Modal Model Deployment (ONNX→TRT + Triton Server)
Background
Production deployment of multi-modal models (e.g. Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper) is increasingly important in both research and industry. The recommended toolchain (ONNX→TensorRT→Triton Server) provides strong optimization, but the current deployment workflow requires many manual steps and environment configurations, causing friction, errors, and slow iteration. The lack of an integrated, streamlined container ecosystem for multi-modal models is a notable limitation.
Current Workarounds & Pain Points
Users currently rely on community scripts (e.g. onnx2trt), custom Docker images, and manual trial-and-error with conversion tools (trtexec, the TensorRT Python API, shell scripts) and config tuning. Pain points:
- Operator compatibility errors during ONNX→TRT conversion (unsupported ops such as OneHot, IsNaN)
- Dynamic shape/profile configuration complexity (a typical manual trtexec invocation is sketched below)
- Triton model repository config for multi-input/multi-output
- CUDA/cuDNN/TensorRT version mismatch and environment setup issues
- Lack of integrated validation tools for quantization workflow
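For context, a minimal sketch of what the dynamic-shape part of a manual conversion looks like today with trtexec (the flags are standard trtexec options; the tensor name pixel_values and the shape ranges are placeholders for whatever the exported model actually declares):

# Build an FP16 engine with one explicit optimization profile (names/shapes are placeholders)
trtexec \
  --onnx=model.onnx \
  --saveEngine=model.plan \
  --fp16 \
  --minShapes=pixel_values:1x3x224x224 \
  --optShapes=pixel_values:8x3x224x224 \
  --maxShapes=pixel_values:32x3x224x224

Every dynamic input needs its own min/opt/max triple, and a wrong choice typically only surfaces as an error at engine build or inference time.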
Problem Statement
The need for a production-grade container ecosystem for multi-modal models is urgent. Current practices require:
- Model export to ONNX format
- Manual ONNX→TensorRT conversion and optimization
- Manual setup and config of Triton Server model repository
- Environment preparation and troubleshooting across CUDA, drivers, libraries
This piecemeal approach, sketched below for a single model, slows research, impedes industry adoption, and frequently produces errors.
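As one example of the manual Triton step alone, a versioned model repository has to be laid out by hand and a config.pbtxt written per model before the server container can start. A rough sketch, assuming a TensorRT engine built as above (the model name, tensor names, dims, and Triton image tag are placeholders to adjust for your model and environment):

# Stage a versioned model repository for one TensorRT engine
mkdir -p model_repository/my_vision_model/1
cp model.plan model_repository/my_vision_model/1/model.plan

# Minimal hand-written config.pbtxt (names/dims are placeholders)
cat > model_repository/my_vision_model/config.pbtxt <<'EOF'
name: "my_vision_model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "image_embeds"
    data_type: TYPE_FP16
    dims: [ 768 ]
  }
]
EOF

# Launch Triton against the repository (pick the image tag matching your CUDA/TensorRT stack)
docker run --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models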
Feature Request
Can TensorRT and Triton Server officially support a one-click, containerized workflow for multi-modal model deployment? The proposed solution includes:
- Provide a standard Docker image or CLI tool that enables users to:
  - Specify the original model (Hugging Face, PyTorch, etc.), a config file, and input/output profiles
  - Automatically convert to ONNX, then to a TensorRT engine
  - Launch Triton Server in production-ready mode with optimized configs for multi-modal models
- Support dynamic shape profiles and auto-detection of model schemas (multi-input/multi-output)
- Optionally enable FP16/INT8 quantization and report accuracy/compatibility
- Built-in troubleshooting log outputs for operator compatibility and deployment errors
- Single command/compose file to automate conversion and deployment, reducing room for human error (a rough sketch of the steps such a wrapper would have to chain appears after this list)
- Documentation and integration with common model repositories and cloud runtimes
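To make the "single command" idea concrete: what such a wrapper would have to automate is essentially the chain of manual steps above. A rough shell sketch, assuming Hugging Face Optimum for the ONNX export step (openai/whisper-small is used purely as an illustration, and the exported ONNX file names depend on the exporter):

#!/usr/bin/env bash
set -euo pipefail

# 1. Export the model to ONNX (here via Hugging Face Optimum)
optimum-cli export onnx --model openai/whisper-small whisper_onnx/

# 2. Build a TensorRT engine from one of the exported graphs
trtexec --onnx=whisper_onnx/encoder_model.onnx --saveEngine=encoder.plan --fp16

# 3. Stage a Triton model repository and start the server (config.pbtxt as in the sketch above)
mkdir -p model_repository/whisper_encoder/1
cp encoder.plan model_repository/whisper_encoder/1/model.plan
docker run --rm --gpus=all -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models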
Environment
- Target Hardware: NVIDIA H100/H200, RTX 5090, B200, B300
- CUDA Version: 12.x (typical)
- Typical Models: HuggingFace Transformers (CV, NLP, S2T), Diffusion Models, Vision Transformers, Foundation Models
- Deployment Scale: Research prototyping through enterprise production
Benefits
- Greatly lowers friction for AI engineers and teams deploying modern multi-modal models
- Standardizes best practices for optimization and deployment
- Improves reproducibility and troubleshooting
- Accelerates model-to-production pipeline, especially as model/training practices evolve
Example Usage
trt-multimodal-deploy \
  --model black-forest-labs/FLUX.1-dev \
  --revision 357b94aee9d50ec88e5e6dd9550fd7f957cb1baa \
  --enable-triton \
  --config my_infer_config.yaml \
  --auto-profiles
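The contents of my_infer_config.yaml would be up to whoever designs the tool; a purely hypothetical sketch of the kind of fields it could carry (none of these keys exist today):

# Hypothetical config schema - every key below is illustrative, not an existing spec
cat > my_infer_config.yaml <<'EOF'
model: black-forest-labs/FLUX.1-dev
precision: fp16                  # or int8, with a pointer to calibration data
dynamic_shapes:
  pixel_values:
    min: [1, 3, 512, 512]
    opt: [4, 3, 1024, 1024]
    max: [8, 3, 1024, 1024]
triton:
  max_batch_size: 8
  instance_count: 1
EOF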
Or Docker Compose/one-shot CLI:
docker run --rm \
  -v "./models:/models" \
  trt-multimodal:latest \
  --auto-deploy --model-config
Community Impact
A robust, one-click containerized ecosystem will make multi-modal model deployment via ONNX→TensorRT→Triton Server both efficient and accessible, helping drive the next generation of AI research and industrial innovation.
If you know of an existing integrated official solution for this in TensorRT/Triton Server, please provide a link. If not, please consider adding it to the roadmap.
Related References
Production workflow: model → onnx → trt → triton server
Models: Wan2.2, Flux, Qwen Image, Qwen-VL, LLaVA, Whisper