Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section covers backend support, accepted input formats, and the deployment patterns available for multimodal models.
Important
Security Requirement: Multimodal processing must be explicitly enabled at startup. See the relevant documentation for each backend for the necessary flags.
This prevents unintended processing of multimodal data from untrusted sources.
Backend guides:
- [vLLM Multimodal](multimodal_vllm.md)
- [TensorRT-LLM Multimodal](multimodal_trtllm.md)
- [SGLang Multimodal](multimodal_sglang.md)
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|---|---|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| TRT-LLM | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| SGLang | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP (PR #4668)
Pattern Key:
- EPD - All-in-one worker (Simple Aggregated)
- E/PD - Separate encode, combined prefill+decode
- E/P/D - All stages separate
- EP/D - Combined encode+prefill, separate decode
Status: ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
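For scripted checks (for example, validating a deployment config before launch), the matrix above can be transcribed into a small lookup table. This is only a sketch: `SUPPORT_MATRIX`, `status`, and `stacks_supporting` are hypothetical names, not part of Dynamo's API, and the values simply mirror the table as written here.

```python
# Hypothetical helper: the support matrix above, transcribed by hand.
# None of these names come from Dynamo itself.
SUPPORTED = "supported"
WIP = "wip"
EXPERIMENTAL = "experimental"
UNSUPPORTED = "unsupported"

SUPPORT_MATRIX = {
    "vLLM": {
        "E/PD": SUPPORTED, "E/P/D": SUPPORTED, "EP/D": SUPPORTED, "EPD": SUPPORTED,
        "Image": SUPPORTED, "Video": SUPPORTED, "Audio": EXPERIMENTAL,
    },
    "TRT-LLM": {
        # E/P/D is WIP: pre-computed embeddings only (see the footnote).
        "E/PD": UNSUPPORTED, "E/P/D": WIP, "EP/D": SUPPORTED, "EPD": SUPPORTED,
        "Image": SUPPORTED, "Video": UNSUPPORTED, "Audio": UNSUPPORTED,
    },
    "SGLang": {
        "E/PD": SUPPORTED, "E/P/D": SUPPORTED, "EP/D": UNSUPPORTED, "EPD": UNSUPPORTED,
        "Image": SUPPORTED, "Video": UNSUPPORTED, "Audio": UNSUPPORTED,
    },
}


def status(stack: str, feature: str) -> str:
    """Return the support status of a pattern or modality for a stack."""
    return SUPPORT_MATRIX[stack][feature]


def stacks_supporting(feature: str, include_partial: bool = False) -> list[str]:
    """List stacks that support a feature, optionally counting WIP/experimental."""
    accepted = {SUPPORTED}
    if include_partial:
        accepted |= {WIP, EXPERIMENTAL}
    return [stack for stack, row in SUPPORT_MATRIX.items() if row[feature] in accepted]
```

Keeping WIP and experimental entries distinct from fully supported ones lets a caller decide whether partial support is acceptable for a given environment.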
| Format | vLLM | TRT-LLM | SGLang |
|---|---|---|---|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ✅ | ❌ | ❌ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
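A client or gateway may want to reject an unsupported input format before sending a request. The sketch below classifies a media reference and checks it against the table above. It assumes that a pre-computed embedding is referenced by a path ending in `.pt`; how each backend actually receives that reference is described in the backend guides, and the function names here are hypothetical.

```python
from urllib.parse import urlparse

# Input-format support, transcribed from the table above (illustrative only).
FORMAT_SUPPORT = {
    "url": {"vLLM", "TRT-LLM", "SGLang"},
    "data_url": {"vLLM"},
    "embeddings": {"TRT-LLM"},
}


def classify_media_ref(ref: str) -> str:
    """Classify a media reference as 'url', 'data_url', or 'embeddings'."""
    scheme = urlparse(ref).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "data":
        return "data_url"
    # Assumption: pre-computed embeddings are identified by a .pt suffix.
    if ref.lower().endswith(".pt"):
        return "embeddings"
    raise ValueError(f"unrecognized media reference: {ref!r}")


def is_supported(stack: str, ref: str) -> bool:
    """Return True if the given stack accepts this kind of media reference."""
    return stack in FORMAT_SUPPORT[classify_media_ref(ref)]
```

For example, `is_supported("SGLang", "data:image/png;base64,AAAA")` evaluates to `False`, since only vLLM is listed as accepting Base64 data URLs.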
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
- Encoding: Is media encoding handled inline (within prefill) or by a separate Encode Worker?
  - Inline: Simpler setup; encoding happens in the prefill worker
  - Separate (E/PD, E/P/D): A dedicated encode worker transfers embeddings via NIXL (RDMA), enabling independent scaling
- Prefill/Decode: Are prefill and decode in the same worker or separate?
  - Aggregated: A single worker handles both prefill and decode
  - Disaggregated: Separate workers handle prefill and decode, with KV cache transfer between them
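The two choices above determine the pattern name from the Pattern Key. A minimal sketch of that mapping follows; the function name `deployment_pattern` is illustrative and not a Dynamo API.

```python
def deployment_pattern(separate_encode: bool, disaggregated: bool) -> str:
    """Map the two deployment dimensions to a pattern name.

    separate_encode: True if a dedicated Encode Worker is used.
    disaggregated:   True if prefill and decode run in separate workers.
    """
    if separate_encode:
        # Dedicated encode worker: E/P/D or E/PD.
        return "E/P/D" if disaggregated else "E/PD"
    # Inline encoding: EP/D or the all-in-one EPD worker.
    return "EP/D" if disaggregated else "EPD"
```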
These combine into four deployment patterns:
EPD (simple aggregated): All processing happens within a single worker, which makes this the simplest setup.
HTTP Frontend (Rust)
↓
Worker (Python)
↓ image load + encode + prefill + decode
Response
| Component | Purpose |
|---|---|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |
When to use: Quick setup, smaller models, development/testing.
E/PD: Encoding happens in a separate worker; prefill and decode share the same engine.
HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
| Component | Purpose |
|---|---|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |
When to use: Offloading vision encoding to a separate GPU, or scaling encode workers independently.
E/P/D: Full disaggregation, with separate workers for encoding, prefill, and decode. There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGLang
Prefill-first:
HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
OR
Decode-first:
HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
| Component | Purpose |
|---|---|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |
When to use: Maximum optimization, multi-node deployment, independent scaling of each phase.
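The difference between the two E/P/D variants is only the order of the last hops. The sketch below spells out the worker sequence per backend as plain data. It is a reading aid under the assumptions of the diagrams above, not Dynamo's routing code, and `e_p_d_hops` is a hypothetical name. TRT-LLM is left out because its E/P/D support is still WIP.

```python
# Which E/P/D variant each backend uses, as stated above.
E_P_D_VARIANT = {"vLLM": "prefill-first", "SGLang": "decode-first"}

# Hops shared by both variants.
COMMON_PREFIX = ["Frontend", "Processor", "Encode Worker"]


def e_p_d_hops(backend: str) -> list[str]:
    """Return the ordered list of components a request passes through."""
    variant = E_P_D_VARIANT[backend]  # KeyError for backends not listed above
    if variant == "prefill-first":
        tail = ["Prefill Worker", "Decode Worker"]
    else:
        # Decode-first: the decode worker bootstraps the prefill worker,
        # then receives the KV cache and generates tokens.
        tail = ["Decode Worker", "Prefill Worker", "Decode Worker"]
    return COMMON_PREFIX + tail
```

In the decode-first variant the Decode Worker appears twice: once to bootstrap the prefill worker and once for token generation.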
EP/D: Encoding is combined with prefill, and decode runs in a separate worker.
HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
| Component | Purpose |
|---|---|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |
Note: TRT-LLM's EP/D mode skips the Python Processor; the Rust frontend handles tokenization and routes directly to the Prefill worker. For multimodal requests, the Python prefill worker still re-tokenizes the prompt and builds its own inputs, so the token_ids produced by the Rust frontend are ignored.
When to use: Models without pre-computed embedding support (such as Llama 4), or TRT-LLM disaggregated deployments.
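The backend-specific difference in the EP/D path (the Python Processor is present for vLLM and skipped for TRT-LLM) can be summarized as follows. This is an illustrative sketch based only on the table and note above; `ep_d_hops` is a hypothetical name, and SGLang is rejected because the support matrix lists EP/D as unsupported for it.

```python
def ep_d_hops(backend: str) -> list[str]:
    """Return the ordered components for an EP/D request on a given backend."""
    hops = ["Frontend"]
    if backend == "vLLM":
        # vLLM routes through the Python Processor for tokenization
        # and media URL extraction.
        hops.append("Processor")
    elif backend != "TRT-LLM":
        # Per the support matrix, only vLLM and TRT-LLM support EP/D.
        raise ValueError(f"EP/D is not supported for backend {backend!r}")
    # TRT-LLM: the Rust frontend routes directly to the prefill worker.
    hops += ["Encode+Prefill Worker", "Decode Worker"]
    return hops
```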
You can find example workflows and reference implementations for deploying multimodal models in: