
Multimodal Inference Engine (similar to vLLM‑Omni) for MLX #993

@Revive-Curiosity

Description

Hello MLX team,

I’d like to request support for a multimodal inference engine in MLX, similar to vLLM‑Omni. The goal is to enable efficient inference for models like snu-aidas/Dynin-Omni, which support multiple tasks across text, image, speech, and video:

  • t2t: text → text
  • i2t: image → text (Image Understanding)
  • s2t: speech → text (ASR)
  • t2i: text → image (Image Generation)
  • t2s: text → speech (TTS)
  • i2i: image → image (Image Editing)
  • v2t: video → text (Video Understanding)

MLX is currently excellent for text models, but multimodal models require additional operators (vision/audio/video encoders, diffusion decoders, spectrogram transforms, etc.). A unified inference engine would make MLX far more capable on Apple Silicon, providing native acceleration for multimodal workloads.
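To make the "spectrogram transforms" point concrete, here is a minimal sketch of a magnitude spectrogram in plain NumPy — the kind of audio front-end an s2t or t2s pipeline needs. The `stft_spectrogram` helper is hypothetical, not an existing MLX API; an MLX version would swap `numpy` for `mlx.core` arrays.

```python
import numpy as np

def stft_spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT (hypothetical helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half of the spectrum: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

# 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec.mean(axis=0).argmax()
print(spec.shape)           # (122, 257): frames x frequency bins
print(peak_bin * sr / 512)  # 437.5 — the FFT bin closest to 440 Hz
```

A real ASR front-end would add a mel filterbank and log compression on top of this, but the FFT and framing above are the operations that would benefit most from Metal acceleration.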

Request:

  • Provide MLX equivalents of common multimodal layers (ConvNets, spectrogram transforms, diffusion blocks).
  • Add pipeline wrappers for multimodal tasks (similar to Hugging Face pipelines).
  • Enable quantization and optimization across modalities for Apple Silicon.
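As a rough illustration of the pipeline-wrapper idea above, a unified engine could dispatch on the task codes listed earlier (t2t, i2t, s2t, ...). Everything in this sketch — the registry, the `register`/`pipeline` functions, and the placeholder backend — is hypothetical and not part of any existing MLX API:

```python
from typing import Callable, Dict

# Registry mapping task codes like "i2t" to backend callables (hypothetical).
_PIPELINES: Dict[str, Callable] = {}

def register(task: str):
    """Decorator that registers a backend under a modality-pair task code."""
    def wrap(fn: Callable) -> Callable:
        _PIPELINES[task] = fn
        return fn
    return wrap

def pipeline(task: str) -> Callable:
    """Look up the backend for a task code such as "t2t" or "s2t"."""
    try:
        return _PIPELINES[task]
    except KeyError:
        raise ValueError(f"no backend registered for task {task!r}")

@register("t2t")
def echo_text(prompt: str) -> str:
    # Placeholder backend; a real one would run an MLX language model.
    return prompt.upper()

print(pipeline("t2t")("hello"))  # HELLO
```

The appeal of this shape (mirroring Hugging Face's `pipeline()` factory) is that each modality pair can bring its own preprocessing and decoder while callers see one uniform entry point.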

This would allow MLX to serve as a true multimodal inference engine, unlocking models like Dynin‑Omni and future multimodal LLMs.

Thank you for considering this feature request!
