
Multimodal Inference Engine (similar to vLLM‑Omni) for MLX #993

@Revive-Curiosity

Description

Hello MLX team,

I’d like to request support for a multimodal inference engine in MLX, similar to vLLM‑Omni. The goal is to enable efficient inference for models like snu-aidas/Dynin-Omni, which support multiple tasks across text, image, speech, and video:

  • t2t: text → text
  • i2t: image → text (Image Understanding)
  • s2t: speech → text (ASR)
  • t2i: text → image (Image Generation)
  • t2s: text → speech (TTS)
  • i2i: image → image (Image Editing)
  • v2t: video → text (Video Understanding)

MLX is currently excellent for text models, but multimodal models require additional operators (vision/audio/video encoders, diffusion decoders, spectrogram transforms, etc.). A unified inference engine would make MLX far more capable on Apple Silicon, providing native acceleration for multimodal workloads.
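To make the "spectrogram transforms" point concrete, here is a minimal sketch of a magnitude spectrogram in plain NumPy — the kind of audio front-end an s2t or t2s pipeline needs. The `stft_spectrogram` helper is hypothetical, not an existing MLX API; an MLX version would swap `numpy` for `mlx.core` arrays.

```python
import numpy as np

def stft_spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT (hypothetical helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half of the spectrum: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

# 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec.mean(axis=0).argmax()
print(spec.shape)           # (122, 257): frames x frequency bins
print(peak_bin * sr / 512)  # 437.5 — the FFT bin closest to 440 Hz
```

A real ASR front-end would add a mel filterbank and log compression on top of this, but the FFT and framing above are the operations that would benefit most from Metal acceleration.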

Request:

  • Provide MLX equivalents of common multimodal layers (ConvNets, spectrogram transforms, diffusion blocks).
  • Add pipeline wrappers for multimodal tasks (similar to Hugging Face pipelines).
  • Enable quantization and optimization across modalities for Apple Silicon.
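As a rough illustration of the pipeline-wrapper idea above, a unified engine could dispatch on the task codes listed earlier (t2t, i2t, s2t, ...). Everything in this sketch — the registry, the `register`/`pipeline` functions, and the placeholder backend — is hypothetical and not part of any existing MLX API:

```python
from typing import Callable, Dict

# Registry mapping task codes like "i2t" to backend callables (hypothetical).
_PIPELINES: Dict[str, Callable] = {}

def register(task: str):
    """Decorator that registers a backend under a modality-pair task code."""
    def wrap(fn: Callable) -> Callable:
        _PIPELINES[task] = fn
        return fn
    return wrap

def pipeline(task: str) -> Callable:
    """Look up the backend for a task code such as "t2t" or "s2t"."""
    try:
        return _PIPELINES[task]
    except KeyError:
        raise ValueError(f"no backend registered for task {task!r}")

@register("t2t")
def echo_text(prompt: str) -> str:
    # Placeholder backend; a real one would run an MLX language model.
    return prompt.upper()

print(pipeline("t2t")("hello"))  # HELLO
```

The appeal of this shape (mirroring Hugging Face's `pipeline()` factory) is that each modality pair can bring its own preprocessing and decoder while callers see one uniform entry point.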

This would allow MLX to serve as a true multimodal inference engine, unlocking models like Dynin‑Omni and future multimodal LLMs.

Thank you for considering this feature request!
