Hello MLX team,
I’d like to request support for a multimodal inference engine in MLX, similar to vLLM‑Omni. The goal is to enable efficient inference for models like snu-aidas/Dynin-Omni, which support multiple tasks across text, image, speech, and video:
- t2t: text → text
- i2t: image → text (Image Understanding)
- s2t: speech → text (ASR)
- t2i: text → image (Image Generation)
- t2s: text → speech (TTS)
- i2i: image → image (Image Editing)
- v2t: video → text (Video Understanding)
MLX is currently excellent for text models, but multimodal models require additional building blocks (vision/audio/video encoders, diffusion decoders, spectrogram transforms, etc.). A unified inference engine would make MLX considerably more capable on Apple Silicon, enabling native acceleration for multimodal workloads.
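To make the "additional ops" point concrete, here is a minimal sketch of one such audio front-end, a framed magnitude spectrogram, written in plain NumPy. It is illustrative only (the function name and parameters are my own choices, not an existing MLX or Whisper API); the request is for MLX-native, Metal-accelerated equivalents of ops like this.

```python
import numpy as np

def magnitude_spectrogram(signal: np.ndarray, n_fft: int = 400, hop: int = 160) -> np.ndarray:
    """Framed STFT magnitude spectrogram -- the kind of audio front-end
    (cf. Whisper-style ASR preprocessing) an MLX engine would need natively.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    frames = []
    # Slide a Hann-windowed frame across the signal and take its real FFT.
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(np.fft.rfft(signal[start:start + n_fft] * window))
    return np.abs(np.array(frames)).T
```

Today this round-trips through NumPy on the CPU; an MLX equivalent would let the whole s2t path stay on the GPU.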
Request:
- Provide MLX equivalents of common multimodal layers (ConvNets, spectrogram transforms, diffusion blocks).
- Add pipeline wrappers for multimodal tasks (similar to Hugging Face pipelines).
- Enable quantization and optimization across modalities for Apple Silicon.
This would allow MLX to serve as a true multimodal inference engine, unlocking models like Dynin‑Omni and future multimodal LLMs.
Thank you for considering this feature request!