Using ml-swift-lm with video - a few questions about the MediaProcessing pipeline #335

vade · 2026-01-15T19:40:24Z

vade
Jan 15, 2026

Hi

I'm the author of Fabric which is aiming to be a modern replacement for Quartz Composer - a content creation tool that made nodes and patching cool way before ComfyUI was a thing :)

I've integrated standard LLM calling by porting some of the code from the MLX-Swift examples which works great, but im trying to wrap my head around video understanding with VLLMs in the context of not having movie files to process (think live camera input, rendered content, processed content, etc)

In reviewing the code, in the MediaProcessing I see every video path assumes a fixed asset ( AVAsset ), which is also passed through to the prompting subsystem via UserInput, which has a VideoStruct, but i don't see a way to send in a set of frames I have in memory to be processed by the standard 'UserInput' or 'Prompt' methods into a VLLM?

I see that Image can take in a CIImage or existing MLArray which is awesome, but I'd love to know if it makes sense to expose an MLXArray, or array of CIImages as a sequence of fixed length to be processed by a vLLM as part of a UserInput ?

I do see ProcessedVideo as field on LMInput - am i right to assume the correct path would be something like injecting my own ProcessedVideo frames into a LMInput i get from the standard UserInput prompt - which would have video options enabled (since i have no asset to reference) ?

  try await modelContainer.perform { (context: ModelContext) -> Void in
//                print("LLM preparing user input")
                let lmInput = try await context.processor.prepare(input: userInput)
                
                lmInput.video = someFunctionToConvertMTLTexturesToProcessedVideo()
                
... continue with stream generation  and output as per example ?

vade · 2026-01-15T19:57:17Z

vade
Jan 15, 2026
Author

Oh, also, given that MLX is optimized for unified memory, does it make sense to provide IOSurface backed images - is there any advantage there from a client perspective there? I dont see any IOSurface backing options in the media pipeline in MLX, so my presumption is no?

3 replies

davidkoski Jan 15, 2026
Maintainer

Oh, also, given that MLX is optimized for unified memory, does it make sense to provide IOSurface backed images - is there any advantage there from a client perspective there? I dont see any IOSurface backing options in the media pipeline in MLX, so my presumption is no?

It depends what you want to do with the IOSurface. Certainly CIImage is a good way to encapsulate and use them for image processing. If you were to get frames from a video or a camera they are likely IOSurfaces under the hood and wrapped in a CIImage.

MLX doesn't have a great way to consume IOSurfaces yet.

vade Jan 15, 2026
Author

that YET is doing a lot of work haha :)

Most of my tooling has MTLTextures, which can of course be IOSurface backed, so thats good to know!

davidkoski Jan 15, 2026
Maintainer

ml-explore/mlx#2855 -- I think we are just waiting for the mlx-c API at this point

davidkoski · 2026-01-15T20:01:53Z

davidkoski
Jan 15, 2026
Maintainer

Largely the API that exists is what I had specific use cases for -- it can and should be extended. I think we need to consider the costs to current implementations and how we would consume these inputs.

For example, the Image case has a sort of canonical representation in asCIImage(). This allows VLMs to run any Image input through their pipeline by turning it into a CIImage, which the rest of the pipeline is expecting.

The image side does handle an array of Image inputs, which is almost what you are asking about, but I think most VLMs wouldn't understand that this is a temporal sequence (to whatever ability they have to handle that).

So having Video have something like case frames([Image]) might be a good way to approach it. I think then we need to think about the video processing pipeline and what the canonical representation looks like. It could be an array of frames, but perhaps keeping it in the video domain is useful. We can temporally subsample videos very efficiently. Another approach would be to just build some slightly higher level constructs that operated on Video to perform the operations we want.

We already have something close -- the asProcessedSequence call handles the temporal portion and the frameProcessor deals with a single frame as a CIImage. This could easily encapsulate iterating the .frames() case. If all the VLMs go through this then we are all set!

5 replies

vade Jan 15, 2026
Author

This makes sense to me (I think!)

when you say 'keep things in the video domain' is the presumption there that its a 'dense' sampling at video frame rates, and perhaps at native resolution? And the idea is some updated asProcessedSequence converts an array of VideoFrame objects into a sparsely samples set of ProcessedFrames at some sampling rate?

davidkoski Jan 15, 2026
Maintainer

Yes, exactly. I think it is good to be able to process the frames, e.g. for subsampling, in a streaming fashion.

If you want to make a PR to add this, go for it! Otherwise I can file an issue and work on it when I get a chance.

vade Jan 15, 2026
Author

Awesome, I'm happy to give it a go, and if you dont mind I may poke you with questions. Is it appropriate to open a Draft PR on the MLX repos to offer a place to discuss?

Thanks for your very prompt feedback.

davidkoski Jan 15, 2026
Maintainer

Yes, that would be great and happy to discuss either here or in the PR.

vade Jan 16, 2026
Author

Great, Ive opened a draft PR ml-explore/mlx-swift-lm#64 for the discussed feature.

vade · 2026-01-15T22:44:33Z

vade
Jan 15, 2026
Author

0 replies

Using ml-swift-lm with video - a few questions about the MediaProcessing pipeline #335

Uh oh!

Uh oh!

vade Jan 15, 2026

Replies: 3 comments · 8 replies

Uh oh!

vade Jan 15, 2026 Author

Uh oh!

davidkoski Jan 15, 2026 Maintainer

Uh oh!

vade Jan 15, 2026 Author

Uh oh!

davidkoski Jan 15, 2026 Maintainer

Uh oh!

davidkoski Jan 15, 2026 Maintainer

Uh oh!

vade Jan 15, 2026 Author

Uh oh!

davidkoski Jan 15, 2026 Maintainer

Uh oh!

vade Jan 15, 2026 Author

Uh oh!

davidkoski Jan 15, 2026 Maintainer

Uh oh!

vade Jan 16, 2026 Author

Uh oh!

vade Jan 15, 2026 Author

vade
Jan 15, 2026

Replies: 3 comments 8 replies

vade
Jan 15, 2026
Author

davidkoski Jan 15, 2026
Maintainer

vade Jan 15, 2026
Author

davidkoski Jan 15, 2026
Maintainer

davidkoski
Jan 15, 2026
Maintainer

vade Jan 15, 2026
Author

davidkoski Jan 15, 2026
Maintainer

vade Jan 15, 2026
Author

davidkoski Jan 15, 2026
Maintainer

vade Jan 16, 2026
Author

vade
Jan 15, 2026
Author