We currently do it in a hacky way: we take the files and transcribe them. Then reasoning goes as if the user provided text. We need to do the normal multi-modal inference.