Document uploading and referencing files on the web

We currently do it in a hacky way: we take the files and transcribe them. Then reasoning goes as if the user provided text.

We need to do the normal multi-modal inference.