Vision language model support

Hello! 💗 When trying to run benchmarks on vision language models (image-text-to-text) I realized this library doesn't support this task. It would be nice to have a support for it since these models are almost as mainstream as LLMs.