Multimodal model capabilties

This playground is about using image, text and audio as input for models to decide and reason about human input.

Vision interaction

Look at a single image

python inspect-image.py

Compare multiple images

python compare-images.py

launch the app

python voice-interaction/app.py