|
| 1 | +<!--- |
| 2 | +This markdown file has been generated from a script. Please do not edit it directly. |
| 3 | +For more details, check out: |
| 4 | +- the `generate.ts` script: https://github.com/huggingface/hub-docs/blob/main/scripts/inference-providers/scripts/generate.ts |
| 5 | +- the task template defining the sections in the page: https://github.com/huggingface/hub-docs/tree/main/scripts/inference-providers/templates/task/image-text-to-text.handlebars |
| 6 | +- the input jsonschema specifications used to generate the input markdown table: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/image-text-to-text/spec/input.json |
| 7 | +- the output jsonschema specifications used to generate the output markdown table: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/image-text-to-text/spec/output.json |
| 8 | +- the snippets used to generate the example: |
| 9 | + - curl: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/curl.ts |
| 10 | + - python: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/python.ts |
| 11 | + - javascript: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/js.ts |
| 12 | +- the "tasks" content for recommended models: https://huggingface.co/api/tasks |
| 13 | +---> |
| 14 | + |
| 15 | +## Image-Text to Text |
| 16 | + |
| 17 | +Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input. |
| 18 | + |
| 19 | +<Tip> |
| 20 | + |
| 21 | +For more details about the `image-text-to-text` task, check out its [dedicated page](https://huggingface.co/tasks/image-text-to-text)! You will find examples and related materials. |
| 22 | + |
| 23 | +</Tip> |
| 24 | + |
| 25 | +### Recommended models |
| 26 | + |
| 27 | +- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): Strong image-text-to-text model. |
| 28 | + |
| 29 | +Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-text-to-text&sort=trending). |
| 30 | + |
| 31 | +### Using the API |
| 32 | + |
| 33 | + |
| 34 | +<InferenceSnippet |
| 35 | + pipeline=image-text-to-text |
| 36 | + providersMapping={ {"hf-inference":{"modelId":"google/gemma-3-27b-it","providerModelId":"google/gemma-3-27b-it"},"hyperbolic":{"modelId":"Qwen/Qwen2.5-VL-7B-Instruct","providerModelId":"Qwen/Qwen2.5-VL-7B-Instruct"}} } |
| 37 | +/> |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | +### API specification |
| 42 | + |
| 43 | +For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification). |
0 commit comments