
Commit beafbcd

Add back "Image Text to Text" page (#1796)
* Add back Image Text to Text page
* format
* fix docs ?
1 parent: 9272ad5

File tree: 6 files changed, +70 −3 lines changed

docs/inference-providers/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -70,6 +70,8 @@
     title: Image Classification
   - local: tasks/image-segmentation
     title: Image Segmentation
+  - local: tasks/image-text-to-text
+    title: Image-Text to Text
   - local: tasks/image-to-image
     title: Image to Image
   - local: tasks/object-detection
```

docs/inference-providers/tasks/chat-completion.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -33,6 +33,8 @@ This is a subtask of [`text-generation`](https://huggingface.co/docs/inference-p
 
 - [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): Strong image-text-to-text model.
 
+Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-text-to-text&sort=trending).
+
 ### API Playground
 
 For Chat Completion models, we provide an interactive UI Playground for easier testing:
```
docs/inference-providers/tasks/image-text-to-text.md

Lines changed: 43 additions & 0 deletions

```diff
@@ -0,0 +1,43 @@
+<!---
+This markdown file has been generated from a script. Please do not edit it directly.
+For more details, check out:
+- the `generate.ts` script: https://github.com/huggingface/hub-docs/blob/main/scripts/inference-providers/scripts/generate.ts
+- the task template defining the sections in the page: https://github.com/huggingface/hub-docs/tree/main/scripts/inference-providers/templates/task/image-text-to-text.handlebars
+- the input jsonschema specifications used to generate the input markdown table: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/image-text-to-text/spec/input.json
+- the output jsonschema specifications used to generate the output markdown table: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/image-text-to-text/spec/output.json
+- the snippets used to generate the example:
+  - curl: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/curl.ts
+  - python: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/python.ts
+  - javascript: https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/snippets/js.ts
+- the "tasks" content for recommended models: https://huggingface.co/api/tasks
+--->
+
+## Image-Text to Text
+
+Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.
+
+<Tip>
+
+For more details about the `image-text-to-text` task, check out its [dedicated page](https://huggingface.co/tasks/image-text-to-text)! You will find examples and related materials.
+
+</Tip>
+
+### Recommended models
+
+- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): Strong image-text-to-text model.
+
+Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-text-to-text&sort=trending).
+
+### Using the API
+
+
+<InferenceSnippet
+    pipeline=image-text-to-text
+    providersMapping={ {"cerebras":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"llama-4-scout-17b-16e-instruct"},"cohere":{"modelId":"CohereLabs/aya-vision-8b","providerModelId":"c4ai-aya-vision-8b"},"featherless-ai":{"modelId":"mistralai/Mistral-Small-3.1-24B-Instruct-2503","providerModelId":"mistralai/Mistral-Small-3.1-24B-Instruct-2503"},"fireworks-ai":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"accounts/fireworks/models/llama4-scout-instruct-basic"},"groq":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"meta-llama/llama-4-scout-17b-16e-instruct"},"hf-inference":{"modelId":"google/gemma-3-27b-it","providerModelId":"google/gemma-3-27b-it"},"hyperbolic":{"modelId":"Qwen/Qwen2.5-VL-7B-Instruct","providerModelId":"Qwen/Qwen2.5-VL-7B-Instruct"},"nebius":{"modelId":"google/gemma-3-27b-it","providerModelId":"google/gemma-3-27b-it-fast"},"novita":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"meta-llama/llama-4-scout-17b-16e-instruct"},"nscale":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct"},"sambanova":{"modelId":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","providerModelId":"Llama-4-Maverick-17B-128E-Instruct"},"together":{"modelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct","providerModelId":"meta-llama/Llama-4-Scout-17B-16E-Instruct"}} }
+conversational />
+
+
+
+### API specification
+
+For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification).
```
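The new page routes this task through the chat-completion API rather than a dedicated endpoint. As a rough sketch (not part of this commit; the model ID is one of the page's recommended models, the image URL is a placeholder, and `build_image_text_request` is a hypothetical helper), a conversational image-text-to-text request pairs an image part with a text part inside a single user message, following the OpenAI-compatible message format the chat-completion docs describe:

```python
# Sketch of a chat-completion payload for an image-text-to-text (VLM) model.
# Assumption: the provider accepts the OpenAI-compatible multimodal message
# format, where "content" is a list of typed parts.

def build_image_text_request(model: str, image_url: str, prompt: str) -> dict:
    """Assemble a conversational request with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # The image the model should look at.
                    {"type": "image_url", "image_url": {"url": image_url}},
                    # The accompanying text prompt.
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

payload = build_image_text_request(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    image_url="https://example.com/cat.png",
    prompt="Describe this image in one sentence.",
)
```

The same payload shape is what the `<InferenceSnippet>` examples above serialize for each provider, modulo provider-specific model IDs.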

scripts/inference-providers/scripts/generate.ts

Lines changed: 0 additions & 3 deletions

```diff
@@ -768,9 +768,6 @@ async function renderTemplate(
 
   await Promise.all(
     TASKS_EXTENDED.map(async (task) => {
-      if (task === "image-text-to-text") {
-        return; // not generated -> merged with chat-completion
-      }
       // @ts-ignore
       const rendered = await renderTemplate(task, "task", DATA);
       await writeTaskDoc(task, rendered);
```
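The three deleted lines were the special case that skipped `image-text-to-text` during doc generation; with them gone, the task is rendered and written like every other extended task. A rough Python mirror of that loop (hypothetical names; the real script is TypeScript and uses async rendering):

```python
# Hypothetical Python equivalent of the generate.ts loop after this commit:
# every task in the extended list gets a rendered page, with no early return.

def generate_task_docs(tasks, render, write):
    """Render a doc page for each task and write it out; no task is skipped."""
    pages = {}
    for task in tasks:
        # Before this commit, "image-text-to-text" returned early here.
        rendered = render(task)
        write(task, rendered)
        pages[task] = rendered
    return pages

pages = generate_task_docs(
    ["chat-completion", "image-text-to-text"],
    render=lambda task: f"## {task} docs",
    write=lambda task, doc: None,  # stand-in for writing the file to disk
)
```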

scripts/inference-providers/templates/task/chat-completion.handlebars

Lines changed: 2 additions & 0 deletions

```diff
@@ -17,6 +17,8 @@ This is a subtask of [`text-generation`](https://huggingface.co/docs/inference-p
 - [{{this.id}}](https://huggingface.co/{{this.id}}): {{this.description}}
 {{/each}}
 
+{{{tips.listModelsLink.image-text-to-text}}}
+
 ### API Playground
 
 For Chat Completion models, we provide an interactive UI Playground for easier testing:
```
scripts/inference-providers/templates/task/image-text-to-text.handlebars

Lines changed: 21 additions & 0 deletions

```diff
@@ -0,0 +1,21 @@
+## Image-Text to Text
+
+Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.
+
+{{{tips.linksToTaskPage.image-text-to-text}}}
+
+### Recommended models
+
+{{#each recommendedModels.conversational-image-text-to-text}}
+- [{{this.id}}](https://huggingface.co/{{this.id}}): {{this.description}}
+{{/each}}
+
+{{{tips.listModelsLink.image-text-to-text}}}
+
+### Using the API
+
+{{{snippets.conversational-image-text-to-text}}}
+
+### API specification
+
+For the API specification of conversational image-text-to-text models, please refer to the [Chat Completion API documentation](https://huggingface.co/docs/inference-providers/tasks/chat-completion#api-specification).
```
