Commit 0bcbf66

[Inference API] Add image-text-to-text task and fix generate script (#1440)
* Add image-text-to-text task and fix generate script
* Run generate script
* fix chat-completion and image-text-to-text docs
* fix typo
* Fix chat completion package reference links
* regenerate inference api docs
1 parent 6a102ac commit 0bcbf66

24 files changed, +332 -86 lines changed

docs/api-inference/_toctree.yml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,8 @@
3030
title: Image Segmentation
3131
- local: tasks/image-to-image
3232
title: Image to Image
33+
- local: tasks/image-text-to-text
34+
title: Image-Text to Text
3335
- local: tasks/object-detection
3436
title: Object Detection
3537
- local: tasks/question-answering

docs/api-inference/tasks/audio-classification.md

Lines changed: 5 additions & 5 deletions
@@ -29,8 +29,9 @@ For more details about the `audio-classification` task, check out its [dedicated
2929

3030
### Recommended models
3131

32+
- [ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition](https://huggingface.co/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition): An emotion recognition model.
3233

33-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=audio-classification&sort=trending).
34+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=audio-classification&sort=trending).
3435

3536
### Using the API
3637

@@ -39,19 +40,18 @@ This is only a subset of the supported models. Find the model that suits you bes
3940

4041
<curl>
4142
```bash
42-
curl https://api-inference.huggingface.co/models/<REPO_ID> \
43+
curl https://api-inference.huggingface.co/models/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition \
4344
-X POST \
4445
--data-binary '@sample1.flac' \
4546
-H "Authorization: Bearer hf_***"
46-
4747
```
4848
</curl>
4949

5050
<python>
5151
```py
5252
import requests
5353

54-
API_URL = "https://api-inference.huggingface.co/models/<REPO_ID>"
54+
API_URL = "https://api-inference.huggingface.co/models/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
5555
headers = {"Authorization": "Bearer hf_***"}
5656

5757
def query(filename):
@@ -71,7 +71,7 @@ To use the Python client, see `huggingface_hub`'s [package reference](https://hu
7171
async function query(filename) {
7272
const data = fs.readFileSync(filename);
7373
const response = await fetch(
74-
"https://api-inference.huggingface.co/models/<REPO_ID>",
74+
"https://api-inference.huggingface.co/models/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
7575
{
7676
headers: {
7777
Authorization: "Bearer hf_***"

docs/api-inference/tasks/automatic-speech-recognition.md

Lines changed: 2 additions & 3 deletions
@@ -32,7 +32,7 @@ For more details about the `automatic-speech-recognition` task, check out its [d
3232
- [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3): A powerful ASR model by OpenAI.
3333
- [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1): Powerful speaker diarization model.
3434

35-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=automatic-speech-recognition&sort=trending).
35+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=automatic-speech-recognition&sort=trending).
3636

3737
### Using the API
3838

@@ -45,7 +45,6 @@ curl https://api-inference.huggingface.co/models/openai/whisper-large-v3 \
4545
-X POST \
4646
--data-binary '@sample1.flac' \
4747
-H "Authorization: Bearer hf_***"
48-
4948
```
5049
</curl>
5150

@@ -108,7 +107,7 @@ To use the JavaScript client, see `huggingface.js`'s [package reference](https:/
108107
| **inputs*** | _string_ | The input audio data as a base64-encoded string. If no `parameters` are provided, you can also provide the audio data as a raw bytes payload. |
109108
| **parameters** | _object_ | Additional inference parameters for Automatic Speech Recognition |
110109
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return_timestamps** | _boolean_ | Whether to output corresponding timestamps with the generated text |
111-
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;generate** | _object_ | Ad-hoc parametrization of the text generation process |
110+
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;generation_parameters** | _object_ | Ad-hoc parametrization of the text generation process |
112111
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;temperature** | _number_ | The value used to modulate the next token probabilities. |
113112
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;top_k** | _integer_ | The number of highest probability vocabulary tokens to keep for top-k-filtering. |
114113
| **&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;top_p** | _number_ | If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. |
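
To make the renamed `generation_parameters` field concrete, here is a minimal sketch of how a JSON request body matching the rows shown in this hunk might be assembled. The helper name `build_asr_payload` is hypothetical and not part of `huggingface_hub`; only the field names (`inputs`, `parameters`, `return_timestamps`, `generation_parameters`, `temperature`, `top_k`, `top_p`) come from the table, and the nesting is assumed from the table's indentation.

```python
import base64
from typing import Any, Dict, Optional


def build_asr_payload(
    audio_bytes: bytes,
    return_timestamps: Optional[bool] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a JSON body that follows the parameter table."""
    # `inputs` carries the audio as a base64-encoded string.
    payload: Dict[str, Any] = {
        "inputs": base64.b64encode(audio_bytes).decode("ascii")
    }

    # Keep only the generation options the caller actually set.
    generation = {
        name: value
        for name, value in (
            ("temperature", temperature),
            ("top_k", top_k),
            ("top_p", top_p),
        )
        if value is not None
    }

    parameters: Dict[str, Any] = {}
    if return_timestamps is not None:
        parameters["return_timestamps"] = return_timestamps
    if generation:
        # Assumed nesting: generation options live under `generation_parameters`.
        parameters["generation_parameters"] = generation
    if parameters:
        payload["parameters"] = parameters
    return payload
```

When no optional argument is set, the sketch omits `parameters` entirely, which corresponds to the case where the table says a raw bytes payload is also accepted.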

docs/api-inference/tasks/chat-completion.md

Lines changed: 97 additions & 9 deletions
@@ -14,20 +14,23 @@ For more details, check out:
1414

1515
## Chat Completion
1616

17-
Generate a response given a list of messages.
18-
This is a subtask of [`text-generation`](./text_generation) designed to generate responses in a conversational context.
19-
20-
17+
Generate a response given a list of messages in a conversational context, supporting both conversational Language Models (LLMs) and conversational Vision-Language Models (VLMs).
18+
This is a subtask of [`text-generation`](https://huggingface.co/docs/api-inference/tasks/text-generation) and [`image-text-to-text`](https://huggingface.co/docs/api-inference/tasks/image-text-to-text).
2119

2220
### Recommended models
2321

22+
#### Conversational Large Language Models (LLMs)
23+
2424
- [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it): A text-generation model trained to follow instructions.
2525
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct): Very powerful text generation model trained to follow instructions.
2626
- [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct): Small yet powerful text generation model.
2727
- [HuggingFaceH4/starchat2-15b-v0.1](https://huggingface.co/HuggingFaceH4/starchat2-15b-v0.1): Strong coding assistant model.
2828
- [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407): Very strong open-source large language model.
2929

30+
#### Conversational Vision-Language Models (VLMs)
3031

32+
- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct): Powerful vision language model with great visual understanding and reasoning capabilities.
33+
- [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct): Strong image-text-to-text model.
3134

3235
### Using the API
3336

@@ -37,6 +40,8 @@ The API supports:
3740
* Using grammars, constraints, and tools.
3841
* Streaming the output
3942

43+
#### Code snippet example for conversational LLMs
44+
4045

4146
<inferencesnippet>
4247

@@ -59,18 +64,15 @@ curl 'https://api-inference.huggingface.co/models/google/gemma-2-2b-it/v1/chat/c
5964
```py
6065
from huggingface_hub import InferenceClient
6166

62-
client = InferenceClient(
63-
"google/gemma-2-2b-it",
64-
token="hf_***",
65-
)
67+
client = InferenceClient(api_key="hf_***")
6668

6769
for message in client.chat_completion(
70+
model="google/gemma-2-2b-it",
6871
messages=[{"role": "user", "content": "What is the capital of France?"}],
6972
max_tokens=500,
7073
stream=True,
7174
):
7275
print(message.choices[0].delta.content, end="")
73-
7476
```
7577

7678
To use the Python client, see `huggingface_hub`'s [package reference](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion).
@@ -89,7 +91,93 @@ for await (const chunk of inference.chatCompletionStream({
8991
})) {
9092
process.stdout.write(chunk.choices[0]?.delta?.content || "");
9193
}
94+
```
95+
96+
To use the JavaScript client, see `huggingface.js`'s [package reference](https://huggingface.co/docs/huggingface.js/inference/classes/HfInference#chatcompletion).
97+
</js>
98+
99+
</inferencesnippet>
100+
101+
102+
103+
#### Code snippet example for conversational VLMs
92104
105+
106+
<inferencesnippet>
107+
108+
<curl>
109+
```bash
110+
curl 'https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct/v1/chat/completions' \
111+
-H "Authorization: Bearer hf_***" \
112+
-H 'Content-Type: application/json' \
113+
-d '{
114+
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
115+
"messages": [
116+
{
117+
"role": "user",
118+
"content": [
119+
{"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
120+
{"type": "text", "text": "Describe this image in one sentence."}
121+
]
122+
}
123+
],
124+
"max_tokens": 500,
125+
"stream": false
126+
}'
127+
128+
```
129+
</curl>
130+
131+
<python>
132+
```py
133+
from huggingface_hub import InferenceClient
134+
135+
client = InferenceClient(api_key="hf_***")
136+
137+
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
138+
139+
for message in client.chat_completion(
140+
model="meta-llama/Llama-3.2-11B-Vision-Instruct",
141+
messages=[
142+
{
143+
"role": "user",
144+
"content": [
145+
{"type": "image_url", "image_url": {"url": image_url}},
146+
{"type": "text", "text": "Describe this image in one sentence."},
147+
],
148+
}
149+
],
150+
max_tokens=500,
151+
stream=True,
152+
):
153+
print(message.choices[0].delta.content, end="")
154+
```
155+
156+
To use the Python client, see `huggingface_hub`'s [package reference](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion).
157+
</python>
158+
159+
<js>
160+
```js
161+
import { HfInference } from "@huggingface/inference";
162+
163+
const inference = new HfInference("hf_***");
164+
const imageUrl = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg";
165+
166+
for await (const chunk of inference.chatCompletionStream({
167+
model: "meta-llama/Llama-3.2-11B-Vision-Instruct",
168+
messages: [
169+
{
170+
"role": "user",
171+
"content": [
172+
{"type": "image_url", "image_url": {"url": imageUrl}},
173+
{"type": "text", "text": "Describe this image in one sentence."},
174+
],
175+
}
176+
],
177+
max_tokens: 500,
178+
})) {
179+
process.stdout.write(chunk.choices[0]?.delta?.content || "");
180+
}
93181
```
94182
95183
To use the JavaScript client, see `huggingface.js`'s [package reference](https://huggingface.co/docs/huggingface.js/inference/classes/HfInference#chatcompletion).
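
The VLM snippets added in this hunk all share one message shape: a single user turn whose `content` is a list holding an `image_url` part followed by a `text` part. A minimal sketch of that shape follows; the helper names `build_vlm_messages` and `build_chat_request` are hypothetical, and the default `max_tokens` and `stream` values are simply copied from the curl example above.

```python
from typing import Any, Dict, List


def build_vlm_messages(image_url: str, prompt: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: one user turn with an image part and a text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def build_chat_request(
    model: str,
    image_url: str,
    prompt: str,
    max_tokens: int = 500,
    stream: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: JSON body mirroring the curl example's fields."""
    return {
        "model": model,
        "messages": build_vlm_messages(image_url, prompt),
        "max_tokens": max_tokens,
        "stream": stream,
    }
```

The list returned by `build_vlm_messages` has the same structure as the `messages` argument passed to `client.chat_completion(...)` in the Python snippet of this hunk.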

docs/api-inference/tasks/feature-extraction.md

Lines changed: 1 addition & 2 deletions
@@ -31,7 +31,7 @@ For more details about the `feature-extraction` task, check out its [dedicated p
3131

3232
- [thenlper/gte-large](https://huggingface.co/thenlper/gte-large): A powerful feature extraction model for natural language processing tasks.
3333

34-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=feature-extraction&sort=trending).
34+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=feature-extraction&sort=trending).
3535

3636
### Using the API
3737

@@ -45,7 +45,6 @@ curl https://api-inference.huggingface.co/models/thenlper/gte-large \
4545
-d '{"inputs": "Today is a sunny day and I will get some ice cream."}' \
4646
-H 'Content-Type: application/json' \
4747
-H "Authorization: Bearer hf_***"
48-
4948
```
5049
</curl>
5150

docs/api-inference/tasks/fill-mask.md

Lines changed: 1 addition & 2 deletions
@@ -27,7 +27,7 @@ For more details about the `fill-mask` task, check out its [dedicated page](http
2727
- [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased): The famous BERT model.
2828
- [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base): A multilingual model trained on 100 languages.
2929

30-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=fill-mask&sort=trending).
30+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=fill-mask&sort=trending).
3131

3232
### Using the API
3333

@@ -41,7 +41,6 @@ curl https://api-inference.huggingface.co/models/google-bert/bert-base-uncased \
4141
-d '{"inputs": "The answer to the universe is [MASK]."}' \
4242
-H 'Content-Type: application/json' \
4343
-H "Authorization: Bearer hf_***"
44-
4544
```
4645
</curl>
4746

docs/api-inference/tasks/image-classification.md

Lines changed: 2 additions & 2 deletions
@@ -25,8 +25,9 @@ For more details about the `image-classification` task, check out its [dedicated
2525
### Recommended models
2626

2727
- [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224): A strong image classification model.
28+
- [facebook/deit-base-distilled-patch16-224](https://huggingface.co/facebook/deit-base-distilled-patch16-224): A robust image classification model.
2829

29-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-classification&sort=trending).
30+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-classification&sort=trending).
3031

3132
### Using the API
3233

@@ -39,7 +40,6 @@ curl https://api-inference.huggingface.co/models/google/vit-base-patch16-224 \
3940
-X POST \
4041
--data-binary '@cats.jpg' \
4142
-H "Authorization: Bearer hf_***"
42-
4343
```
4444
</curl>
4545

docs/api-inference/tasks/image-segmentation.md

Lines changed: 1 addition & 2 deletions
@@ -26,7 +26,7 @@ For more details about the `image-segmentation` task, check out its [dedicated p
2626

2727
- [nvidia/segformer-b0-finetuned-ade-512-512](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512): Semantic segmentation model trained on ADE20k benchmark dataset with 512x512 resolution.
2828

29-
This is only a subset of the supported models. Find the model that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-segmentation&sort=trending).
29+
Explore all available models and find the one that suits you best [here](https://huggingface.co/models?inference=warm&pipeline_tag=image-segmentation&sort=trending).
3030

3131
### Using the API
3232

@@ -39,7 +39,6 @@ curl https://api-inference.huggingface.co/models/nvidia/segformer-b0-finetuned-a
3939
-X POST \
4040
--data-binary '@cats.jpg' \
4141
-H "Authorization: Bearer hf_***"
42-
4342
```
4443
</curl>
4544
