
Commit 8c62f4a

Merve Noyan (merveenoyan), pcuenca, and Vaibhavs10 authored
Tasks: Add image-text-to-text pipeline and inference API to task page (huggingface#1039)
..and remove the long inference

Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: vb <[email protected]>
Co-authored-by: Merve Noyan <[email protected]>
1 parent d01296c · commit 8c62f4a

File tree: 1 file changed (+36 −24 lines)

  • packages/tasks/src/tasks/image-text-to-text/about.md

packages/tasks/src/tasks/image-text-to-text/about.md

Lines changed: 36 additions & 24 deletions

## Inference

You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to run inference with these models.

Initialize the pipeline first.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
```
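
By default the pipeline runs on CPU. As a minimal sketch (assuming you have a CUDA GPU and that half precision is acceptable for this model), you can place the pipeline on GPU and load the weights in `float16`:

```python
import torch
from transformers import pipeline

# Sketch: assumes a CUDA GPU is available; device=0 selects the first GPU
pipe = pipeline(
    "image-text-to-text",
    model="llava-hf/llava-interleave-qwen-0.5b-hf",
    device=0,
    torch_dtype=torch.float16,  # half precision to reduce memory use
)
```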

The model's built-in chat template will be used to format the conversational input. We can pass the image as a URL in the `content` part of the user message:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```
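
The `image` entry does not have to be a URL. As a sketch (assuming a local file `bee.jpg` exists and that your Transformers version accepts `PIL` images inside chat messages), you can pass a loaded image object instead:

```python
from PIL import Image

# Hypothetical local copy of the image used above
local_image = Image.open("bee.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": local_image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```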

We can now pass the messages directly to the pipeline for inference. The `return_full_text` flag controls whether the full prompt, including the user input, is returned in the response; here we pass `False` to return only the generated text.

```python
outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)

outputs[0]["generated_text"]
# The image captures a moment of tranquility in nature. At the center of the frame, a pink flower with a yellow center is in full bloom. The flower is surrounded by a cluster of red flowers, their vibrant color contrasting with the pink of the flower. \n\nA black and yellow bee is per
```
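
Note that the sample output above stops mid-sentence: generation is cut off once `max_new_tokens=60` tokens have been produced, so raise `max_new_tokens` if you want a longer description.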

You can also use the Inference API to test image-text-to-text models. You need a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.

```bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
	-X POST \
	-d '{"messages": [{"role": "user","content": [{"type": "image"}, {"type": "text", "text": "Can you describe the image?"}]}]}' \
	-H "Content-Type: application/json" \
	-H "Authorization: Bearer hf_***"
```
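
If you prefer Python over `curl`, the `huggingface_hub` library exposes the same chat interface through `InferenceClient`. A minimal sketch, assuming the same model and a valid access token (`hf_***` is a placeholder):

```python
from huggingface_hub import InferenceClient

# hf_*** is a placeholder: substitute your own access token
client = InferenceClient(token="hf_***")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            },
            {"type": "text", "text": "Can you describe the image?"},
        ],
    }
]

response = client.chat_completion(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=60,
)
print(response.choices[0].message.content)
```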

## Useful Resources
