packages/tasks/src/tasks/image-text-to-text/about.md
36 additions & 24 deletions
@@ -32,39 +32,51 @@ Vision language models can recognize images through descriptions. When given det
## Inference
- You can use the Transformers library to interact with vision-language models. You can load the model like below.
+ You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to infer models.
+
+ The model's built-in chat template will be used to format the conversational input. We can pass the image as a URL in the `content` part of the user message:
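The new pipeline-based snippet isn't fully visible in this hunk, so here is a minimal sketch of the setup and message format described above; the checkpoint (`llava-hf/llava-interleave-qwen-0.5b-hf`) and the image URL are assumptions, not taken from the diff:

```python
from transformers import pipeline

# Assumed example checkpoint; any image-text-to-text model with a chat template
# should work the same way.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

# Chat-style input: the image is passed as a URL inside the `content` of the user
# message, and the model's built-in chat template turns it into the actual prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",  # placeholder image URL
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```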
The previous snippet, which imported `LlavaNextProcessor` and `LlavaNextForConditionalGeneration` and hand-built an `[INST]` prompt, is removed:

```python
- from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
- prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
```

We can now directly pass the messages to the pipeline to run inference. The `return_full_text` flag is used to return the full prompt in the response, including the user input. Here we pass `False` to only return the generated text.

```python
+ outputs[0]["generated_text"]
+ # The image captures a moment of tranquility in nature. At the center of the frame, a pink flower with a yellow center is in full bloom. The flower is surrounded by a cluster of red flowers, their vibrant color contrasting with the pink of the flower. \n\nA black and yellow bee is per
```
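The added lines above only show the indexing into the pipeline output; under the same assumed checkpoint and messages as in the sketch earlier, the call they imply could look like this (`max_new_tokens` is an arbitrary choice):

```python
# Pass the chat messages straight to the pipeline; return_full_text=False drops the
# prompt from the response so only the newly generated text is returned.
outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
print(outputs[0]["generated_text"])
```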
You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
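As a sketch of that flow with `huggingface_hub` (assuming the model is deployed on the serverless Inference API; the token, model ID, and image URL below are placeholders):

```python
from huggingface_hub import InferenceClient

# Authenticate with a Hugging Face token (placeholder shown here).
client = InferenceClient(token="hf_xxx")

response = client.chat_completion(
    model="llava-hf/llava-interleave-qwen-0.5b-hf",  # assumed example model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```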