packages/tasks/src/tasks/image-text-to-text/about.md (8 additions, 3 deletions)
@@ -24,12 +24,16 @@ Vision language models trained on image-text pairs can be used for visual questi
### Document Question Answering and Retrieval

-Documents often consist of different layouts, charts, tables, images, and more. Vision language models trained on formatted documents can extract information from them. This is an OCR-free approach; the inputs skip OCR, and documents are directly fed to vision language models.
+Documents often consist of different layouts, charts, tables, images, and more. Vision language models trained on formatted documents can extract information from them. This is an OCR-free approach; the inputs skip OCR, and documents are fed directly to vision language models. To retrieve the relevant documents to feed to the model, retrieval models such as [ColPali](https://huggingface.co/blog/manu/colpali) are used. An example workflow can be found [here](https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb).
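
As a rough sketch of the retrieval step (assuming the ColPali classes shipped in recent Transformers releases; the checkpoint, placeholder page images, and query below are illustrative, and real inputs would be rendered document pages):

```python
# Sketch only: assumes a recent Transformers release that ships
# ColPaliForRetrieval/ColPaliProcessor; checkpoint and inputs are placeholders.
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"
model = ColPaliForRetrieval.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Placeholder page images; in practice these are rendered document pages.
pages = [Image.new("RGB", (448, 448), "white"), Image.new("RGB", (448, 448), "gray")]
queries = ["Which page shows last year's revenue breakdown?"]

page_inputs = processor(images=pages).to(model.device)
query_inputs = processor(text=queries).to(model.device)

with torch.no_grad():
    page_embeddings = model(**page_inputs).embeddings
    query_embeddings = model(**query_inputs).embeddings

# Late-interaction scores: higher means a page is more relevant to the query.
scores = processor.score_retrieval(query_embeddings, page_embeddings)
print(scores)
```

The top-scoring pages can then be passed to a vision language model such as Qwen2-VL, as in the linked notebook.
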
### Image Recognition with Instructions

Vision language models can recognize images through descriptions. When given detailed descriptions of specific entities, they can classify the entities in an image.

+### Computer Use
+
+Image-text-to-text models can be used to control computers with agentic workflows. Models like [ShowUI](https://huggingface.co/showlab/ShowUI-2B) and [OmniParser](https://huggingface.co/microsoft/OmniParser) are used to parse screenshots and then take actions on the computer autonomously.
+
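
As a very rough sketch of the loop such agents run (assuming the generic image-text-to-text pipeline; the checkpoint, prompt, and output handling below are placeholders rather than ShowUI's or OmniParser's actual interfaces, which define their own prompt and coordinate formats):

```python
# Illustrative screenshot-to-action step; model id, prompt, and output handling
# are assumptions, not the actual API of ShowUI or OmniParser.
from PIL import ImageGrab  # requires a desktop session
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")  # assumed checkpoint

def propose_next_action(goal):
    """Grab the current screen and ask the model for the next UI action."""
    screenshot = ImageGrab.grab()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": screenshot},
                {"type": "text", "text": f"Goal: {goal}. Describe the single next UI action to take."},
            ],
        }
    ]
    return pipe(text=messages, max_new_tokens=64)

print(propose_next_action("Open the system settings"))
```
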
## Inference
You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to run inference with these models.
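
For example (the checkpoint and image URL below are illustrative choices, not the only options):

```python
# Minimal sketch using the image-text-to-text pipeline; the checkpoint and
# image URL are illustrative choices.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=50)
print(outputs[0]["generated_text"])
```
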

  description: "An image-text-to-text application focused on documents.",
  id: "stepfun-ai/GOT_official_online_demo",
},
-{
-  description: "An application to compare outputs of different vision language models.",
-  id: "merve/compare_VLMs",
-},
{
  description: "An application for chatting with an image-text-to-text model.",
  id: "GanymedeNil/Qwen2-VL-7B",
},
+{
+  description: "An application that parses screenshots into actions.",
+  id: "showlab/ShowUI",
+},
+{
+  description: "An application that detects gaze.",
+  id: "smoondream/gaze-demo",
+},
],
summary:
  "Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.",