
Commit 4b2fbb6

Authored by merveenoyan (Merve Noyan) and Vaibhavs10
Update Tasks (#1098)
Update models, datasets, Spaces, and content where necessary.

Co-authored-by: Merve Noyan <[email protected]>
Co-authored-by: vb <[email protected]>
1 parent 697e9be commit 4b2fbb6

File tree

19 files changed: +120 additions, −78 deletions

packages/tasks/src/tasks/audio-to-audio/data.ts

Lines changed: 0 additions & 4 deletions
@@ -38,10 +38,6 @@ const taskData: TaskDataCustom = {
 		},
 	],
 	models: [
-		{
-			description: "A solid model of audio source separation.",
-			id: "speechbrain/sepformer-wham",
-		},
 		{
 			description: "A speech enhancement model.",
 			id: "ResembleAI/resemble-enhance",

packages/tasks/src/tasks/fill-mask/data.ts

Lines changed: 2 additions & 2 deletions
@@ -61,8 +61,8 @@ const taskData: TaskDataCustom = {
 	],
 	models: [
 		{
-			description: "The famous BERT model.",
-			id: "google-bert/bert-base-uncased",
+			description: "State-of-the-art masked language model.",
+			id: "answerdotai/ModernBERT-large",
 		},
 		{
 			description: "A multilingual model trained on 100 languages.",

packages/tasks/src/tasks/image-classification/data.ts

Lines changed: 2 additions & 3 deletions
@@ -74,9 +74,8 @@ const taskData: TaskDataCustom = {
 	],
 	spaces: [
 		{
-			// TO DO: write description
-			description: "An application that classifies what a given image is about.",
-			id: "nielsr/perceiver-image-classification",
+			description: "A leaderboard to evaluate different image classification models.",
+			id: "timm/leaderboard",
 		},
 	],
 	summary:

packages/tasks/src/tasks/image-feature-extraction/data.ts

Lines changed: 8 additions & 3 deletions
@@ -43,15 +43,20 @@ const taskData: TaskDataCustom = {
 			id: "facebook/dino-vitb16",
 		},
 		{
-			description: "Strong image feature extraction model made for information retrieval from documents.",
-			id: "vidore/colpali",
+			description: "Cutting-edge image feature extraction model.",
+			id: "apple/aimv2-large-patch14-336-distilled",
 		},
 		{
 			description: "Strong image feature extraction model that can be used on images and documents.",
 			id: "OpenGVLab/InternViT-6B-448px-V1-2",
 		},
 	],
-	spaces: [],
+	spaces: [
+		{
+			description: "A leaderboard to evaluate different image-feature-extraction models on classification performances",
+			id: "timm/leaderboard",
+		},
+	],
 	summary: "Image feature extraction is the task of extracting features learnt in a computer vision model.",
 	widgetModels: [],
 };

packages/tasks/src/tasks/image-text-to-text/about.md

Lines changed: 8 additions & 3 deletions
@@ -24,12 +24,16 @@
 
 ### Document Question Answering and Retrieval
 
-Documents often consist of different layouts, charts, tables, images, and more. Vision language models trained on formatted documents can extract information from them. This is an OCR-free approach; the inputs skip OCR, and documents are directly fed to vision language models.
+Documents often consist of different layouts, charts, tables, images, and more. Vision language models trained on formatted documents can extract information from them. This is an OCR-free approach; the inputs skip OCR, and documents are directly fed to vision language models. To find the relevant documents to be fed, models like [ColPali](https://huggingface.co/blog/manu/colpali) are used. An example workflow can be found [here](https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb).
 
 ### Image Recognition with Instructions
 
 Vision language models can recognize images through descriptions. When given detailed descriptions of specific entities, it can classify the entities in an image.
 
+### Computer Use
+
+Image-text-to-text models can be used to control computers with agentic workflows. Models like [ShowUI](https://huggingface.co/showlab/ShowUI-2B) and [OmniParser](https://huggingface.co/microsoft/OmniParser) are used to parse screenshots to later take actions on the computer autonomously.
+
 ## Inference
 
 You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to infer models.

@@ -82,7 +86,8 @@
 ## Useful Resources
 
 - [Vision Language Models Explained](https://huggingface.co/blog/vlms)
-- [Open-source Multimodality and How to Achieve it using Hugging Face](https://www.youtube.com/watch?v=IoGaGfU1CIg&t=601s)
-- [Introducing Idefics2: A Powerful 8B Vision-Language Model for the community](https://huggingface.co/blog/idefics2)
+- [Welcome PaliGemma 2 – New vision language models by Google](https://huggingface.co/blog/paligemma2)
+- [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm)
+- [Multimodal RAG using ColPali and Qwen2-VL](https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb)
 - [Image-text-to-text task guide](https://huggingface.co/tasks/image-text-to-text)
 - [Preference Optimization for Vision Language Models with TRL](https://huggingface.co/blog/dpo_vlm)
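The Inference section referenced above shows `pipeline` and a curl call against the Inference API. As a rough TypeScript counterpart, here is a minimal sketch using the `chatCompletion` method of `@huggingface/inference`; the image URL, prompt, and token handling are placeholders, not part of this commit:

```ts
import { HfInference } from "@huggingface/inference";

// Sketch only: query a vision-language model through the Inference API.
// HF_TOKEN and the image URL are placeholders.
const hf = new HfInference(process.env.HF_TOKEN);

const response = await hf.chatCompletion({
	model: "meta-llama/Llama-3.2-11B-Vision-Instruct",
	messages: [
		{
			role: "user",
			content: [
				// OpenAI-compatible message format with an image part
				{ type: "image_url", image_url: { url: "https://example.com/document.png" } }, // placeholder image
				{ type: "text", text: "Describe this document." },
			],
		},
	],
	max_tokens: 256,
});

console.log(response.choices[0].message.content);
```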

packages/tasks/src/tasks/image-text-to-text/data.ts

Lines changed: 19 additions & 11 deletions
@@ -7,8 +7,8 @@ const taskData: TaskDataCustom = {
 			id: "liuhaotian/LLaVA-Instruct-150K",
 		},
 		{
-			description: "Conversation turns where questions involve image and text.",
-			id: "liuhaotian/LLaVA-Pretrain",
+			description: "Collection of image-text pairs on scientific topics.",
+			id: "DAMO-NLP-SG/multimodal_textbook",
 		},
 		{
 			description: "A collection of datasets made for model fine-tuning.",

@@ -43,11 +43,15 @@ const taskData: TaskDataCustom = {
 	metrics: [],
 	models: [
 		{
-			description: "Powerful vision language model with great visual understanding and reasoning capabilities.",
-			id: "meta-llama/Llama-3.2-11B-Vision-Instruct",
+			description: "Small and efficient yet powerful vision language model.",
+			id: "HuggingFaceTB/SmolVLM-Instruct",
 		},
 		{
-			description: "Cutting-edge vision language models.",
+			description: "A screenshot understanding model used to control computers.",
+			id: "showlab/ShowUI-2B",
+		},
+		{
+			description: "Cutting-edge vision language model.",
 			id: "allenai/Molmo-7B-D-0924",
 		},
 		{

@@ -59,8 +63,8 @@ const taskData: TaskDataCustom = {
 			id: "Qwen/Qwen2-VL-7B-Instruct",
 		},
 		{
-			description: "Strong image-text-to-text model.",
-			id: "mistralai/Pixtral-12B-2409",
+			description: "Image-text-to-text model with reasoning capabilities.",
+			id: "Qwen/QVQ-72B-Preview",
 		},
 		{
 			description: "Strong image-text-to-text model focused on documents.",

@@ -84,14 +88,18 @@ const taskData: TaskDataCustom = {
 			description: "An image-text-to-text application focused on documents.",
 			id: "stepfun-ai/GOT_official_online_demo",
 		},
-		{
-			description: "An application to compare outputs of different vision language models.",
-			id: "merve/compare_VLMs",
-		},
 		{
 			description: "An application for chatting with an image-text-to-text model.",
 			id: "GanymedeNil/Qwen2-VL-7B",
 		},
+		{
+			description: "An application that parses screenshots into actions.",
+			id: "showlab/ShowUI",
+		},
+		{
+			description: "An application that detects gaze.",
+			id: "smoondream/gaze-demo",
+		},
 	],
 	summary:
 		"Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.",

packages/tasks/src/tasks/image-to-3d/data.ts

Lines changed: 4 additions & 4 deletions
@@ -41,8 +41,8 @@ const taskData: TaskDataCustom = {
 			id: "hwjiang/Real3D",
 		},
 		{
-			description: "Generative 3D gaussian splatting model.",
-			id: "ashawkey/LGM",
+			description: "Consistent image-to-3d generation model.",
+			id: "stabilityai/stable-point-aware-3d",
 		},
 	],
 	spaces: [

@@ -55,8 +55,8 @@ const taskData: TaskDataCustom = {
 			id: "TencentARC/InstantMesh",
 		},
 		{
-			description: "Image-to-3D demo with mesh outputs.",
-			id: "stabilityai/TripoSR",
+			description: "Image-to-3D demo.",
+			id: "stabilityai/stable-point-aware-3d",
 		},
 		{
 			description: "Image-to-3D demo with mesh outputs.",

packages/tasks/src/tasks/image-to-image/data.ts

Lines changed: 12 additions & 5 deletions
@@ -10,6 +10,10 @@ const taskData: TaskDataCustom = {
 			description: "Multiple images of celebrities, used for facial expression translation",
 			id: "huggan/CelebA-faces",
 		},
+		{
+			description: "12M image-caption pairs.",
+			id: "Spawning/PD12M",
+		},
 	],
 	demo: {
 		inputs: [

@@ -53,17 +57,20 @@ const taskData: TaskDataCustom = {
 			id: "keras-io/super-resolution",
 		},
 		{
-			description:
-				"A model that creates a set of variations of the input image in the style of DALL-E using Stable Diffusion.",
-			id: "lambdalabs/sd-image-variations-diffusers",
+			description: "A model for applying edits to images through image controls.",
+			id: "Yuanshi/OminiControl",
 		},
 		{
 			description: "A model that generates images based on segments in the input image and the text prompt.",
 			id: "mfidabel/controlnet-segment-anything",
 		},
 		{
-			description: "A model that takes an image and an instruction to edit the image.",
-			id: "timbrooks/instruct-pix2pix",
+			description: "Strong model for inpainting and outpainting.",
+			id: "black-forest-labs/FLUX.1-Fill-dev",
+		},
+		{
+			description: "Strong model for image editing using depth maps.",
+			id: "black-forest-labs/FLUX.1-Depth-dev-lora",
 		},
 	],
 	spaces: [

packages/tasks/src/tasks/index.ts

Lines changed: 1 addition & 1 deletion
@@ -132,7 +132,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
 	"video-classification": ["transformers"],
 	"mask-generation": ["transformers"],
 	"multiple-choice": ["transformers"],
-	"object-detection": ["transformers", "transformers.js"],
+	"object-detection": ["transformers", "transformers.js", "ultralytics"],
 	other: [],
 	"question-answering": ["adapter-transformers", "allennlp", "transformers", "transformers.js"],
 	robotics: [],
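The mapping edited above is a plain `Record`, so downstream consumers can look up the compatible libraries for a pipeline type directly. A minimal sketch, assuming the constant is re-exported from the `@huggingface/tasks` package root:

```ts
import { TASKS_MODEL_LIBRARIES } from "@huggingface/tasks";

// After this commit, object-detection lists ultralytics alongside
// transformers and transformers.js.
const libraries = TASKS_MODEL_LIBRARIES["object-detection"];
console.log(libraries); // ["transformers", "transformers.js", "ultralytics"]
```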

packages/tasks/src/tasks/keypoint-detection/data.ts

Lines changed: 4 additions & 0 deletions
@@ -31,6 +31,10 @@ const taskData: TaskDataCustom = {
 			description: "Strong keypoint detection model used to detect human pose.",
 			id: "facebook/sapiens-pose-1b",
 		},
+		{
+			description: "Powerful keypoint detection model used to detect human pose.",
+			id: "usyd-community/vitpose-plus-base",
+		},
 	],
 	spaces: [
 		{
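All of the `data.ts` diffs above edit the same `TaskDataCustom` structure. For reference when making similar updates, here is a hedged sketch of a minimal entry; the ids are illustrative placeholders, the field subset is not exhaustive, and the import path assumes the type is exported from the package root:

```ts
import type { TaskDataCustom } from "@huggingface/tasks";

// Illustrative skeleton only; ids are placeholders, and real task files in
// packages/tasks/src/tasks/*/data.ts may set additional optional fields.
const taskData: TaskDataCustom = {
	datasets: [{ description: "A dataset for the task.", id: "org/dataset-id" }],
	demo: { inputs: [], outputs: [] },
	metrics: [],
	models: [{ description: "A strong model for the task.", id: "org/model-id" }],
	spaces: [{ description: "An interactive demo for the task.", id: "org/space-id" }],
	summary: "One-sentence description of what the task does.",
	widgetModels: [],
};

export default taskData;
```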

0 commit comments