
Commit 9b84d7b

Add support for CLIPSeg models (#478)
* Add support for CLIPSeg models
* Update JSDoc
1 parent 80af1c4 commit 9b84d7b

5 files changed (+71, -0 lines changed)

README.md

Lines changed: 1 addition & 0 deletions
@@ -278,6 +278,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
+1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
 1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
+1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
 1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

scripts/supported_models.py

Lines changed: 8 additions & 0 deletions
@@ -204,6 +204,14 @@
             'openai/clip-vit-large-patch14-336',
         ],
     },
+    'clipseg': {
+        # Image segmentation
+        'image-segmentation': [
+            'CIDAS/clipseg-rd64-refined',
+            'CIDAS/clipseg-rd64',
+            'CIDAS/clipseg-rd16',
+        ],
+    },
     'codegen': {
         # Text generation
         'text-generation': [
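The entries above drive the conversion pipeline; once converted to ONNX, the checkpoints can be loaded directly from JavaScript. A minimal sketch, assuming the converted weights are published under the Xenova namespace (only `Xenova/clipseg-rd64-refined` is confirmed by the example in `src/models.js` below):

```javascript
import { CLIPSegForImageSegmentation } from '@xenova/transformers';

// 'Xenova/clipseg-rd64-refined' is the converted counterpart of 'CIDAS/clipseg-rd64-refined'.
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');
console.log(model.config.model_type); // expected: 'clipseg'
```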

src/models.js

Lines changed: 58 additions & 0 deletions
@@ -3167,6 +3167,62 @@ export class ChineseCLIPModel extends ChineseCLIPPreTrainedModel { }
 //////////////////////////////////////////////////
 
 
+//////////////////////////////////////////////////
+// CLIPSeg models
+export class CLIPSegPreTrainedModel extends PreTrainedModel { }
+
+export class CLIPSegModel extends CLIPSegPreTrainedModel { }
+
+/**
+ * CLIPSeg model with a Transformer-based decoder on top for zero-shot and one-shot image segmentation.
+ *
+ * **Example:** Perform zero-shot image segmentation with a `CLIPSegForImageSegmentation` model.
+ *
+ * ```javascript
+ * import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';
+ *
+ * // Load tokenizer, processor, and model
+ * const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
+ * const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
+ * const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');
+ *
+ * // Run tokenization
+ * const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
+ * const text_inputs = tokenizer(texts, { padding: true, truncation: true });
+ *
+ * // Read image and run processor
+ * const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
+ * const image_inputs = await processor(image);
+ *
+ * // Run model with both text and pixel inputs
+ * const { logits } = await model({ ...text_inputs, ...image_inputs });
+ * // logits: Tensor {
+ * //   dims: [4, 352, 352],
+ * //   type: 'float32',
+ * //   data: Float32Array(495616) [ ... ],
+ * //   size: 495616
+ * // }
+ * ```
+ *
+ * You can visualize the predictions as follows:
+ * ```javascript
+ * const preds = logits
+ *   .unsqueeze_(1)
+ *   .sigmoid_()
+ *   .mul_(255)
+ *   .round_()
+ *   .to('uint8');
+ *
+ * for (let i = 0; i < preds.dims[0]; ++i) {
+ *   const img = RawImage.fromTensor(preds[i]);
+ *   img.save(`prediction_${i}.png`);
+ * }
+ * ```
+ */
+export class CLIPSegForImageSegmentation extends CLIPSegPreTrainedModel { }
+//////////////////////////////////////////////////
+
+
 //////////////////////////////////////////////////
 // GPT2 models
 export class GPT2PreTrainedModel extends PreTrainedModel {

@@ -4844,6 +4900,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
     ['xlm-roberta', ['XLMRobertaModel', XLMRobertaModel]],
     ['clap', ['ClapModel', ClapModel]],
     ['clip', ['CLIPModel', CLIPModel]],
+    ['clipseg', ['CLIPSegModel', CLIPSegModel]],
     ['chinese_clip', ['ChineseCLIPModel', ChineseCLIPModel]],
     ['mobilebert', ['MobileBertModel', MobileBertModel]],
     ['squeezebert', ['SqueezeBertModel', SqueezeBertModel]],

@@ -5056,6 +5113,7 @@ const MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = new Map([
 
 const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([
     ['detr', ['DetrForSegmentation', DetrForSegmentation]],
+    ['clipseg', ['CLIPSegForImageSegmentation', CLIPSegForImageSegmentation]],
 ]);
 
 const MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES = new Map([
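Registering `clipseg` in the encoder-only and image-segmentation mappings lets the `Auto*` factories dispatch on a checkpoint's `model_type`. A short sketch of what this enables (not part of the diff; it assumes `AutoModelForImageSegmentation` resolves through `MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES` the same way the existing `detr` entry does):

```javascript
import { AutoModel, AutoModelForImageSegmentation, CLIPSegForImageSegmentation } from '@xenova/transformers';

// 'clipseg' -> CLIPSegModel (via MODEL_MAPPING_NAMES_ENCODER_ONLY)
const base = await AutoModel.from_pretrained('Xenova/clipseg-rd64-refined');

// 'clipseg' -> CLIPSegForImageSegmentation (via MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES)
const segmenter = await AutoModelForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');
console.log(segmenter instanceof CLIPSegForImageSegmentation); // expected: true
```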

src/processors.js

Lines changed: 3 additions & 0 deletions
@@ -682,6 +682,8 @@ export class ChineseCLIPFeatureExtractor extends ImageFeatureExtractor { }
 export class ConvNextFeatureExtractor extends ImageFeatureExtractor { }
 export class ConvNextImageProcessor extends ConvNextFeatureExtractor { } // NOTE extends ConvNextFeatureExtractor
 export class ViTFeatureExtractor extends ImageFeatureExtractor { }
+export class ViTImageProcessor extends ImageFeatureExtractor { }
+
 export class MobileViTFeatureExtractor extends ImageFeatureExtractor { }
 export class OwlViTFeatureExtractor extends ImageFeatureExtractor {
     /** @type {post_process_object_detection} */

@@ -1775,6 +1777,7 @@ export class AutoProcessor {
         DonutFeatureExtractor,
         NougatImageProcessor,
 
+        ViTImageProcessor,
         VitMatteImageProcessor,
         SamImageProcessor,
         Swin2SRImageProcessor,
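`ViTImageProcessor` is added and registered with `AutoProcessor`, presumably because the CLIPSeg preprocessor config refers to its image processor by that name (an assumption, not stated in the diff). A minimal sketch of the resulting flow, reusing the image URL from the JSDoc example above:

```javascript
import { AutoProcessor, RawImage } from '@xenova/transformers';

const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');

// The processor yields the `pixel_values` tensor consumed by `CLIPSegForImageSegmentation`.
const { pixel_values } = await processor(image);
console.log(pixel_values.dims); // e.g. [1, 3, 352, 352] (352 x 352 matches the logits size in the JSDoc example)
```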
