
Commit 80af1c4

Add support for Segformer (#480)
* Add support for Segformer
* Add semantic segmentation unit test
* Update pipelines.test.js
1 parent 1394f73 commit 80af1c4


7 files changed: +206 -5 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -326,6 +326,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
+1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@
 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
+1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

scripts/supported_models.py

Lines changed: 37 additions & 3 deletions
@@ -362,7 +362,7 @@
         'distilbert-base-cased',
     ],
 },
-'dit': { # NOTE: DiT has the same architecture as BEiT.
+'dit': { # NOTE: DiT has the same architecture as BEiT.
     # Feature extraction
     # NOTE: requires --task feature-extraction
     'feature-extraction': [
@@ -680,8 +680,8 @@
         'hf-tiny-model-private/tiny-random-RoFormerForTokenClassification',
     ],

-    # TODO
-    # # Text generation
+    # TODO
+    # # Text generation
     # 'text-generation': [
     #     'hf-tiny-model-private/tiny-random-RoFormerForCausalLM',
     # ],
@@ -736,6 +736,40 @@
     # 'facebook/sam-vit-large',
     # 'facebook/sam-vit-huge',
     # ],
+'segformer': {
+    # Image segmentation
+    'image-segmentation': [
+        'mattmdjaga/segformer_b0_clothes',
+        'mattmdjaga/segformer_b2_clothes',
+        'jonathandinu/face-parsing',
+
+        'nvidia/segformer-b0-finetuned-cityscapes-768-768',
+        'nvidia/segformer-b0-finetuned-cityscapes-512-1024',
+        'nvidia/segformer-b0-finetuned-cityscapes-640-1280',
+        'nvidia/segformer-b0-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b1-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b2-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b3-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b4-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b5-finetuned-cityscapes-1024-1024',
+        'nvidia/segformer-b0-finetuned-ade-512-512',
+        'nvidia/segformer-b1-finetuned-ade-512-512',
+        'nvidia/segformer-b2-finetuned-ade-512-512',
+        'nvidia/segformer-b3-finetuned-ade-512-512',
+        'nvidia/segformer-b4-finetuned-ade-512-512',
+        'nvidia/segformer-b5-finetuned-ade-640-640',
+    ],
+
+    # Image classification
+    'image-classification': [
+        'nvidia/mit-b0',
+        'nvidia/mit-b1',
+        'nvidia/mit-b2',
+        'nvidia/mit-b3',
+        'nvidia/mit-b4',
+        'nvidia/mit-b5',
+    ],
+},

 'speecht5': {
     # Text-to-audio/Text-to-speech

src/models.js

Lines changed: 38 additions & 0 deletions
@@ -4736,6 +4736,27 @@ export class VitsModel extends VitsPreTrainedModel {
 }
 //////////////////////////////////////////////////

+//////////////////////////////////////////////////
+// Segformer models
+export class SegformerPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * The bare SegFormer encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top.
+ */
+export class SegformerModel extends SegformerPreTrainedModel { }
+
+/**
+ * SegFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden states) e.g. for ImageNet.
+ */
+export class SegformerForImageClassification extends SegformerPreTrainedModel { }
+
+/**
+ * SegFormer Model transformer with an all-MLP decode head on top e.g. for ADE20k, CityScapes.
+ */
+export class SegformerForSemanticSegmentation extends SegformerPreTrainedModel { }
+
+//////////////////////////////////////////////////
+

 //////////////////////////////////////////////////
 // AutoModels, used to simplify construction of PreTrainedModels
@@ -5020,6 +5041,7 @@ const MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = new Map([
     ['dinov2', ['Dinov2ForImageClassification', Dinov2ForImageClassification]],
     ['resnet', ['ResNetForImageClassification', ResNetForImageClassification]],
     ['swin', ['SwinForImageClassification', SwinForImageClassification]],
+    ['segformer', ['SegformerForImageClassification', SegformerForImageClassification]],
 ]);

 const MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = new Map([
@@ -5036,6 +5058,10 @@ const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([
     ['detr', ['DetrForSegmentation', DetrForSegmentation]],
 ]);

+const MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES = new Map([
+    ['segformer', ['SegformerForSemanticSegmentation', SegformerForSemanticSegmentation]],
+]);
+
 const MODEL_FOR_MASK_GENERATION_MAPPING_NAMES = new Map([
     ['sam', ['SamModel', SamModel]],
 ]);
@@ -5081,6 +5107,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
     [MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, MODEL_TYPES.Vision2Seq],
     [MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
+    [MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
@@ -5260,6 +5287,17 @@ export class AutoModelForImageSegmentation extends PretrainedMixin {
     static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES];
 }

+/**
+ * Helper class which is used to instantiate pretrained image segmentation models with the `from_pretrained` function.
+ * The chosen model class is determined by the type specified in the model config.
+ *
+ * @example
+ * let model = await AutoModelForSemanticSegmentation.from_pretrained('nvidia/segformer-b3-finetuned-cityscapes-1024-1024');
+ */
+export class AutoModelForSemanticSegmentation extends PretrainedMixin {
+    static MODEL_CLASS_MAPPINGS = [MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES];
+}
+
 /**
  * Helper class which is used to instantiate pretrained object detection models with the `from_pretrained` function.
  * The chosen model class is determined by the type specified in the model config.
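For context, the new auto class can also be used directly, without the pipeline API. The following is a minimal sketch and not part of this commit; it assumes the package's existing `AutoProcessor` and `RawImage` exports, and the checkpoint and image URL are taken from `scripts/supported_models.py` and the new unit test.

```js
import { AutoProcessor, AutoModelForSemanticSegmentation, RawImage } from '@xenova/transformers';

// Checkpoint listed in scripts/supported_models.py; image from the new unit test
const model_id = 'mattmdjaga/segformer_b2_clothes';
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';

const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForSemanticSegmentation.from_pretrained(model_id);

// Preprocess the image and run the forward pass
const image = await RawImage.read(url);
const { pixel_values } = await processor(image);
const outputs = await model({ pixel_values }); // outputs.logits: [batch, num_labels, height, width]

// Reduce the logits to a per-pixel label map at the original image size
const [{ segmentation, labels }] = processor.feature_extractor
    .post_process_semantic_segmentation(outputs, [[image.height, image.width]]);

console.log(labels.map(l => model.config.id2label[l])); // e.g. ['Background', 'Hair', ...]
```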

src/pipelines.js

Lines changed: 21 additions & 2 deletions
@@ -33,6 +33,7 @@ import {
     AutoModelForVision2Seq,
     AutoModelForImageClassification,
     AutoModelForImageSegmentation,
+    AutoModelForSemanticSegmentation,
     AutoModelForObjectDetection,
     AutoModelForZeroShotObjectDetection,
     AutoModelForDocumentQuestionAnswering,
@@ -1710,8 +1711,26 @@ export class ImageSegmentationPipeline extends Pipeline {
             }

         } else if (subtask === 'semantic') {
-            throw Error(`semantic segmentation not yet supported.`);
+            const { segmentation, labels } = fn(output, target_sizes ?? imageSizes)[0];

+            const id2label = this.model.config.id2label;
+
+            for (let label of labels) {
+                const maskData = new Uint8ClampedArray(segmentation.data.length);
+                for (let i = 0; i < segmentation.data.length; ++i) {
+                    if (segmentation.data[i] === label) {
+                        maskData[i] = 255;
+                    }
+                }
+
+                const mask = new RawImage(maskData, segmentation.dims[1], segmentation.dims[0], 1);
+
+                annotation.push({
+                    score: null,
+                    label: id2label[label],
+                    mask: mask
+                });
+            }
         } else {
             throw Error(`Subtask ${subtask} not supported.`);
         }
@@ -2488,7 +2507,7 @@ const SUPPORTED_TASKS = {
     "image-segmentation": {
         // no tokenizer
         "pipeline": ImageSegmentationPipeline,
-        "model": AutoModelForImageSegmentation,
+        "model": [AutoModelForImageSegmentation, AutoModelForSemanticSegmentation],
         "processor": AutoProcessor,
         "default": {
             // TODO: replace with original
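Because the task's `"model"` entry now lists both auto classes, SegFormer checkpoints load through the ordinary `pipeline()` factory and end up in the `'semantic'` branch added above. A usage sketch, not part of the diff; the checkpoint, image URL, and output shape mirror the new unit test:

```js
import { pipeline } from '@xenova/transformers';

// Create an image-segmentation pipeline backed by a SegFormer checkpoint
const segmenter = await pipeline('image-segmentation', 'mattmdjaga/segformer_b2_clothes');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const output = await segmenter(url);

// For the semantic subtask, each entry has a null score, a class label from
// config.id2label, and a single-channel RawImage mask at the input resolution, e.g.
// [
//   { score: null, label: 'Background',    mask: RawImage { ... } },
//   { score: null, label: 'Upper-clothes', mask: RawImage { ... } },
//   ...
// ]
console.log(output);
```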

src/processors.js

Lines changed: 66 additions & 0 deletions
@@ -609,6 +609,71 @@ export class ImageFeatureExtractor extends FeatureExtractor {

 }

+export class SegformerFeatureExtractor extends ImageFeatureExtractor {
+
+    /**
+     * Converts the output of `SegformerForSemanticSegmentation` into semantic segmentation maps.
+     * @param {*} outputs Raw outputs of the model.
+     * @param {number[][]} [target_sizes=null] List of tuples corresponding to the requested final size
+     * (height, width) of each prediction. If unset, predictions will not be resized.
+     * @returns {{segmentation: Tensor; labels: number[]}[]} The semantic segmentation maps.
+     */
+    post_process_semantic_segmentation(outputs, target_sizes = null) {
+
+        const logits = outputs.logits;
+        const batch_size = logits.dims[0];
+
+        if (target_sizes !== null && target_sizes.length !== batch_size) {
+            throw Error("Make sure that you pass in as many target sizes as the batch dimension of the logits")
+        }
+
+        const toReturn = [];
+        for (let i = 0; i < batch_size; ++i) {
+            const target_size = target_sizes !== null ? target_sizes[i] : null;
+
+            let data = logits[i];
+
+            // 1. If target_size is not null, we need to resize the masks to the target size
+            if (target_size !== null) {
+                // resize the masks to the target size
+                data = interpolate(data, target_size, 'bilinear', false);
+            }
+            const [height, width] = target_size ?? data.dims.slice(-2);
+
+            const segmentation = new Tensor(
+                'int32',
+                new Int32Array(height * width),
+                [height, width]
+            );
+
+            // Buffer to store current largest value
+            const buffer = data[0].data;
+            for (let j = 1; j < data.dims[0]; ++j) {
+                const row = data[j].data;
+                for (let k = 0; k < row.length; ++k) {
+                    if (row[k] > buffer[k]) {
+                        buffer[k] = row[k];
+                        segmentation.data[k] = j;
+                    }
+                }
+            }
+
+            // Store which objects have labels
+            // This is much more efficient than creating a set of the final values
+            const hasLabel = new Array(data.dims[0]);
+            const out = segmentation.data;
+            for (let j = 0; j < out.length; ++j) {
+                const index = out[j];
+                hasLabel[index] = index;
+            }
+            /** @type {number[]} The unique list of labels that were detected */
+            const labels = hasLabel.filter(x => x !== undefined);
+
+            toReturn.push({ segmentation, labels });
+        }
+        return toReturn;
+    }
+}
 export class BitImageProcessor extends ImageFeatureExtractor { }
 export class DPTFeatureExtractor extends ImageFeatureExtractor { }
 export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
@@ -1699,6 +1764,7 @@ export class AutoProcessor {
     ChineseCLIPFeatureExtractor,
     ConvNextFeatureExtractor,
     ConvNextImageProcessor,
+    SegformerFeatureExtractor,
     BitImageProcessor,
     DPTFeatureExtractor,
     GLPNFeatureExtractor,
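The core of `post_process_semantic_segmentation` is a per-pixel argmax over the class dimension of the `[num_labels, height, width]` logits, followed by collecting the label indices that actually occur. Below is a standalone sketch of that reduction on plain nested arrays, for illustration only; it leaves out the library's `Tensor` wrapper and the optional bilinear `interpolate` resize.

```js
// logits[c][p]: score of class c at flattened pixel p (num_labels x height*width)
function argmaxPerPixel(logits) {
    const numPixels = logits[0].length;
    const best = logits[0].slice();                     // running best score per pixel
    const segmentation = new Array(numPixels).fill(0);  // winning class index per pixel

    for (let c = 1; c < logits.length; ++c) {
        for (let p = 0; p < numPixels; ++p) {
            if (logits[c][p] > best[p]) {
                best[p] = logits[c][p];
                segmentation[p] = c;
            }
        }
    }
    // Unique labels that actually appear in the map
    const labels = [...new Set(segmentation)];
    return { segmentation, labels };
}

// Example: 2 classes over a 2x2 image (4 pixels)
console.log(argmaxPerPixel([
    [0.1, 0.9, 0.4, 0.2],   // class 0 scores
    [0.8, 0.3, 0.6, 0.1],   // class 1 scores
])); // => { segmentation: [1, 0, 1, 0], labels: [1, 0] }
```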

tests/pipelines.test.js

Lines changed: 42 additions & 0 deletions
@@ -1164,6 +1164,7 @@ describe('Pipelines', () => {
         // List all models which will be tested
         const models = [
             'facebook/detr-resnet-50-panoptic',
+            'mattmdjaga/segformer_b2_clothes',
         ];

         it(models[0], async () => {
@@ -1195,6 +1196,47 @@ describe('Pipelines', () => {
             await segmenter.dispose();

         }, MAX_TEST_EXECUTION_TIME);
+
+        it(models[1], async () => {
+            let segmenter = await pipeline('image-segmentation', m(models[1]));
+            let img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
+
+            // single
+            {
+                let outputs = await segmenter(img);
+
+                let expected = [
+                    { label: 'Background' },
+                    { label: 'Hair' },
+                    { label: 'Upper-clothes' },
+                    { label: 'Pants' },
+                    { label: 'Left-shoe' },
+                    { label: 'Right-shoe' },
+                    { label: 'Face' },
+                    { label: 'Left-leg' },
+                    { label: 'Right-leg' },
+                    { label: 'Left-arm' },
+                    { label: 'Right-arm' },
+                ];
+
+                let outputLabels = outputs.map(x => x.label);
+                let expectedLabels = expected.map(x => x.label);
+
+                expect(outputLabels).toHaveLength(expectedLabels.length);
+                expect(outputLabels.sort()).toEqual(expectedLabels.sort())
+
+                // check that all scores are null, and masks have correct dimensions
+                for (let output of outputs) {
+                    expect(output.score).toBeNull();
+                    expect(output.mask.width).toEqual(970);
+                    expect(output.mask.height).toEqual(1455);
+                    expect(output.mask.channels).toEqual(1);
+                }
+            }
+
+            await segmenter.dispose();
+
+        }, MAX_TEST_EXECUTION_TIME);
     });

     describe('Zero-shot image classification', () => {
