Commit 0d35d0d

Replace feature extractor by image processor (#1510)
* Replace feature extractor
* More improvements
1 parent e9fe1c7 commit 0d35d0d

11 files changed (+77, -78 lines)

fine-tune-segformer.md

Lines changed: 13 additions & 13 deletions
@@ -197,31 +197,31 @@ label2id = {v: k for k, v in id2label.items()}
 num_labels = len(id2label)
 ```

-## Feature extractor & data augmentation
+## Image processor & data augmentation

-A SegFormer model expects the input to be of a certain shape. To transform our training data to match the expected shape, we can use `SegFormerFeatureExtractor`. We could use the `ds.map` function to apply the feature extractor to the whole training dataset in advance, but this can take up a lot of disk space. Instead, we'll use a *transform*, which will only prepare a batch of data when that data is actually used (on-the-fly). This way, we can start training without waiting for further data preprocessing.
+A SegFormer model expects the input to be of a certain shape. To transform our training data to match the expected shape, we can use `SegFormerImageProcessor`. We could use the `ds.map` function to apply the image processor to the whole training dataset in advance, but this can take up a lot of disk space. Instead, we'll use a *transform*, which will only prepare a batch of data when that data is actually used (on-the-fly). This way, we can start training without waiting for further data preprocessing.

 In our transform, we'll also define some data augmentations to make our model more resilient to different lighting conditions. We'll use the [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) function from `torchvision` to randomly change the brightness, contrast, saturation, and hue of the images in the batch.


 ```python
 from torchvision.transforms import ColorJitter
-from transformers import SegformerFeatureExtractor
+from transformers import SegformerImageProcessor

-feature_extractor = SegformerFeatureExtractor()
+processor = SegformerImageProcessor()
 jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)

 def train_transforms(example_batch):
     images = [jitter(x) for x in example_batch['pixel_values']]
     labels = [x for x in example_batch['label']]
-    inputs = feature_extractor(images, labels)
+    inputs = processor(images, labels)
     return inputs


 def val_transforms(example_batch):
     images = [x for x in example_batch['pixel_values']]
     labels = [x for x in example_batch['label']]
-    inputs = feature_extractor(images, labels)
+    inputs = processor(images, labels)
     return inputs


@@ -324,7 +324,7 @@ def compute_metrics(eval_pred):
         references=labels,
         num_labels=len(id2label),
         ignore_index=0,
-        reduce_labels=feature_extractor.do_reduce_labels,
+        reduce_labels=processor.do_reduce_labels,
     )

     # add per category metrics as individual key-value pairs
@@ -359,7 +359,7 @@ Now that our trainer is set up, training is as simple as calling the `train` fun
 trainer.train()
 ```

-When we're done with training, we can push our fine-tuned model and the feature extractor to the Hub.
+When we're done with training, we can push our fine-tuned model and the image processor to the Hub.

 This will also automatically create a model card with our results. We'll supply some extra information in `kwargs` to make the model card more complete.

@@ -371,7 +371,7 @@ kwargs = {
     "dataset": hf_dataset_identifier,
 }

-feature_extractor.push_to_hub(hub_model_id)
+processor.push_to_hub(hub_model_id)
 trainer.push_to_hub(**kwargs)
 ```

@@ -396,9 +396,9 @@ However, you can also try out your model directly on the Hugging Face Hub, thank
 We'll first load the model from the Hub using `SegformerForSemanticSegmentation.from_pretrained()`.

 ```python
-from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation
+from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

-feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
+processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
 model = SegformerForSemanticSegmentation.from_pretrained(f"{hf_username}/{hub_model_id}")
 ```

@@ -411,15 +411,15 @@ gt_seg = test_ds[0]['label']
 image
 ```

-To segment this test image, we first need to prepare the image using the feature extractor. Then we forward it through the model.
+To segment this test image, we first need to prepare the image using the image processor. Then we forward it through the model.

 We also need to remember to upscale the output logits to the original image size. In order to get the actual category predictions, we just have to apply an `argmax` on the logits.


 ```python
 from torch import nn

-inputs = feature_extractor(images=image, return_tensors="pt")
+inputs = processor(images=image, return_tensors="pt")
 outputs = model(**inputs)
 logits = outputs.logits  # shape (batch_size, num_labels, height/4, width/4)
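
Not part of the commit, but for readers tracing the hunks above: a minimal, self-contained sketch of what the updated inference path looks like with `SegformerImageProcessor`, including the logit-upscaling step the post describes but whose code sits outside these hunks. The random image and the public `nvidia/segformer-b0-finetuned-ade-512-512` checkpoint are stand-ins so the snippet runs on its own; they are not taken from the post.

```python
import numpy as np
import torch
from torch import nn
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

# Dummy RGB image standing in for the post's test image.
image = Image.fromarray(np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8))

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SegFormer logits come out at 1/4 of the input resolution, so upscale them
# to the original (height, width) before taking the per-pixel argmax.
upsampled = nn.functional.interpolate(
    outputs.logits,
    size=image.size[::-1],  # PIL size is (width, height)
    mode="bilinear",
    align_corners=False,
)
pred_seg = upsampled.argmax(dim=1)[0]  # (height, width) tensor of class ids
```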

fine-tune-vit.md

Lines changed: 11 additions & 12 deletions
@@ -171,29 +171,28 @@ From what I'm seeing,
 - Bean Rust: Has circular brown spots surrounded with a white-ish yellow ring
 - Healthy: ...looks healthy. 🤷‍♂️

-## Loading ViT Feature Extractor
+## Loading ViT Image Processor

 Now we know what our images look like and better understand the problem we're trying to solve. Let's see how we can prepare these images for our model!

 When ViT models are trained, specific transformations are applied to images fed into them. Use the wrong transformations on your image, and the model won't understand what it's seeing! 🖼 ➡️ 🔢

-To make sure we apply the correct transformations, we will use a [`ViTFeatureExtractor`](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTFeatureExtractor) initialized with a configuration that was saved along with the pretrained model we plan to use. In our case, we'll be using the [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) model, so let's load its feature extractor from the Hugging Face Hub.
+To make sure we apply the correct transformations, we will use a [`ViTImageProcessor`](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTImageProcessor) initialized with a configuration that was saved along with the pretrained model we plan to use. In our case, we'll be using the [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) model, so let's load its image processor from the Hugging Face Hub.


 ```python
-from transformers import ViTFeatureExtractor
+from transformers import ViTImageProcessor

 model_name_or_path = 'google/vit-base-patch16-224-in21k'
-feature_extractor = ViTFeatureExtractor.from_pretrained(model_name_or_path)
+processor = ViTImageProcessor.from_pretrained(model_name_or_path)
 ```

-You can see the feature extractor configuration by printing it.
+You can see the image processor configuration by printing it.


-ViTFeatureExtractor {
+ViTImageProcessor {
   "do_normalize": true,
   "do_resize": true,
-  "feature_extractor_type": "ViTFeatureExtractor",
   "image_mean": [
     0.5,
     0.5,
@@ -210,14 +209,14 @@ You can see the feature extractor configuration by printing it.



-To process an image, simply pass it to the feature extractor's call function. This will return a dict containing `pixel values`, which is the numeric representation to be passed to the model.
+To process an image, simply pass it to the image processor's call function. This will return a dict containing `pixel values`, which is the numeric representation to be passed to the model.

 You get a NumPy array by default, but if you add the `return_tensors='pt'` argument, you'll get back `torch` tensors instead.



 ```python
-feature_extractor(image, return_tensors='pt')
+processor(image, return_tensors='pt')
 ```

 Should give you something like...
@@ -235,7 +234,7 @@ Now that you know how to read images and transform them into inputs, let's write

 ```python
 def process_example(example):
-    inputs = feature_extractor(example['image'], return_tensors='pt')
+    inputs = processor(example['image'], return_tensors='pt')
     inputs['labels'] = example['labels']
     return inputs
 ```
@@ -263,7 +262,7 @@ ds = load_dataset('beans')

 def transform(example_batch):
     # Take a list of PIL images and turn them to pixel values
-    inputs = feature_extractor([x for x in example_batch['image']], return_tensors='pt')
+    inputs = processor([x for x in example_batch['image']], return_tensors='pt')

     # Don't forget to include the labels!
     inputs['labels'] = example_batch['labels']
@@ -399,7 +398,7 @@ trainer = Trainer(
     compute_metrics=compute_metrics,
     train_dataset=prepared_ds["train"],
     eval_dataset=prepared_ds["validation"],
-    tokenizer=feature_extractor,
+    tokenizer=processor,
 )
 ```
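
Not in the diff, just a hedged sketch of the renamed API in this file: `ViTImageProcessor` loads the same saved preprocessing configuration the old feature extractor did, and passing it to `Trainer` as `tokenizer=processor` keeps it saved and pushed alongside the model. The dummy image below stands in for an example from the `beans` dataset.

```python
import numpy as np
from PIL import Image
from transformers import ViTImageProcessor

model_name_or_path = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(model_name_or_path)

# Dummy PIL image in place of one example from the beans dataset.
image = Image.fromarray(np.random.randint(0, 256, (500, 500, 3), dtype=np.uint8))

# The processor resizes and normalizes the image into model-ready pixel values.
inputs = processor(image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```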

image-similarity.md

Lines changed: 2 additions & 2 deletions
@@ -45,11 +45,11 @@ To compute the embeddings from the images, we'll use a vision model that has som
 For loading the model, we leverage the [`AutoModel` class](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel). It provides an interface for us to load any compatible model checkpoint from the Hugging Face Hub. Alongside the model, we also load the processor associated with the model for data preprocessing.

 ```py
-from transformers import AutoFeatureExtractor, AutoModel
+from transformers import AutoImageProcessor, AutoModel


 model_ckpt = "nateraw/vit-base-beans"
-extractor = AutoFeatureExtractor.from_pretrained(model_ckpt)
+processor = AutoImageProcessor.from_pretrained(model_ckpt)
 model = AutoModel.from_pretrained(model_ckpt)
 ```
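
For completeness, a sketch (not from the commit) of how the renamed `AutoImageProcessor` slots into the embedding-extraction flow this post builds. The CLS-token pooling shown here is an assumption for illustration, not necessarily the exact pooling the post uses.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_ckpt = "nateraw/vit-base-beans"
processor = AutoImageProcessor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# Dummy image standing in for a candidate image from the dataset.
image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    # Use the [CLS] hidden state as a fixed-size image embedding.
    embedding = model(**inputs).last_hidden_state[:, 0]

print(embedding.shape)  # torch.Size([1, 768]) for this ViT-base checkpoint
```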

notebooks/111_tf_serving_vision.ipynb

Lines changed: 6 additions & 6 deletions
@@ -67,7 +67,7 @@
 },
 "outputs": [],
 "source": [
-"from transformers import ViTFeatureExtractor, TFViTForImageClassification\n",
+"from transformers import ViTImageProcessor, TFViTForImageClassification\n",
 "import tensorflow as tf\n",
 "import tempfile\n",
 "import requests\n",
@@ -288,8 +288,8 @@
 }
 ],
 "source": [
-"feature_extractor = ViTFeatureExtractor()\n",
-"feature_extractor"
+"processor = ViTImageProcessor()\n",
+"processor"
 ]
 },
 {
@@ -301,7 +301,7 @@
 "outputs": [],
 "source": [
 "CONCRETE_INPUT = \"pixel_values\"\n",
-"SIZE = feature_extractor.size\n",
+"SIZE = processor.size[\"height\"]\n",
 "INPUT_SHAPE = (SIZE, SIZE, 3)"
 ]
 },
@@ -314,7 +314,7 @@
 "outputs": [],
 "source": [
 "def normalize_img(\n",
-"    img, mean=feature_extractor.image_mean, std=feature_extractor.image_std\n",
+"    img, mean=processor.image_mean, std=processor.image_std\n",
 "):\n",
 "    # Scale to the value range of [0, 1] first and then normalize.\n",
 "    img = img / 255\n",
@@ -1609,4 +1609,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}
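
One change in this notebook goes beyond the rename: the old `ViTFeatureExtractor.size` was a single integer (224 by default), while `ViTImageProcessor.size` is a dict, which is why the diff switches to `processor.size["height"]`. A quick sketch of the difference:

```python
from transformers import ViTImageProcessor

processor = ViTImageProcessor()
print(processor.size)  # {'height': 224, 'width': 224}

# The serving code therefore reads the target resolution out of the dict.
SIZE = processor.size["height"]
INPUT_SHAPE = (SIZE, SIZE, 3)
print(INPUT_SHAPE)  # (224, 224, 3)
```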

notebooks/112_vertex_ai_vision.ipynb

Lines changed: 6 additions & 6 deletions
@@ -132,7 +132,7 @@
 }
 ],
 "source": [
-"from transformers import ViTFeatureExtractor, TFViTForImageClassification\n",
+"from transformers import ViTImageProcessor, TFViTForImageClassification\n",
 "import tensorflow as tf\n",
 "import tempfile\n",
 "import requests\n",
@@ -442,8 +442,8 @@
 }
 ],
 "source": [
-"feature_extractor = ViTFeatureExtractor()\n",
-"feature_extractor"
+"processor = ViTImageProcessor()\n",
+"processor"
 ]
 },
 {
@@ -456,7 +456,7 @@
 "outputs": [],
 "source": [
 "CONCRETE_INPUT = \"pixel_values\"\n",
-"SIZE = feature_extractor.size\n",
+"SIZE = processor.size[\"height\"]\n",
 "INPUT_SHAPE = (SIZE, SIZE, 3)"
 ]
 },
@@ -469,7 +469,7 @@
 },
 "outputs": [],
 "source": [
-"def normalize_img(img, mean=feature_extractor.image_mean, std=feature_extractor.image_std):\n",
+"def normalize_img(img, mean=processor.image_mean, std=processor.image_std):\n",
 "    # Scale to the value range of [0, 1] first and then normalize.\n",
 "    img = img / 255\n",
 "    mean = tf.constant(mean)\n",
@@ -1019,4 +1019,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
}

notebooks/56_fine_tune_segformer.ipynb

Lines changed: 13 additions & 13 deletions
@@ -1201,7 +1201,7 @@
 "id": "EobXJvy2EAQy"
 },
 "source": [
-"## Feature extractor & data augmentation"
+"## Image processor & data augmentation"
 ]
 },
 {
@@ -1210,7 +1210,7 @@
 "id": "Za3n6MH1UuDb"
 },
 "source": [
-"A SegFormer model expects the input to be of a certain shape. To transform our training data to match the expected shape, we can use `SegFormerFeatureExtractor`. We could use the `ds.map` function to apply the feature extractor to the whole training dataset in advance, but this can take up a lot of disk space. Instead, we'll use a *transform*, which will only prepare a batch of data when that data is actually used (on-the-fly). This way, we can start training without waiting for further data preprocessing.\n",
+"A SegFormer model expects the input to be of a certain shape. To transform our training data to match the expected shape, we can use `SegFormerImageProcessor`. We could use the `ds.map` function to apply the image processor to the whole training dataset in advance, but this can take up a lot of disk space. Instead, we'll use a *transform*, which will only prepare a batch of data when that data is actually used (on-the-fly). This way, we can start training without waiting for further data preprocessing.\n",
 "\n",
 "In our transform, we'll also define some data augmentations to make our model more resilient to different lighting conditions. We'll use the [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) function from `torchvision` to randomly change the brightness, contrast, saturation, and hue of the images in the batch."
 ]
@@ -1243,23 +1243,23 @@
 "source": [
 "from torchvision.transforms import ColorJitter\n",
 "from transformers import (\n",
-"    SegformerFeatureExtractor,\n",
+"    SegformerImageProcessor,\n",
 ")\n",
 "\n",
-"feature_extractor = SegformerFeatureExtractor()\n",
+"processor = SegformerImageProcessor()\n",
 "jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) \n",
 "\n",
 "def train_transforms(example_batch):\n",
 "    images = [jitter(x) for x in example_batch['pixel_values']]\n",
 "    labels = [x for x in example_batch['label']]\n",
-"    inputs = feature_extractor(images, labels)\n",
+"    inputs = processor(images, labels)\n",
 "    return inputs\n",
 "\n",
 "\n",
 "def val_transforms(example_batch):\n",
 "    images = [x for x in example_batch['pixel_values']]\n",
 "    labels = [x for x in example_batch['label']]\n",
-"    inputs = feature_extractor(images, labels)\n",
+"    inputs = processor(images, labels)\n",
 "    return inputs\n",
 "\n",
 "\n",
@@ -1488,7 +1488,7 @@
 "        references=labels,\n",
 "        num_labels=len(id2label),\n",
 "        ignore_index=0,\n",
-"        reduce_labels=feature_extractor.do_reduce_labels,\n",
+"        reduce_labels=processor.do_reduce_labels,\n",
 "    )\n",
 "    \n",
 "    # add per category metrics as individual key-value pairs\n",
@@ -1565,7 +1565,7 @@
 "id": "YlOal7giORmw"
 },
 "source": [
-"When we're done with training, we can push our fine-tuned model and the feature extractor to the Hugging Face hub.\n",
+"When we're done with training, we can push our fine-tuned model and the image processor to the Hugging Face hub.\n",
 "\n",
 "This will also automatically create a model card with our results. We'll supply some extra information in `kwargs` to make the model card more complete."
 ]
@@ -1584,7 +1584,7 @@
 "    \"dataset\": hf_dataset_identifier,\n",
 "}\n",
 "\n",
-"feature_extractor.push_to_hub(hub_model_id)\n",
+"processor.push_to_hub(hub_model_id)\n",
 "trainer.push_to_hub(**kwargs)"
 ]
 },
@@ -1645,9 +1645,9 @@
 },
 "outputs": [],
 "source": [
-"from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation\n",
+"from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation\n",
 "\n",
-"feature_extractor = SegformerFeatureExtractor.from_pretrained(\"nvidia/segformer-b0-finetuned-ade-512-512\")\n",
+"processor = SegformerImageProcessor.from_pretrained(\"nvidia/segformer-b0-finetuned-ade-512-512\")\n",
 "model = SegformerForSemanticSegmentation.from_pretrained(f\"{hf_username}/{hub_model_id}\")"
 ]
 },
@@ -1679,7 +1679,7 @@
 "id": "7m7IfMv6R3_5"
 },
 "source": [
-"To segment this test image, we first need to prepare the image using the feature extractor. Then we forward it through the model.\n",
+"To segment this test image, we first need to prepare the image using the image processor. Then we forward it through the model.\n",
 "\n",
 "We also need to remember to upscale the output logits to the original image size. In order to get the actual category predictions, we just have to apply an `argmax` on the logits."
 ]
@@ -1694,7 +1694,7 @@
 "source": [
 "from torch import nn\n",
 "\n",
-"inputs = feature_extractor(images=image, return_tensors=\"pt\")\n",
+"inputs = processor(images=image, return_tensors=\"pt\")\n",
 "outputs = model(**inputs)\n",
 "logits = outputs.logits  # shape (batch_size, num_labels, height/4, width/4)\n",
 "\n",
