**Author:** `WoongJoon Choi <https://github.com/woongjoonchoi>`_
- In this tutorial, we will embark on an exciting journey to build and
- train a VGG network from scratch using Python and popular deep learning
- libraries such as PyTorch. We will dive into the details of the VGG
+ VGG (Visual Geometry Group) is a convolutional neural network architecture that performs
+ particularly well on image classification tasks. In this tutorial, we will guide you through building
+ and training a VGG network from scratch using Python and PyTorch. We will dive into the details of the VGG
architecture, understanding its components and the rationale behind its
design.
Our tutorial is designed for both beginners who are new to deep learning
and seasoned practitioners looking to deepen their understanding of CNN
architectures.
- Before you start
+ .. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * Understand the VGG architecture and train it from scratch using PyTorch.
+       * Use PyTorch tools to evaluate the VGG model's performance.
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * Complete the `Learn the Basics tutorials <https://pytorch.org/tutorials/beginner/basics/intro.html>`__
+       * Familiarity with basic machine learning concepts and terms
+
+ If you are running this in Google Colab, install albumentations:
- .. code-block:: sh
- pip install albumentations
"""
import subprocess
import sys
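+
+ # ``albumentations`` can also be installed from inside the notebook; a sketch
+ # of the usual pattern (the environment may already provide the package):
+ subprocess.check_call([sys.executable, "-m", "pip", "install", "albumentations"])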
######################################################################
- # Worth point of this tutorial
+ # Purpose of this tutorial
# ----------------------------
#
######################################################################
- # Why VGG is so popular ?
+ # Background
# -----------------------
#
######################################################################
# VGG became a model that attracted attention because it succeeded in
# building deeper layers and dramatically shortening the training time
- # compared to ``alexnet ``, which was the SOTA model at the time.:
+ # compared to ``AlexNet``, which was the SOTA model at the time.
#
-
+ # Unlike ``AlexNet``'s larger 11x11 and 5x5 filters, VGG uses only 3x3 filters.
+ # Stacking multiple 3x3 filters covers the same receptive field as a single larger
+ # filter (two 3x3 convolutions see a 5x5 region) while using fewer parameters.
+ # In addition, because the signal passes through more nonlinear activations, the
+ # representational power of the network increases.
+ #
+ # VGG applies a max pooling layer after each group of convolutional layers to reduce the spatial size.
+ # This downsamples the feature map while preserving the important information in it.
+ # As a result, deeper layers can learn higher-level features, which also helps guard against overfitting.
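+ #
+ # As a small illustration (a sketch, not code from the paper), two stacked 3x3
+ # convolutions produce the same output shape as one 5x5 convolution while
+ # holding fewer weights:
+
+ import torch
+ from torch import nn
+
+ conv5 = nn.Conv2d(64, 64, kernel_size=5, padding=2)  # 5*5*64*64 = 102,400 weights
+ conv3x2 = nn.Sequential(
+     nn.Conv2d(64, 64, kernel_size=3, padding=1),
+     nn.ReLU(),
+     nn.Conv2d(64, 64, kernel_size=3, padding=1),  # together: 2*3*3*64*64 = 73,728 weights
+ )
+ x = torch.randn(1, 64, 32, 32)
+ assert conv5(x).shape == conv3x2(x).shape  # same output size and receptive field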
######################################################################
# VGG Configuration
######################################################################
- # We define some configurations suggested in VGG paper . Details about
+ # We define some configurations suggested in the VGG paper. Details of
# these configurations are explained in the sections below.
#
- DatasetName = 'Cifar ' # CIFAR , CIFAR10, MNIST , ImageNet
+ DatasetName = 'CIFAR'  # CIFAR, CIFAR10, MNIST, ImageNet
## model configuration
######################################################################
- # | If your GPU memory is 24GB ,The maximum batch size is 128. But, if you
- # use Colab , We recommend using 32 .
+ # | If your GPU memory is 24GB, the maximum batch size is 128. But if you
+ # use Colab, we recommend a batch size of 32.
# | You can modify the batch size according to your preference.
#
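+
+ import torch
+
+ # A sketch of picking the batch size from the available GPU memory (the
+ # tutorial's configuration below sets the value that is actually used):
+ if torch.cuda.is_available():
+     gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
+     batch_size = 128 if gpu_mem_gb >= 24 else 32
+ else:
+     batch_size = 32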
######################################################################
- # We use ``CIFAR100`` Dataset in this tutorial. In VGG paper , the authors
- # scales image ``isotropically`` . Then , they apply
- # ``Normalization``,``RandomCrop``,``HorizontalFlip`` . So , we need to override
- # CIFAR100 class to apply preprocessing.
+ # We use the ``CIFAR100`` dataset in this tutorial. In the VGG paper, the authors scale images ``isotropically``.
+ # ``Isotropic scaling`` increases the size of an image while maintaining its aspect ratio, which prevents distortion and keeps the shapes of objects consistent.
+ # They then apply ``Normalization``, ``RandomCrop``, and ``HorizontalFlip``.
+ # Normalizing the input to a common range (for example, 0 to 1) tends to speed up convergence because weight updates become more uniform across features.
+ # If RGB values are left unnormalized, the network receives features with very different ranges and cannot process them consistently; scaling all inputs to the same range lets the model treat each feature evenly, which improves performance.
+ # It is equally important that training and test data share the same range; otherwise the model may struggle to generalize to the test data and to real data.
+ # Data augmentation such as ``RandomCrop`` and ``HorizontalFlip`` helps prevent overfitting and makes the model robust in varied conditions, which is especially valuable when the dataset is small or limited.
+ # To apply this preprocessing, we override the ``CIFAR100`` class; a minimal sketch of the transform pipeline is shown below.
#
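+
+ # A minimal sketch of such a pipeline, assuming ``albumentations`` is imported
+ # as ``A`` and using commonly cited CIFAR-100 channel statistics (the values
+ # actually used by ``Custom_Cifar`` below may differ):
+
+ example_transform = A.Compose(
+     [
+         A.SmallestMaxSize(max_size=256),  # isotropic rescale: shorter side -> 256
+         A.RandomCrop(height=224, width=224),
+         A.HorizontalFlip(p=0.5),
+         A.Normalize(mean=(0.5071, 0.4865, 0.4409), std=(0.2673, 0.2564, 0.2762)),
+     ]
+ )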
class Custom_Cifar(CIFAR100):
@@ -224,10 +247,8 @@ def __getitem__(self, index: int) :
######################################################################
- # | In VGG paper, they do experiment over 6 models. model A is 11 layers,
- # model B is 13 layers, model C is 16 layers , model D is 16 layers and
- # model D is 19 layers . you can train all version of models to
- # reproduce VGG .
+ # | The VGG paper experiments with 6 models of varying layer depth. The various configurations
+ # | are enumerated below for full reproduction of the results.
# | ``Config_channels`` gives the output channels and ``Config_kernel`` gives
# the kernel sizes.
#
@@ -236,26 +257,26 @@ def __getitem__(self, index: int) :
from torch import nn
- Config_channels = {
- "A" : [64 ,"M" , 128 , "M" , 256 ,256 ,"M" ,512 ,512 ,"M" , 512 ,512 ,"M" ] ,
- "A_lrn" : [64 ,"LRN" ,"M" , 128 , "M" , 256 ,256 ,"M" ,512 ,512 ,"M" , 512 ,512 ,"M" ] ,
- "B" :[64 ,64 ,"M" , 128 ,128 , "M" , 256 ,256 ,"M" ,512 ,512 ,"M" , 512 ,512 ,"M" ] ,
- "C" : [64 ,64 ,"M" , 128 ,128 , "M" , 256 ,256 ,256 ,"M" ,512 ,512 ,512 ,"M" , 512 ,512 ,512 ,"M" ] ,
- "D" :[64 ,64 ,"M" , 128 ,128 , "M" , 256 ,256 ,256 ,"M" ,512 ,512 ,512 ,"M" , 512 ,512 ,512 ,"M" ] ,
- "E" :[64 ,64 ,"M" , 128 ,128 , "M" , 256 ,256 ,256 ,256 ,"M" ,512 ,512 ,512 ,512 ,"M" , 512 ,512 ,512 ,512 ,"M" ] ,
+ # Config_channels -> number: output channels, "M": max pooling layer
+ Config_channels = {
+     "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
+     "A_lrn": [64, "LRN", "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
+     "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
+     "C": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512, "M"],
+     "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512, "M"],
+     "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M", 512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
-
+ # Config_kernel -> kernel_size
Config_kernel = {
- "A" : [3 ,2 , 3 , 2 , 3 ,3 ,2 ,3 ,3 ,2 , 3 ,3 ,2 ] ,
- "A_lrn" : [3 ,2 ,2 , 3 , 2 , 3 ,3 ,2 ,3 ,3 ,2 , 3 ,3 ,2 ] ,
- "B" :[3 ,3 ,2 , 3 ,3 , 2 , 3 ,3 ,2 ,3 ,3 ,2 , 3 ,3 ,2 ] ,
- "C" : [3 ,3 ,2 , 3 ,3 , 2 , 3 ,3 ,1 ,2 ,3 ,3 ,1 ,2 , 3 ,3 ,1 ,2 ] ,
- "D" :[3 ,3 ,2 , 3 ,3 , 2 , 3 ,3 ,3 ,2 ,3 ,3 ,3 ,2 , 3 ,3 ,3 ,2 ] ,
- "E" :[3 ,3 ,2 , 3 ,3 , 2 , 3 ,3 ,3 ,3 ,2 ,3 ,3 ,3 ,3 ,2 , 3 ,3 ,3 ,3 ,2 ] ,
-
+     "A": [3, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2],
+     "A_lrn": [3, 2, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2],
+     "B": [3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2],
+     "C": [3, 3, 2, 3, 3, 2, 3, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1, 2],
+     "D": [3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2],
+     "E": [3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2],
}
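+
+ # Quick sanity check (a sketch): each output-channel entry should pair with a
+ # kernel-size entry, so the per-version lists must have equal lengths.
+ for version in Config_channels:
+     assert len(Config_channels[version]) == len(Config_kernel[version])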
@@ -284,7 +305,8 @@ def make_feature_extractor(cfg_c,cfg_k):
class Model_vgg(nn.Module):
-     def __init__(self, version, num_classes):
+     def __init__(self, conf_channels, conf_kernels, num_classes):
        conv_5_out_w, conv_5_out_h = 7, 7
        conv_5_out_dim = 512
        conv_1_by_1_1_outchannel = 4096
@@ -296,7 +318,7 @@ def __init__(self,version , num_classes):
        self.except_xavier = except_xavier

        super().__init__()
-         self.feature_extractor = make_feature_extractor(Config_channels[version], Config_kernel[version])
+         self.feature_extractor = make_feature_extractor(conf_channels, conf_kernels)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.output_layer = nn.Sequential(
            nn.Conv2d(conv_5_out_dim, conv_1_by_1_1_outchannel, 7),
@@ -354,17 +376,12 @@ def _init_weights(self,m):
            nn.init.constant_(m.bias, 0)
- ######################################################################
- # Parameter Initialization
- # ~~~~~~~~~~~~~~~~~~~~~~~~
- #
-
######################################################################
- # When training VGG , the authors first train model A , then initialized
- # the weights of other models with the weights of model A . Waiting for
+ # When training VGG, the authors first train model A, then continue training from
+ # the resultant weights for the other variants. Waiting for
# model A to be trained takes a long time. The authors mention how to
- # train with ``xavier `` initialization rather than initializing with the
+ # train with ``Xavier`` initialization rather than initializing with the
# weights of model A. However, they do not describe that initialization in detail.
#
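+
+ # As a small illustration (a sketch, separate from the tutorial's own
+ # ``_init_weights`` above), ``Xavier`` initialization in PyTorch looks like this:
+
+ example_layer = nn.Linear(512, 10)
+ nn.init.xavier_uniform_(example_layer.weight)
+ nn.init.constant_(example_layer.bias, 0)
+
+ ######################################################################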
# | To reproduce VGG, we use the ``Xavier`` initialization method to initialize
@@ -418,7 +435,7 @@ def accuracy(output, target, topk=(1,)):
#
model_version = 'B'
- model = Model_vgg(model_version, num_classes)
+ model = Model_vgg(Config_channels[model_version], Config_kernel[model_version], num_classes)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay, momentum=momentum)
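+ # The paper trains with SGD using momentum 0.9, weight decay 5e-4, and an
+ # initial learning rate of 1e-2; ``lr``, ``weight_decay``, and ``momentum``
+ # here come from the configuration defined earlier.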
@@ -575,7 +592,7 @@ def accuracy(output, target, topk=(1,)):
# -------------------
#
- class Cusotm_ImageNet(ImageNet):
+ class Custom_ImageNet(ImageNet):
    def __init__(self, root, transform=None, multi=False, s_max=None, s_min=256, split=None, val=False):

        self.multi = multi
@@ -615,7 +632,6 @@ def __getitem__(self, index: int) :
        if img.mode == 'L': img = img.convert('RGB')
        img = np.array(img, dtype=np.float32)

-         # print(self.transform)

        if self.transform is not None:
            img = self.transform(image=img)
@@ -631,8 +647,8 @@ def __getitem__(self, index: int) :
        return img, target

if DatasetName == 'ImageNet':
-     train_data = Cusotm_ImageNet(root='ImageNet', split='train')
-     val_data = Cusotm_ImageNet('ImageNet', split='val', val=True)
+     train_data = Custom_ImageNet(root='ImageNet', split='train')
+     val_data = Custom_ImageNet('ImageNet', split='val', val=True)
    val_data.val = True
    val_data.s_min = test_min
    val_data.transform = A.Compose(
@@ -647,14 +663,15 @@ def __getitem__(self, index: int) :
######################################################################
# Conclusion
# ----------
- # We have seen how ``pretraining`` VGG from scratch . This Tutorial will be helpful to reproduce another Foundation Model .
+ # We have seen how to pretrain VGG from scratch.
+ # This tutorial can serve as a helpful template for reproducing other foundation models.
######################################################################
# More things to try
# ------------------
- # - Trying On ImageNet
- # - Try All version of Model
- # - Try All evaluation method in VGG paper
+ # - Apply the model to ImageNet
+ # - Try all model variants
+ # - Try the additional evaluation methods from the VGG paper
######################################################################